I like benchmarking, mainly because you learn something new about a language, a stack or an approach. So, let's lead with that goal.
This time we explore differences in implementing string comparison in NodeJS. In particular, we test three different ways to find vowels in a string. We learn that for short strings an if-statement greatly outperforms NodeJS's indexOf() method, while for longer strings indexOf() is possibly the way to go.
Our test subject
Our main test subject is a quote from Dan Simmons's Hyperion Cantos:
"In the beginning was the Word. Then came the fucking word processor. Then came the thought processor. Then came the death of literature. And so it goes."
Our test environment:
- Node v4.2.1
- 7.7 GiB Memory
- Intel® Core™ i5-3210M CPU @ 2.50GHz × 4
- Ubuntu 15.04 (64-bit)
The indexOf() method of the Array prototype returns the first index at which a given element can be found in the array, or -1 if it is not present. Our implementation looks like this:
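The original snippet is not reproduced here, so below is a minimal sketch of what such an implementation might look like (the function name countVowelsIndexOf is an assumption, not the author's code):

```javascript
// Sketch: count the vowels in a string using Array.prototype.indexOf()
var vowels = ['a', 'e', 'i', 'o', 'u'];

function countVowelsIndexOf(text) {
  var count = 0;
  for (var i = 0; i < text.length; i++) {
    // indexOf() returns -1 when the character is not in the vowels array
    if (vowels.indexOf(text[i]) !== -1) {
      count++;
    }
  }
  return count;
}
```

The benchmark would then call this function repeatedly against the test quote.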
This test takes 169,350 milliseconds (average of 10 tests). Let's see if we can optimize this code.
Using binary search
We can potentially improve the time it takes to search the vowel from the vowels array. We can do this by binary-searching for the vowel. In theory, this should improve the time it takes to compare each character. Let's give it a try:
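Since the vowels array is already sorted, a recursive binary search can halve the search space on each step. A sketch of this approach (the exact shape of the author's binarySearch() is an assumption):

```javascript
var vowels = ['a', 'e', 'i', 'o', 'u'];

// Recursively search a sorted array for value between indices low and high.
// Returns the index of value, or -1 if it is not present.
function binarySearch(arr, value, low, high) {
  if (low > high) {
    return -1; // search space exhausted: not found
  }
  var mid = Math.floor((low + high) / 2);
  if (arr[mid] === value) {
    return mid;
  }
  if (arr[mid] < value) {
    return binarySearch(arr, value, mid + 1, high); // search upper half
  }
  return binarySearch(arr, value, low, mid - 1); // search lower half
}

function countVowelsBinary(text) {
  var count = 0;
  for (var i = 0; i < text.length; i++) {
    if (binarySearch(vowels, text[i], 0, vowels.length - 1) !== -1) {
      count++;
    }
  }
  return count;
}
```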
This approach took 177,990 milliseconds (average of 10 tests). Damn... We have, in fact, made our approach worse!
Maybe the implementation of indexOf() is more efficient. Taking a look at the implementation of string search in NodeJS we might think that NodeJS has replaced a naive search with a more efficient solution in V8.
Well, look closer. This optimization only comes into play if the pattern is longer than 8 characters. In that case, NodeJS will first search for the pattern at the beginning of the string; if it is not found there, it will use a Boyer-Moore-Horspool approach. If the pattern is 8 characters or fewer, it uses a naive, linear search. Our pattern (a vowel) is only one character long, so we default to a naive search every time.
So, what the hell is going on? The linear search used by indexOf() is compiled and runs lower in the NodeJS stack, making it more efficient than our binary search implementation. Our binarySearch() code does a lot of work in terms of its recursive calls, object creation and evaluation, which introduces space and time overhead higher in the stack that is completely unnecessary for a short string such as "aeiou".
Using an if-statement
So, can we do better? Of course we can! Because we are testing against only a handful of cases ("a", "e", "i", "o", "u"), we can leverage the speed of NodeJS's compiled evaluation and, ugly as it seems, use an if-statement:
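A sketch of the if-statement version (again, the function name is an assumption; a chain of strict comparisons stands in for whatever exact form the original used):

```javascript
// Sketch: compare each character directly against the five vowels
function countVowelsIf(text) {
  var count = 0;
  for (var i = 0; i < text.length; i++) {
    var c = text[i];
    if (c === 'a' || c === 'e' || c === 'i' || c === 'o' || c === 'u') {
      count++;
    }
  }
  return count;
}
```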
This approach takes 62,382 milliseconds (average of 10 tests), a massive improvement.
What about longer strings?
So far, we have found that the fastest approach is to use an if-statement. But how do our three approaches perform on longer strings? Let's see.
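One simple way to build the longer test strings is to repeat the quote until a target length is reached. This harness is an assumption, not the author's benchmark code:

```javascript
var quote = 'In the beginning was the Word. Then came the fucking word ' +
  'processor. Then came the thought processor. Then came the death of ' +
  'literature. And so it goes.';

// Repeat the quote until the string reaches at least targetLength characters.
function buildTestString(targetLength) {
  var text = '';
  while (text.length < targetLength) {
    text += quote;
  }
  return text;
}

var longText = buildTestString(200000); // roughly 200K characters
```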
For short strings, an if-statement can be the way to go in terms of performance.
As the test string becomes larger (e.g. ~200K characters) the difference between using an if-statement or indexOf() becomes less important compared to our binary search implementation.
Even longer strings (e.g. ~1 million characters) make this difference more trivial. Notice, however, that an if-statement will still perform slightly better.
As the test string becomes larger, the differences between using an if-statement or indexOf() become irrelevant, simply because we approach the expected O(n) cost of evaluating each character in the string being tested. In the case of our binary search implementation, the space and time overhead grows with the string, making matters even worse.
What are the lessons?
For the moment, stick with indexOf() for similar cases and go grab a beer.
The language you use to implement your application matters. A smart-ass improvement to your code might actually do more harm than good. This is more common in interpreted languages, because the code runs on many layers. In such cases, a naive implementation lower in the stack might be more efficient than a theoretically better implementation built at a higher level.