Loop Unrolling in JavaScript?

JavaScript can feel very removed from the hardware it runs on, but thinking low-level can still be useful in limited cases.

A recent post by Kafeel Ahmad on loop optimization detailed a number of loop performance improvement techniques. That article got me thinking about the topic.

Premature Optimization

Just to get this out of the way: this is a technique very few of us will ever need to consider in web development. Focusing on optimization too early can also make code harder to write and much harder to maintain. Still, taking a peek at low-level techniques can give us insight into our tools and our work in general, even if we can’t apply that knowledge directly.

What is Loop Unrolling?

Loop unrolling basically duplicates the logic inside a loop so you perform multiple operations during each, well, loop. In specific cases, making the code in the loop longer can make it faster.

By intentionally performing some operations in groups rather than one-by-one, the computer may be able to operate more efficiently.

Unrolling Example

Let’s take a very simple example: summing values in an array.

// 1-to-1 looping
const simpleSum = (data) => {
  let sum = 0;
  for (let i = 0; i < data.length; i += 1) {
    sum += data[i];
  }
  return sum;
};

// Unrolled looping: two independent sums, two elements per iteration
const parallelSum = (data) => {
  let sum1 = 0;
  let sum2 = 0;
  const limit = data.length - (data.length % 2);
  for (let i = 0; i < limit; i += 2) {
    sum1 += data[i];
    sum2 += data[i + 1];
  }
  // Pick up the final element when the length is odd
  if (limit < data.length) sum1 += data[limit];
  return sum1 + sum2;
};

This may look very strange at first. We’re managing more variables and performing additional operations that don’t happen in the simple example. How can this be faster?!
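
Before we measure anything, a quick check with a small made-up array confirms the two versions at least agree, including on odd lengths:

const values = [1, 2, 3, 4, 5];
console.log(simpleSum(values));   // 15
console.log(parallelSum(values)); // 15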

Measuring the Difference

I ran some comparisons over a variety of data sizes and multiple runs, with both sequential and interleaved tests. The parallelSum performance varied but was almost always better, apart from some odd results at very small data sizes. I tested this using RunJS, which is built on Chrome’s V8 engine.

Different data sizes gave very roughly these results:

Small (< 10k): Rarely any difference
Medium (10k-100k): Typically ~20-80% faster
Large (> 1M): Consistently twice as fast
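
If you want to reproduce this yourself, a minimal harness along these lines will do. This is my own sketch, not the actual RunJS or JSPerf runner, and benchmark is a hypothetical helper:

// Time one function; best-of-N reduces noise from GC and JIT warm-up
const benchmark = (fn, data, runs = 20) => {
  let best = Infinity;
  for (let r = 0; r < runs; r += 1) {
    const start = performance.now();
    fn(data);
    best = Math.min(best, performance.now() - start);
  }
  return best;
};

const big = Float64Array.from({ length: 1_000_000 }, () => Math.random());
console.log('simpleSum:  ', benchmark(simpleSum, big).toFixed(2), 'ms');
console.log('parallelSum:', benchmark(parallelSum, big).toFixed(2), 'ms');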

Then I created a JSPerf with 1 million records to try across different browsers. Try it yourself!

Chrome ran parallelSum twice as fast as simpleSum, as expected from the RunJS testing.

Safari’s results were almost identical to Chrome’s, both in relative speedup and in operations per second.

Firefox on the same system performed almost the same for simpleSum, but its parallelSum was only about 15% faster, not twice as fast.

This variation sent me looking for more information. While it’s nothing definitive, I found a StackOverflow comment from 2016 discussing some of the JS engine issues with loop unrolling. It’s an interesting look at how engines and optimizations can affect code in ways we don’t expect.

Variations

I tried a third version as well, which added the two values in a single operation to see if there was a noticeable difference between using one accumulator variable and two.

// Unrolled looping: one accumulator, two elements per iteration
const parallelSumSingle = (data) => {
  let sum = 0;
  const limit = data.length - (data.length % 2);
  for (let i = 0; i < limit; i += 2) {
    sum += data[i] + data[i + 1];
  }
  if (limit < data.length) sum += data[limit];
  return sum;
};

Short answer: No. The two “parallel” versions were within the reported margin of error of each other.

So, How Does it Work?

While JavaScript is single-threaded, the interpreters, compilers, and hardware underneath can perform optimizations for us when certain conditions are met.

In the simple example, each operation needs the value of i to know which data to fetch, and it needs the latest value of sum to update. Because both of these change on every iteration, the computer has to wait for one pass through the loop to complete before it can begin the next. While it may seem obvious to us what i += 1 will do, the computer mostly understands “the value will change, check back later,” so it has difficulty optimizing.

Our parallel versions load two data entries for each value of i. Each sum still depends on its own previous value, but the two sums are independent of each other, so the computer can load and process twice as much data per iteration. That doesn’t mean it runs twice as fast, though.
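
To make those dependency chains visible, here is one pass of each version written out by hand. This is an illustration, not benchmark code:

const data = [1, 2, 3, 4];

// simpleSum view: one long chain; each add needs the previous result
let sum = 0;
sum += data[0]; // must complete before the next line can begin
sum += data[1];
sum += data[2];
sum += data[3];

// parallelSum view: two independent chains the CPU can run side by side
let sum1 = 0;
let sum2 = 0;
sum1 += data[0]; sum2 += data[1]; // no dependency between these two adds
sum1 += data[2]; sum2 += data[3];
const total = sum1 + sum2; // the chains only meet here, at the end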

Deeper Dive

To understand why loop unrolling works, we look to the low-level operation of a computer. Processors with superscalar architectures can have multiple pipelines to perform simultaneous operations. They can support out-of-order execution, so operations that don’t depend on each other can happen as soon as possible. For some operations, SIMD can perform one action on multiple pieces of data at once. Beyond that we start getting into caching, data fetching, and branch prediction…

But this is a JavaScript article! We’re not going that deep. If you want to know more about processor architectures, Anandtech has some excellent Deep Dives.

Limits and Drawbacks

Loop unrolling is not magic. There are limits and diminishing returns that appear because of program or data size, operation complexity, computer architecture, and more. But we’ve only tested one or two operations per iteration, and modern processors can often execute four or more at once.

To try some larger increments, I made another JSPerf processing 1, 2, 4, and 10 records per iteration and ran it on an Apple M1 Max MacBook Pro running macOS 14.5 Sonoma and an AMD Ryzen 9 3950X PC running Windows 11.
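
The 4-record variant follows the same pattern scaled up. This is my reconstruction rather than the exact JSPerf code, and parallelSum4 is my own name for it:

// Hypothetical 4-record variant: four independent sums per iteration
const parallelSum4 = (data) => {
  let s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  const limit = data.length - (data.length % 4);
  for (let i = 0; i < limit; i += 4) {
    s0 += data[i];
    s1 += data[i + 1];
    s2 += data[i + 2];
    s3 += data[i + 3];
  }
  // Sum any leftover elements one at a time
  for (let i = limit; i < data.length; i += 1) s0 += data[i];
  return s0 + s1 + s2 + s3;
};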

On the Mac, ten records at a time was 2.5-3.5x faster than the base loop, but only 12-15% faster than processing four records. On the PC we still saw the 2x improvement from one record to two, but ten records was just 2% faster than four, which I would not have predicted for a 16-core processor.

Platforms and Updates

These differing results remind us to be careful with optimization. Optimizing for your own computer could create a worse experience on less-capable, or just different, hardware. Performance and functionality issues on older or entry-level hardware are a common problem when developers work on fast, powerful machines, and fixing them is something I’ve been tasked with multiple times in my career.

For some performance scale: a currently available entry-level Chromebook from HP has an Intel Celeron N4120 processor, roughly equivalent to the Core i5-4250U in my 2013 MacBook Air and just one-ninth the performance of the M1 Max in a synthetic benchmark. On that 2013 MacBook Air, running the latest version of Chrome, the 4-record function was faster than the 10-record one, but still only 60% faster than the single-record function!

Browsers and standards are constantly changing, too. A routine browser update or a different processor architecture could make optimized code slower than a regular loop. When you find yourself deeply optimizing, you may need to ensure your optimization is relevant to your consumers, and that it stays relevant.

It reminds me of the book High Performance JavaScript by Nicholas Zakas, which I read back in 2012. It was a great book and contained a lot of insight. However, by 2014 a number of the significant performance issues identified in the book had been resolved or substantially reduced by browser engine updates, and we were able to focus more effort on writing maintainable code.

If you are trying to stay on the edge of performance optimization, be prepared for change and regular validation.

Lessons from the Past

While researching this topic I came across a Linux Kernel Mailing List thread from the year 2000 about removing some loop-unrolling optimizations, a change that ultimately improved performance. It included this still-relevant point (emphasis mine):

The bottom line is that our intuitive assumptions of what’s fast and what isn’t can often be wrong, especially given how much CPU’s have changed over the past couple of years.
Theodore Ts’o

Conclusion

There are times you may need to squeeze performance out of a loop, and if you are processing enough items, this could be one of the ways you do that. It’s good to know about these kinds of optimizations, but for most work, You Aren’t Gonna Need It™.

Still, I hope you’ve enjoyed my rambling, and that maybe in the future your memory will be jogged about performance optimization considerations.

Thanks for reading!
