Comparing C++ range libraries for filter+reverse case with non-trivial lambda

Comparing C++ range libraries for filter+reverse case with non-trivial lambda

C++ ranges are awesome as they often make the code simpler and easier to reason about. But how they compare against each other? To check that I did a micro-benchmark of c++ ranges for one particular use case I encountered in practice.

The scenario I tested was as follows:

I have an input data set that needs to be filtered based on non-trivial lambda.
Only a subset of this data is needed.
Then an index is added to the elements.
The last step is to reverse the sequence.

The repo with source code is here https://github.com/serpent7776/cpp-ranges-bench

It’s basically filter+take+enumerate+reverse, but there are two tricky things here:

The non-triviality of the lambda being used for filtering. In the example code, such a lambda is often a trivial function. What happens when the cost of invoking such lambda is not not negligible?
Filter+reverse is a tricky combination. Filtering yields a result set of unknown size, which means that subsequent reverse cannot be done easily in the general case.

The ranges library needs to take both of these situations into account, otherwise it might introduce noticeable overhead.

Let’s check it out.

The code uses catch2 to test for correctness and performance.
Catch2 is a unit testing framework for C++, but it also provides basic micro-benchmarking features.
Unfortunately, it doesn’t provide any way to visualise the results, so I created a script and submitted PR adding plotting capabilities to Catch.

Tested implementations

I have reviewed the following implementations:

clike – a simple C-like implementation. This still uses c++ vector and its methods, but it’s just a for loop with if statements. It collects the result in a vector and when it’s done, it reverses it in-place.
algorithms – Manual implementation using STL algorithms. Instead of operating on the elements themselves, it allocates an extra vector with indices and operates on that. It only copies the elements in the last step.
boost_adaptors – Uses boost::adaptors. It provides very minimal feature set, but it’s surprisingly capable. Due to the limitations of adaptors library, it allocates an extra vector of filtered elements and operates on that.
rangesv3 – Uses ranges-v3 – range library for C++14/17/20. Pretty close to the implementation using std::ranges, but a bit more complex, because it distinguishes between actions and views. This method uses the reverse action with views and to_vector, which might allocate an extra vector.
stdranges – Uses C++23 ranges. This is the shortest and cleanest implementation. The compiler I was using didn’t have to implemented, so I had to roll my own. This may have affected the results.
fluxranges – Uses flux, a C++20 library for sequence-oriented programming. Quite noisy compared to other range implementations.

I used the following versions of the libraries:

ranges-v3 0.12
fluxranges faf2e0f8b9f616001890a87b613f47313443f8e5
g++ stdlib, g++ (GCC) 13.2.1 20230801
All compiled in C++23 mode with the -O2 flag.

Test cases

I have generated some random data using the faker python library. The data is a set of usernames with an id and a set of connected names. I will search for usernames that match some criteria which involves looking through the connected names. This should be enough to consider the filter criteria as non-trivial.
To do the comparisons I have generated 10 000 rows of data.

The test cases I will check are:

all ismiths – select all items that have “ismith” as connection. There are 15 such entries in total.
5 ismiths – select at most 5 items that have “ismith” as connection.
empty result set – searches for an item that doesn’t exist, yielding no results.
early single item – searches for all items that have “henry79” as connection. There’s only one such item, early in the sequence.
late single item – search for all items that have “emilymclaughlin” as connection. There’s only one such item, late in the sequence.
every other item – accept every other item, while also performing some operation on the input data.

Results

In all cases flux was noticeably slower than other methods.
Ranges-v3 was generally on par with other solutions. There are two cases where it’s slower: ‘all ismiths’ and ‘early single item’.
boost adaptors are overall quite good. In ‘all ismiths’, ’empty result set’ and ‘late single item’ are on par with fastest implementation and just slightly slower in other cases.
C-like solution was always on par or better than other solutions.

Some more details

Let’s try to see some more details. I will look at the ‘early single item’ test. This case is interesting, because there are big differences between the implementations.

First, let’s run the test case for a few more times. I’ll make the test cases run 1000 times by adding this code to each implementation section.

auto _ = GENERATE(range(1, 1000));

This uses Catch2 data generator to execute the section multiple times, each time with a different value. In this case I’m not interested in the value itself, I’m only doing this to run the section again.

Now let’s run perf stat on it. To collect the data I’ll will only run this single test case and only one implementation at a time, ignoring the benchmarks. I do this by invoking the compiled binary with the test-case name, the -c option with a section name that specifies which implementation to use and the –skip-benchmarks option so that no benchmarks are run. You can read more about Catch2 command line usage.
perf stat is invoked with -o stats –append, which appends the perf data to stats file. Read more about perf stat usage.

: > stats # truncate stats file
for t in clike algorithms boost_adaptors rangesv3 stdranges fluxranges; do
perf stat -o stats –append ./read “early single item” -c $t –skip-benchmarks ;
done

I will then convert the stats file with a perf-stat-to-sc script I wrote. It will create a sc-im spreadsheet, which will make it easier to compare the results.

implementation
context_switches
cpu_migrations
page_faults
cycles
instructions
insn_per_cycle
branches
branch_misses
backend_bound
bad_speculation
frontend_bound
retiring
real
user
sys

stdranges
0
0
50,796
11,279,370,035
29,524,803,591
29,524,803,591
7,633,968,112
106,521,026
5.9
19.5
28.8
45.8
3.646195512
3.502992000
0.136365000

clike
0
0
1,411
10,338,071,101
28,584,460,227
28,584,460,227
7,385,993,023
86,580,500
5.8
16.5
28.8
48.9
3.319227798
3.255089000
0.056539000

boost_adaptors
0
0
1,398
11,255,210,775
29,537,931,422
29,537,931,422
7,634,756,489
106,986,997
4.4
26.6
26.7
42.4
3.533653076
3.448212000
0.066610000

fluxranges
0
0
252,772
13,048,628,986
33,858,601,153
33,858,601,153
8,604,259,911
123,154,769
7.3
20.8
26.5
45.5
4.400705838
4.032191000
0.359219000

rangesv3
0
0
1,395
12,079,299,571
30,489,816,925
30,489,816,925
7,883,204,145
125,824,949
4.5
29.1
25.8
40.6
3.805896934
3.701721000
0.082913000

algorithms
0
0
7,620
11,329,769,168
29,617,205,033
29,617,205,033
7,644,573,504
106,472,532
6.1
18.8
28.8
46.3
3.656246734
3.548037000
0.099826000

The first thing to notice is the difference in the number of page faults. c-like, rangesv3 and boost adaptors had about 1400 faults. For custom algorithms it was ~7600. For std::ranges the number of faults was an order of magnitude larger (~50,000). For flux it was another order of magnitude larger (~250,000). The differences in cycles and instructions are quite small. It’s a bit bigger for flux though. The C-like implementation executed slightly fewer instructions than the other implementations. The number of branches is also similar, slightly higher for flux. There are bigger differences in branch misses. The C-like implementation has the least number of misses. Flux and rangesv3 have the most misses. Flux and std::ranges have spent more time in the kernel.

But this collects stats from the entire program run. The issue is that most of the time is spent on reading and parsing the input file, which is the same in all cases. I want to collect stats only for the single implementation function. For this I can use valgrind’s cachegrind with additional code instrumentation.
I will add CACHEGRIND_START_INSTRUMENTATION and CACHEGRIND_STOP_INSTRUMENTATION just before and after the implementation function I’m interested in and then run the binary with –instr-at-start=no which will disable instrumentation at startup and will wait for the first CACHEGRIND_START_INSTRUMENTATION to enable it.
Read more about cachegrind usage.

SECTION(“clike”) {
auto _ = GENERATE(range(1, 1000));
CACHEGRIND_START_INSTRUMENTATION;
const std::vector<Out> found = clike(data, accept, max_items);
CACHEGRIND_STOP_INSTRUMENTATION;
REQUIRE(found == expected);
}
valgrind –tool=cachegrind –instr-at-start=no ./read ‘early single item’ -c fluxranges

This is the data I got:

implementation
instructions executed

clike
246 384

algorithms
416 548 909

boost_adaptors
336 002 843

rangesv3
671 686 872

stdranges
336 037 633

fluxranges
3 241 270 519

This confirms that C-like executes far fever instructions than other implementations and flux ranges, by far, the most.

Conclusions

Ranges are really cool. They often make the code easier to read and reason about. But they do not always fit all the use cases. Sometimes, in some very specific cases, a for loop might give more control that is needed to get some better results. In general though, it’s usually better to start with a ranges, as they will be good enough in most cases.

Please follow and like us:
Pin Share