std::bitset plus -msse4.2 equals popcnt
My research group, the Digital Evolution Lab at Michigan State University, uses computer programs as a model organism to study evolution (and, also, evolution as tool to create computer programs). In recent work with the event-driven SignalGP representation, we need to wire together an evolving computer program’s sub-modules in order to run it. We figure out what wires together by matching bitstring tags. Because we calculate the matches between a lot of bitstring tags in our experiments, it’s important for this operation to be as efficient as possible.
In practice, calculating tag matches involves performing some operation on the two tags (e.g., bitwise xor) and then counting the number of 1’s in the resulting bitstring.
Under the hood, we use uint64_t
’s and uint_32_t
’s to represent components of our bitstrings.
So, we want to be able to the count 1’s in those types quickly.
Luckily, x86 CPUs provide a popcnt
machine-level instruction as part of the SSE 4.2 instruction set, which does all the work to the count 1’s in a register in just one clock cycle.
However, directly writing popcnt
into our source code was pretty crufty.
Here’s an idea of what that looks like.
This snippet was pulled from an older version of our lab’s Empirical C++ library.
#include <nmmintrin.h>
...
#ifdef __SSE4_2__
bit_count += _mm_popcnt_u64(bit_set[i]); // needs CPU with SSE4.2
#else
// adapted from https://en.wikipedia.org/wiki/Hamming_weight
uint64_t x = bit_set[i];
// put count of each 2 bits into those 2 bits
x -= (x >> 1) & 0x5555555555555555;
// put count of each 4 bits into those 4 bits
x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333);
//put count of each 8 bits into those 8 bits
x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0f;
// returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ...
bit_count += (x * 0x0101010101010101) >> 56;
#endif
Up top, we include <nmmintrin.h>
.
This header isn’t part of the C++ standard, so we’re potentially setting ourselves up for a big headache if we try to compile on a system where the header isn’t available.
Then, we have to use a preprocessor directive (a.k.a. a macro) to separately handle cases where SSE 4.2 instructions are and are not available. This means we have to maintain twice the code! This is especially annoying because the non-SSE 4.2 code involves some hardcore inscrutable bitmagic.
To boot, macros are increasingly considered harmful a la goto
(or at least yucky) in modern C++ so we’d rather do without them anyways.
What to do?
🔗 🐤 Peep the Intermediate Representation
The C++ standard library provides a class called std::bitset
that represents a bit sequence and provides a count
member function that reports the number of 1’s the sequence contains… exactly what we want!
It would be really slick to be able to shove the bits we want to count into a std::bitset
and then call count
, but will it go fast?
Will we have to pay overhead to construct a std::bitset
from a uint32_t
or a uint64_t
?
Will std::bitset
’s count
method use a popcnt
machine instruction or rely on a slower method?
Let’s find out!
Here’s a Compiler Explorer example that shows what wrapping a uint
in std::bitset
and then calling count
compiles down to with Clang.
Note, in particular, the intermediate representation code on the right that corresponds to countbits32
and countbits64
.
There’s no overhead with a uint64_t
(just a plain ol’ popcnt
and then a return
).
For uint32_t
we have to execute a single mov
instruction in addition to the popcnt
, which isn’t too bad at all.
Success!
Here’s the same example, but compiled with GCC. We get similar results.
The key, it turns out, to getting popcnt
from std::bitset::count
is compiling with the -mmsse4.2
flag.
Try taking it out of the examples above.
You should be able to see the intermediate representation that the source compiles to balloon up.
Although this code won’t be as fast if SSE 4.2 instructions aren’t available, we can depend on the standard library implementors to provide a correct (as well as reasonably performant) implementation of std::bitset::count
so our code at least doesn’t break.
We wouldn’t have been able to use popcnt
anyways if SSE 4.2 instructions weren’t available, so we’re not losing much, if anything, in this case.
🔗 ✨ Source Code Transformation ✨
The counterpart to the big, ugly code snippet above after we started using std::bitset
really shows off how much of our source’s complexity we were able to hand off to the standard library implementors.
This snippet was pulled from an newer version of our lab’s Empirical C++ library.
// when compiling with -O3 and -msse4.2,
// this compiles to a hardware POPCNT instruction
std::bitset<FIELD_BITS> std_bs(bit_set[i]);
bit_count += std_bs.count();
All without sacrificing performance… nice!
🔗 Shout Out
… to Santiago Rodriguez-Papa @rodsan0, who suggested the -msse4.2
compiler flag and has been hard at work maintaining Empirical’s BitSet
and BitVector
.
Thank you!
🔗 Let’s Chat
Comments? Questions?
I started a twitter thread (right below) so we can chat!! ☎️ ☎️ ☎️
& of course I threw my hat in the ring to share how the -mmsse4.2 flag can help you compile down your std::bitset operations to a single machine-level instruction 🤦 🤷https://t.co/HWGyaT5Dpo
— Matthew A Moreno (@MorenoMatthewA) June 15, 2020