std::bitset plus -msse4.2 equals popcnt
My research group, the Digital Evolution Lab at Michigan State University, uses computer programs as a model organism to study evolution (and, also, evolution as tool to create computer programs). In recent work with the event-driven SignalGP representation, we need to wire together an evolving computer program’s sub-modules in order to run it. We figure out what wires together by matching bitstring tags. Because we calculate the matches between a lot of bitstring tags in our experiments, it’s important for this operation to be as efficient as possible.
In practice, calculating tag matches involves performing some operation on the two tags (e.g., bitwise xor) and then counting the number of 1’s in the resulting bitstring.
Under the hood, we use
uint_32_t’s to represent components of our bitstrings.
So, we want to be able to the count 1’s in those types quickly.
Luckily, x86 CPUs provide a
popcnt machine-level instruction as part of the SSE 4.2 instruction set, which does all the work to the count 1’s in a register in just one clock cycle.
However, directly writing
popcnt into our source code was pretty crufty.
Here’s an idea of what that looks like.
This snippet was pulled from an older version of our lab’s Empirical C++ library.
#include <nmmintrin.h> ... #ifdef __SSE4_2__ bit_count += _mm_popcnt_u64(bit_set[i]); // needs CPU with SSE4.2 #else // adapted from https://en.wikipedia.org/wiki/Hamming_weight uint64_t x = bit_set[i]; // put count of each 2 bits into those 2 bits x -= (x >> 1) & 0x5555555555555555; // put count of each 4 bits into those 4 bits x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333); //put count of each 8 bits into those 8 bits x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0f; // returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ... bit_count += (x * 0x0101010101010101) >> 56; #endif
Up top, we include
This header isn’t part of the C++ standard, so we’re potentially setting ourselves up for a big headache if we try to compile on a system where the header isn’t available.
Then, we have to use a preprocessor directive (a.k.a. a macro) to separately handle cases where SSE 4.2 instructions are and are not available. This means we have to maintain twice the code! This is especially annoying because the non-SSE 4.2 code involves some hardcore inscrutable bitmagic.
What to do?
🔗 🐤 Peep the Intermediate Representation
The C++ standard library provides a class called
std::bitset that represents a bit sequence and provides a
count member function that reports the number of 1’s the sequence contains… exactly what we want!
It would be really slick to be able to shove the bits we want to count into a
std::bitset and then call
count, but will it go fast?
Will we have to pay overhead to construct a
std::bitset from a
uint32_t or a
count method use a
popcnt machine instruction or rely on a slower method?
Let’s find out!
Here’s a Compiler Explorer example that shows what wrapping a
std::bitset and then calling
count compiles down to with Clang.
Note, in particular, the intermediate representation code on the right that corresponds to
There’s no overhead with a
uint64_t (just a plain ol’
popcnt and then a
uint32_t we have to execute a single
mov instruction in addition to the
popcnt, which isn’t too bad at all.
Here’s the same example, but compiled with GCC. We get similar results.
The key, it turns out, to getting
std::bitset::count is compiling with the
Try taking it out of the examples above.
You should be able to see the intermediate representation that the source compiles to balloon up.
Although this code won’t be as fast if SSE 4.2 instructions aren’t available, we can depend on the standard library implementors to provide a correct (as well as reasonably performant) implementation of
std::bitset::count so our code at least doesn’t break.
We wouldn’t have been able to use
popcnt anyways if SSE 4.2 instructions weren’t available, so we’re not losing much, if anything, in this case.
🔗 ✨ Source Code Transformation ✨
The counterpart to the big, ugly code snippet above after we started using
std::bitset really shows off how much of our source’s complexity we were able to hand off to the standard library implementors.
This snippet was pulled from an newer version of our lab’s Empirical C++ library.
// when compiling with -O3 and -msse4.2, // this compiles to a hardware POPCNT instruction std::bitset<FIELD_BITS> std_bs(bit_set[i]); bit_count += std_bs.count();
All without sacrificing performance… nice!
🔗 Shout Out
… to Santiago Rodriguez-Papa @rodsan0, who suggested the
-msse4.2 compiler flag and has been hard at work maintaining Empirical’s
🔗 Let’s Chat
I started a twitter thread (right below) so we can chat!! ☎️ ☎️ ☎️
& of course I threw my hat in the ring to share how the -mmsse4.2 flag can help you compile down your std::bitset operations to a single machine-level instruction 🤦 🤷https://t.co/HWGyaT5Dpo— Matthew A Moreno (@MorenoMatthewA) June 15, 2020