std::bitset plus -msse4.2 equals popcnt

My research group, the Digital Evolution Lab at Michigan State University, uses computer programs as a model organism to study evolution (and, also, evolution as tool to create computer programs). In recent work with the event-driven SignalGP representation, we need to wire together an evolving computer program’s sub-modules in order to run it. We figure out what wires together by matching bitstring tags. Because we calculate the matches between a lot of bitstring tags in our experiments, it’s important for this operation to be as efficient as possible.

Gotta go fast!

In practice, calculating tag matches involves performing some operation on the two tags (e.g., bitwise xor) and then counting the number of 1’s in the resulting bitstring. Under the hood, we use uint64_t’s and uint_32_t’s to represent components of our bitstrings. So, we want to be able to the count 1’s in those types quickly.

Luckily, x86 CPUs provide a popcnt machine-level instruction as part of the SSE 4.2 instruction set, which does all the work to the count 1’s in a register in just one clock cycle. However, directly writing popcnt into our source code was pretty crufty. Here’s an idea of what that looks like.

This snippet was pulled from an older version of our lab’s Empirical C++ library.

#include <nmmintrin.h>

...

#ifdef __SSE4_2__
          bit_count += _mm_popcnt_u64(bit_set[i]); // needs CPU with SSE4.2
#else
          // adapted from https://en.wikipedia.org/wiki/Hamming_weight
          uint64_t x = bit_set[i];
          // put count of each 2 bits into those 2 bits
          x -= (x >> 1) & 0x5555555555555555;
          // put count of each 4 bits into those 4 bits
          x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333);
          //put count of each 8 bits into those 8 bits
          x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0f;
          // returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ...
          bit_count += (x * 0x0101010101010101) >> 56;
#endif

Up top, we include <nmmintrin.h>. This header isn’t part of the C++ standard, so we’re potentially setting ourselves up for a big headache if we try to compile on a system where the header isn’t available.

Then, we have to use a preprocessor directive (a.k.a. a macro) to separately handle cases where SSE 4.2 instructions are and are not available. This means we have to maintain twice the code! This is especially annoying because the non-SSE 4.2 code involves some hardcore inscrutable bitmagic.

To boot, macros are increasingly considered harmful a la goto (or at least yucky) in modern C++ so we’d rather do without them anyways.

What to do?

🔗 🐤 Peep the Intermediate Representation

The C++ standard library provides a class called std::bitset that represents a bit sequence and provides a count member function that reports the number of 1’s the sequence contains… exactly what we want! It would be really slick to be able to shove the bits we want to count into a std::bitset and then call count, but will it go fast? Will we have to pay overhead to construct a std::bitset from a uint32_t or a uint64_t? Will std::bitset’s count method use a popcnt machine instruction or rely on a slower method?

Let’s find out!

Here’s a Compiler Explorer example that shows what wrapping a uint in std::bitset and then calling count compiles down to with Clang. Note, in particular, the intermediate representation code on the right that corresponds to countbits32 and countbits64.

There’s no overhead with a uint64_t (just a plain ol’ popcnt and then a return). For uint32_t we have to execute a single mov instruction in addition to the popcnt, which isn’t too bad at all. Success!

Here’s the same example, but compiled with GCC. We get similar results.

The key, it turns out, to getting popcnt from std::bitset::count is compiling with the -mmsse4.2 flag. Try taking it out of the examples above. You should be able to see the intermediate representation that the source compiles to balloon up.

Although this code won’t be as fast if SSE 4.2 instructions aren’t available, we can depend on the standard library implementors to provide a correct (as well as reasonably performant) implementation of std::bitset::count so our code at least doesn’t break. We wouldn’t have been able to use popcnt anyways if SSE 4.2 instructions weren’t available, so we’re not losing much, if anything, in this case.

🔗 ✨ Source Code Transformation ✨

The counterpart to the big, ugly code snippet above after we started using std::bitset really shows off how much of our source’s complexity we were able to hand off to the standard library implementors.

This snippet was pulled from an newer version of our lab’s Empirical C++ library.

    // when compiling with -O3 and -msse4.2,
    // this compiles to a hardware POPCNT instruction
    std::bitset<FIELD_BITS> std_bs(bit_set[i]);
    bit_count += std_bs.count();

All without sacrificing performance… nice!

🔗 Shout Out

… to Santiago Rodriguez-Papa @rodsan0, who suggested the -msse4.2 compiler flag and has been hard at work maintaining Empirical’s BitSet and BitVector. Thank you!

🔗 Let’s Chat

Comments? Questions?

I started a twitter thread (right below) so we can chat!! ☎️ ☎️ ☎️

& of course I threw my hat in the ring to share how the -mmsse4.2 flag can help you compile down your std::bitset operations to a single machine-level instruction 🤦 🤷https://t.co/HWGyaT5Dpo
— Matthew A Moreno (@MorenoMatthewA) June 15, 2020

CSE 491

Advanced C++ Seminar at Michigan State University

std::bitset plus -msse4.2 equals popcnt

🔗 🐤 Peep the Intermediate Representation

🔗 ✨ Source Code Transformation ✨

🔗 Shout Out

🔗 Let’s Chat