My research group, the Digital Evolution Lab at Michigan State University, uses computer programs as a model organism to study evolution (and, also, evolution as tool to create computer programs). In recent work with the event-driven SignalGP representation, we need to wire together an evolving computer program’s sub-modules in order to run it. We figure out what wires together by matching bitstring tags. Because we calculate the matches between a lot of bitstring tags in our experiments, it’s important for this operation to be as efficient as possible.

In practice, calculating tag matches involves performing some operation on the two tags (e.g., bitwise xor) and then counting the number of 1’s in the resulting bitstring. Under the hood, we use `uint64_t`’s and `uint_32_t`’s to represent components of our bitstrings. So, we want to be able to the count 1’s in those types quickly.

Luckily, x86 CPUs provide a `popcnt` machine-level instruction as part of the SSE 4.2 instruction set, which does all the work to the count 1’s in a register in just one clock cycle. However, directly writing `popcnt` into our source code was pretty crufty. Here’s an idea of what that looks like.

This snippet was pulled from an older version of our lab’s Empirical C++ library.

``````#include <nmmintrin.h>

...

#ifdef __SSE4_2__
bit_count += _mm_popcnt_u64(bit_set[i]); // needs CPU with SSE4.2
#else
uint64_t x = bit_set[i];
// put count of each 2 bits into those 2 bits
x -= (x >> 1) & 0x5555555555555555;
// put count of each 4 bits into those 4 bits
x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333);
//put count of each 8 bits into those 8 bits
x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0f;
// returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ...
bit_count += (x * 0x0101010101010101) >> 56;
#endif
``````

Up top, we include `<nmmintrin.h>`. This header isn’t part of the C++ standard, so we’re potentially setting ourselves up for a big headache if we try to compile on a system where the header isn’t available.

Then, we have to use a preprocessor directive (a.k.a. a macro) to separately handle cases where SSE 4.2 instructions are and are not available. This means we have to maintain twice the code! This is especially annoying because the non-SSE 4.2 code involves some hardcore inscrutable bitmagic.

To boot, macros are increasingly considered harmful a la `goto` (or at least yucky) in modern C++ so we’d rather do without them anyways.

What to do?

## 🔗 🐤 Peep the Intermediate Representation

The C++ standard library provides a class called `std::bitset` that represents a bit sequence and provides a `count` member function that reports the number of 1’s the sequence contains… exactly what we want! It would be really slick to be able to shove the bits we want to count into a `std::bitset` and then call `count`, but will it go fast? Will we have to pay overhead to construct a `std::bitset` from a `uint32_t` or a `uint64_t`? Will `std::bitset`’s `count` method use a `popcnt` machine instruction or rely on a slower method?

Let’s find out!

Here’s a Compiler Explorer example that shows what wrapping a `uint` in `std::bitset` and then calling `count` compiles down to with Clang. Note, in particular, the intermediate representation code on the right that corresponds to `countbits32` and `countbits64`.

There’s no overhead with a `uint64_t` (just a plain ol’ `popcnt` and then a `return`). For `uint32_t` we have to execute a single `mov` instruction in addition to the `popcnt`, which isn’t too bad at all. Success!

Here’s the same example, but compiled with GCC. We get similar results.

The key, it turns out, to getting `popcnt` from `std::bitset::count` is compiling with the `-mmsse4.2` flag. Try taking it out of the examples above. You should be able to see the intermediate representation that the source compiles to balloon up.

Although this code won’t be as fast if SSE 4.2 instructions aren’t available, we can depend on the standard library implementors to provide a correct (as well as reasonably performant) implementation of `std::bitset::count` so our code at least doesn’t break. We wouldn’t have been able to use `popcnt` anyways if SSE 4.2 instructions weren’t available, so we’re not losing much, if anything, in this case.

## 🔗 ✨ Source Code Transformation ✨

The counterpart to the big, ugly code snippet above after we started using `std::bitset` really shows off how much of our source’s complexity we were able to hand off to the standard library implementors.

This snippet was pulled from an newer version of our lab’s Empirical C++ library.

``````    // when compiling with -O3 and -msse4.2,
// this compiles to a hardware POPCNT instruction
std::bitset<FIELD_BITS> std_bs(bit_set[i]);
bit_count += std_bs.count();
``````

All without sacrificing performance… nice!

## 🔗 Shout Out

… to Santiago Rodriguez-Papa @rodsan0, who suggested the `-msse4.2` compiler flag and has been hard at work maintaining Empirical’s `BitSet` and `BitVector`. Thank you!