## Tuesday, May 16, 2017

### Integer division is slow

The CppCon 2016 talk “I Just Wanted a Random Integer!” benchmarks randomization functionality from the C++ standard library (using GCC 5.1). There is one surprising result — the loop
```c++
std::random_device entropySource;
std::mt19937 randGenerator(entropySource());
std::uniform_int_distribution<int> theIntDist(0, 99);

for (int i = 0; i < 1000000000; i++) {
    volatile auto r = theIntDist(randGenerator);
}
```

needs 23.4 seconds to run, while
```c++
std::random_device entropySource;
std::mt19937 randGenerator(entropySource());

for (int i = 0; i < 1000000000; i++) {
    std::uniform_int_distribution<int> theIntDist(0, 99);
    volatile auto r = theIntDist(randGenerator);
}
```

runs in 5.1 seconds. But the latter should intuitively be slower, as it does more work in the loop...
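A minimal harness along these lines reproduces the comparison (a sketch: the helper name `time_draws` is mine, and the iteration count here is much smaller than the 10^9 used in the talk):

```cpp
#include <chrono>
#include <random>

// Hypothetical helper: time n draws from a distribution constructed inside
// the function. Moving the distribution declaration into the loop body
// reproduces the second variant from the talk.
double time_draws(int n) {
    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<int> dist(0, 99);
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; i++) {
        volatile int r = dist(gen);  // volatile keeps the draw from being optimized away
        (void)r;
    }
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}
```

The `volatile` matters: without it the compiler may delete the whole loop, since the result is unused.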

### Code expansion

The functionality in the standard library is implemented using template magic, but the compiler’s view of the code after inlining and basic simplification is that
```c++
std::uniform_int_distribution<int> theIntDist(0, 99);
```

is just defining and initializing a structure
```c++
struct {
    int a, b;
} theIntDist;
theIntDist.a = 0;
theIntDist.b = 99;
```

while the call
```c++
volatile auto r = theIntDist(randGenerator);
```

is expanded to the equivalent of
```c++
uint64_t ret;
uint64_t urange = theIntDist.b - theIntDist.a;
if (0xffffffff > urange) {
    const uint64_t uerange = urange + 1;
    const uint64_t scaling = 0xffffffff / uerange;
    const uint64_t past = uerange * scaling;
    do {
        ret = mersenne_twister_engine(randGenerator);
    } while (ret >= past);
    ret /= scaling;
} else {
    ...
    uniform_int_distribution(&theIntDist, randGenerator);
    ...
    ret = ...
}
volatile int r = ret + theIntDist.a;
```

where I have used ... for code that is not relevant for the rest of the discussion.
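For reference, the expanded algorithm can be written as a stand-alone function (a sketch: `sample` is my name for it, and it only implements the `urange < 0xffffffff` branch shown above):

```cpp
#include <cstdint>
#include <random>

// Rejection sampling as in the expanded code: draw 32-bit values and
// discard those that would make the result non-uniform. Assumes
// b - a < 0xffffffff, i.e. the first branch of the expansion.
int sample(std::mt19937& gen, int a, int b) {
    const uint64_t urange = uint64_t(b) - uint64_t(a);
    const uint64_t uerange = urange + 1;
    const uint64_t scaling = 0xffffffffu / uerange;
    const uint64_t past = uerange * scaling;
    uint64_t ret;
    do {
        ret = gen();  // one 32-bit draw from the Mersenne Twister
    } while (ret >= past);
    return int(ret / scaling) + a;
}
```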

### Optimization differences

It is now easy to see why the second case is faster — creating theIntDist in the loop makes it trivial for the compiler to determine that urange has the value 99, and the code simplifies to
```c++
uint64_t ret;
do {
    ret = mersenne_twister_engine(randGenerator);
} while (ret >= 4294967200);
ret /= 42949672;
volatile int r = ret;
```
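The folded constants are easy to verify by hand; a small sketch checking the arithmetic for the [0, 99] case (variable names are mine):

```cpp
#include <cstdint>

// What the compiler folds for a [0, 99] distribution:
constexpr uint64_t uerange = 100;                    // 99 - 0 + 1
constexpr uint64_t scaling = 0xffffffffu / uerange;  // 4294967295 / 100
constexpr uint64_t past = uerange * scaling;

static_assert(scaling == 42949672, "matches ret /= 42949672");
static_assert(past == 4294967200, "matches ret >= 4294967200");
```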

This simplification is not possible when theIntDist is created outside of the loop — the compiler sees that the loop calls uniform_int_distribution with a reference to theIntDist, so it must assume that the values of theIntDist.a and theIntDist.b may change during execution, and it therefore cannot do the constant folding. The function does not, however, modify theIntDist, so both versions of the program do the same work — but the slow version needs one extra comparison/branch and a few extra arithmetic instructions for each loop iteration.
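One way to keep the declaration outside the loop and still allow the folding is to hand the call a fresh local copy each iteration, so that theIntDist itself never escapes (a sketch, with the helper name mine; whether the compiler actually folds the bounds depends on inlining and optimization level):

```cpp
#include <random>

// The distribution is declared outside the loop, but the call operator is
// invoked on a local copy, so the original never has its address taken and
// its bounds stay visible as the constants 0 and 99.
bool all_in_range(std::mt19937& randGenerator, int n) {
    std::uniform_int_distribution<int> theIntDist(0, 99);
    bool ok = true;
    for (int i = 0; i < n; i++) {
        auto local = theIntDist;  // fresh copy: bounds are known constants
        int r = local(randGenerator);
        ok = ok && (0 <= r && r <= 99);
    }
    return ok;
}
```

Copying the distribution each iteration is cheap here, since uniform_int_distribution carries no hidden state beyond its bounds.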

### The cost of division

The mersenne_twister_engine function is not big, but it is not trivial — it executes about 40 instructions — so it is surprising that adding a few instructions to the loop makes the program four times slower. I described a similar case in a previous blog post where the problem was due to branch mis-prediction, but the branch is perfectly predicted in this example.

The reason here is that the slow loop needs to execute an integer division instruction when calculating scaling, and integer division is expensive — Agner Fog’s instruction tables say that a 64-bit division may need up to 103 cycles on the Broadwell microarchitecture! This usually does not matter too much for normal programs, as the compiler tries to schedule the division instructions so that they have as much time as possible to execute before the result is needed, and the CPU can in general continue executing other instructions out of order while waiting for the result of the division. But it makes a big difference in this kind of micro-benchmark, as the compiler cannot move the division earlier, and the CPU runs out of work to do out of order because the mersenne_twister_engine function executes much faster than the division.
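When the divisor is loop-invariant but only known at run time, the division can be hoisted in software: precompute a fixed-point reciprocal once, then replace each division by a multiply and a shift. A sketch of this well-known trick (the struct name is mine; it assumes a compiler providing `__int128`, such as GCC or Clang, and a divisor in [2, 2^32)):

```cpp
#include <cstdint>

// Precompute once per divisor d (2 <= d < 2^32); afterwards dividing any
// 32-bit numerator costs one 64x64->128 multiply and a shift instead of a
// hardware divide. m is the 64-bit fixed-point ceiling of 2^64 / d.
struct FastDiv {
    uint64_t m;
    explicit FastDiv(uint32_t d) : m(UINT64_MAX / d + 1) {}
    uint32_t divide(uint32_t n) const {
        return uint32_t(((unsigned __int128)m * n) >> 64);
    }
};
```

Note the d == 1 case is excluded: UINT64_MAX / 1 + 1 wraps to zero, so it would need special handling. This is essentially what compilers themselves emit for division by a compile-time constant; the point of doing it manually is to cover divisors that are only constant per loop, not per program.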

1. If a compiler wouldn't be able to do anything else useful while waiting for integer division results, could a user-written routine outperform the hardware-divide instruction? A 64-bit division would seem long enough that even if a little setup would be needed to allow eight bits of quotient to be computed per successive-approximation step, reducing the number of iterative steps from 64 to 8 could pay off.

1. You should not be able to beat the hardware instruction in general — if you could, the CPU vendor could have saved some silicon area by not implementing the instruction. Or a good compiler would call an asm implementation instead of generating the slow instruction. 😀

But it might be possible if you have special knowledge of the possible input values. Note, however, that the 103 cycles is what the document lists as the worst case. The division in this example seems to execute in about 40 cycles, which is much harder to beat with a software implementation...

2. If the hardware took a constant 103 cycles that could be 100% overlapped with other instructions that didn't need the result, while a software solution took 80 cycles but couldn't overlap anything, there would be many cases where the hardware solution would be "better". Otherwise, I'm curious how much time would be required to e.g. convert the divisor to "float" or "double", compute the approximate or actual reciprocal, separate out the significand and exponent, and then use a successive approximation. Given a 53-bit reciprocal, two iterations should yield a result within +/-1 of being correct, and one more compare/subtract should be able to clean that up.

2. My thought is when you say the compiler can't tell that the distribution hasn't changed - can it work with a distribution which is declared const? Could a compiler, in theory, use such information to lead it to the same realisation? Or would the ability to cast away const qualifiers mean the compiler could just never make that assumption?

1. I believe the compiler would be able to do the optimization if the distribution were declared const. But that gives a syntax error — no matching function for call to object of type `const std::uniform_int_distribution<int>`.

3. An alternative is to just use bit masking:

```c++
uint64_t m = 0xffffffff >> __builtin_clz(range);
do {
    ret = randGenerator() & m;
} while (ret > range);
```

The downside is that this calls the RNG twice in expectation for bad choices of range, and probably yields worse results with bad RNGs since bits are not mixed any further, but in a quick experiment I did it was never slower than division for mt19937, and a lot faster with a fast RNG (xoroshiro128+).
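Made self-contained, the masking approach might look like this (a sketch: the function name is mine, `__builtin_clz` assumes GCC or Clang, and `bound` is the inclusive upper limit, which must be nonzero since `__builtin_clz(0)` is undefined):

```cpp
#include <cstdint>
#include <random>

// Mask draws down to the smallest power of two covering [0, bound], and
// retry when the masked value overshoots. Because the mask covers at most
// twice the range, fewer than half of the draws are rejected.
uint32_t bounded_rand_mask(std::mt19937& gen, uint32_t bound) {
    uint32_t mask = 0xffffffffu >> __builtin_clz(bound);  // bound > 0
    uint32_t ret;
    do {
        ret = gen() & mask;
    } while (ret > bound);
    return ret;
}
```

No division appears anywhere, which is why this stays fast even when the compiler cannot constant-fold the bound.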