
Visual C++ intrinsics are not as good as inline assembler

I’ve used intrinsics to write some simple SIMD code for SSE2, and they’re pretty handy. They map pretty closely to the assembler output, and generally give enough control over the use of special instructions that inline assembler wouldn’t really be useful. If you need this kind of thing, intrinsics should be your first choice. But sometimes, I think inline assembler would be nice, because there are things that it does better than intrinsics, even when suitable intrinsics are available.

I’m writing some benchmarking code to measure some system properties. I’m trying to measure things like cache and memory latency, so I need a high precision timer to do this; we’re talking operations that take a few dozen processor cycles or so.

The natural choice for a timer is the processor timestamp counter. It runs at the processor’s frequency and, usefully for these measurements, on the processors I care about it’s synchronized across cores.

To read the counter, you issue the rdtsc instruction. It copies the 64-bit counter into two registers; the low bits go in eax, the high bits in edx. I don’t understand why x64 doesn’t just place all 64 bits into rax, but x86-64 is an extraordinarily lazy extension to x86 that is horrible in lots of places, so what difference does one more make, really?

So it should be simple: put a rdtsc at the start and end of the block of code I want to measure, then subtract one value from the other to see how many cycles it all took.
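In terms of the intrinsic I’ll come back to below, the naive version is a sketch along these lines (not yet reliable, for reasons we’re about to get into):

#include <intrin.h>  // __rdtsc

unsigned __int64 start = __rdtsc();
// the block of code I want to measure
unsigned __int64 end = __rdtsc();
unsigned __int64 cycles = end - start;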

Unfortunately, for reasons that aren’t at all clear to me as they appear to undermine the instruction’s main use case, rdtsc is not a serializing instruction. The processor’s out-of-order execution machinery can reorder instructions around rdtsc, so instructions that occur after rdtsc in program order can end up running before rdtsc in execution order, and likewise, instructions that occur before rdtsc in program order can end up running after rdtsc in execution order.

So if I’m trying to measure a block of instructions, the first rdtsc might actually run after the block I’m measuring has started, and likewise, the second rdtsc might run before the block has finished. Stupid.

The solution is to emit a serializing instruction before the rdtsc. A serializing instruction is one that the out-of-order machinery isn’t allowed to shuffle other instructions around. There’s no explicit serialize instruction, however; serialization is just a side-effect of certain other instructions. The big ones are those that change processor modes or mess with address spaces in one way or another; obviously, speculative execution around these is going to be problematic, because they might change the meaning of speculatively decoded instructions or totally invalidate speculatively read memory.

Because these serializing instructions mostly do serious and important things to processor modes, they’re almost all restricted to ring 0. The kernel can mess with them, but regular user code cannot. As such, they’re useless for timing regular user mode routines.

There is, however, one exception: cpuid. The cpuid instruction is used to access all kinds of processor-specific information. cpuid’s main role is to list all the various instruction set extensions and capabilities that have been added over the years. You can use it to ask the processor for its name/branding, for the company that built it, for a description of its cache topology, and for a number of other things.
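For example, asking leaf 0 for the vendor string is a short sketch with the __cpuid intrinsic (discussed below); the four array slots correspond to eax, ebx, ecx, and edx, and the vendor string is spread across ebx, edx, and ecx in that order:

#include <intrin.h>  // __cpuid
#include <cstring>   // std::memcpy

__int32 regs[4];                       // eax, ebx, ecx, edx
__cpuid(regs, 0);                      // leaf 0: max leaf in eax, vendor string in ebx/edx/ecx
char vendor[13] = {};
std::memcpy(vendor + 0, &regs[1], 4);  // ebx
std::memcpy(vendor + 4, &regs[3], 4);  // edx
std::memcpy(vendor + 8, &regs[2], 4);  // ecx
// vendor now holds e.g. "GenuineIntel" or "AuthenticAMD"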

I don’t really know why cpuid is serializing. Perhaps Intel just decided “well, we need to have some way of serializing that user mode code can run, and this instruction isn’t performance critical anyway”, or something. I assume it’s microcoded, and it can take literally hundreds of cycles to run, so perhaps its sheer length is what makes it serializing: it simply runs so much code that it floods the out-of-order buffers. I don’t know. If we were designing the ideal serializing instruction for use with rdtsc, it wouldn’t be cpuid: it overwrites four general-purpose 32-bit registers (eax, ebx, ecx, and edx). rdtsc clobbers two of those (eax and edx) anyway, so if the timed code needs particular values in them, it has to set them inside the timed portion regardless; but the other two get trashed purely because cpuid screwed up your initial state.

Technically, we might have a couple of other options, but they’re problematic. Intel’s instruction manual says that lfence waits for all prior instructions to complete, and does not allow any subsequent instructions to start until the fence operation is complete. This ensures that all loads have completed and prevents any speculative loads from later instructions from starting (though stores may still be in flight). That’s generally good enough for rdtsc, and it avoids clobbering any registers, so Intel recommends the sequence lfence; rdtsc.

AMD, however, makes no such promise for lfence: on AMD chips, lfence is not documented as waiting for in-flight instructions, nor as blocking execution of further instructions. AMD does make that promise for mfence, though.

So, what to do? To cover both my bases, I could use back-to-back lfence; mfence and call it close enough, or I could do things the traditional way and stick with cpuid.

All of these instructions have intrinsics so that I can call them directly in C++; they’re __cpuid(__int32[4], __int32)/__cpuidex(__int32[4], __int32, __int32), _mm_lfence(), and _mm_mfence().
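With the intrinsics, the fence-flavoured version of the timing read is a short sketch (ordering as in the lfence; mfence idea above):

#include <emmintrin.h>  // _mm_lfence, _mm_mfence
#include <intrin.h>     // __rdtsc

_mm_lfence();                          // Intel documents lfence as ordering rdtsc
_mm_mfence();                          // AMD documents the equivalent promise for mfence
unsigned __int64 start = __rdtsc();
// the block of code I want to measure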

But it’s here that Visual C++’s intrinsics are a little bit annoying. The intrinsics mostly map one to one with an actual instruction, but not quite. The rdtsc intrinsic, for example, doesn’t just emit a rdtsc. The code

unsigned __int64 counter = __rdtsc();

emits the sequence:

preserve values of rax and rdx, if necessary
rdtsc
shl rdx, 32
or rax, rdx
mov qword ptr &counter, rax

This usually makes sense; we want those values to be usable in regular C++, so of course they get written to C++ variables. The code for __cpuid is similar; the values from the four 32-bit registers are copied to the array of four __int32s passed as the first parameter:

cpuid
mov dword ptr [first integer in array], eax
mov dword ptr [second integer in array], ebx
mov dword ptr [third integer in array], ecx
mov dword ptr [fourth integer in array], edx

Problem is, we don’t care about the value from cpuid in this situation. We’re only interested in its side-effect as a serializing instruction. But the intrinsic means that we have to have those four move instructions no matter what.

What this means is that when we write:

std::array<__int32, 4> unused;
__cpuid(unused.data(), 0); // we only want this for serialization!
unsigned __int64 counter = __rdtsc();

We get code like this:

cpuid
mov dword ptr &unused[0], eax
mov dword ptr &unused[1], ebx
mov dword ptr &unused[2], ecx
mov dword ptr &unused[3], edx
rdtsc
shl rdx, 32
or rax, rdx
mov qword ptr &counter, rax

… and the processor is free to issue those four moves after the rdtsc. Which is exactly the thing we were trying to avoid by using cpuid in the first place. Now granted, two of those moves use values of registers that rdtsc will overwrite, so I would imagine they’re less likely to get reordered. But the other two are fair game.

There is a second instruction that’s similar to rdtsc, called rdtscp. rdtscp does everything rdtsc does, and more; it also writes the processor’s ID into the ecx register. Moreover, it’s semi-serializing: instructions that come before the rdtscp in program order must complete before the rdtscp executes. However, instructions after the rdtscp can still be moved ahead of it.

As such, while it’s still useful to us, it doesn’t remove the need to use cpuid. On top of that, the rdtscp intrinsic has the same issue as the cpuid intrinsic. The intrinsic is unsigned __int64 __rdtscp(unsigned __int32* processor_id), and as you’d expect from __cpuid(), it always emits a move instruction to store the value of ecx in case we wanted it.
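In isolation, it looks something like this (a sketch):

#include <intrin.h>  // __rdtscp

unsigned __int32 processor_id;                   // written from ecx whether we want it or not
unsigned __int64 end = __rdtscp(&processor_id);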

Overall, Intel’s recommended sequence of instructions is:

cpuid // ensure that nothing earlier can be moved below the rdtsc
rdtsc // immediately clobber the result of cpuid
mov dword ptr &start_hi, edx
mov dword ptr &start_lo, eax
code we want to time
rdtscp
mov dword ptr &end_hi, edx
mov dword ptr &end_lo, eax
cpuid // ensure that nothing later can be moved above the rdtscp

But our code using intrinsics:

std::array<__int32, 4> unused;
__cpuid(unused.data(), 0); // we only want this for serialization!
unsigned __int64 start = __rdtsc();
// code we want to time
unsigned __int32 also_unused;
unsigned __int64 end = __rdtscp(&also_unused);
__cpuid(unused.data(), 0);

generates:

cpuid
mov dword ptr &unused[0], eax
mov dword ptr &unused[1], ebx
mov dword ptr &unused[2], ecx
mov dword ptr &unused[3], edx
rdtsc
shl rdx, 32
or rax, rdx
mov qword ptr &start, rax
code we want to time
rdtscp
shl rdx, 32
or rax, rdx
mov qword ptr &end, rax
mov dword ptr &also_unused, ecx
cpuid
mov dword ptr &unused[0], eax
mov dword ptr &unused[1], ebx
mov dword ptr &unused[2], ecx
mov dword ptr &unused[3], edx

The last four moves don’t matter here, because they’re outside the timed section. The saving of the processor ID is also outside the timed section, but it’s why we don’t want to use rdtscp for the start counter—if we did that, the save would then be inside the timed section.
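For completeness (this bookkeeping is my own addition, not part of Intel’s recommended sequence), the measurement itself is then just the difference of the two counters, and it’s worth running the same sequence with nothing in between to estimate the overhead of the timing scaffolding itself:

unsigned __int64 cycles = end - start;

// Estimate the cost of the timing sequence by timing an empty block; in
// practice, take the minimum over many runs rather than trusting one sample.
__cpuid(unused.data(), 0);
unsigned __int64 empty_start = __rdtsc();
unsigned __int64 empty_end = __rdtscp(&also_unused);
__cpuid(unused.data(), 0);
unsigned __int64 overhead = empty_end - empty_start;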

What we really want is a way to simply ignore the result of the intrinsics; issue the cpuid instruction but then completely ignore the values it writes to the registers. The intrinsics don’t do this, unfortunately, and the optimizer doesn’t notice that the writes are dead and can be eliminated (my guess is that the optimizer isn’t allowed to touch code generated by intrinsics, but maybe it’s not allowed to do that kind of dead write elimination anyway, I don’t know).

All of this is easy to do with inline assembler. Unfortunately, Visual C++ doesn’t support any inline assembler in 64-bit mode (though it continues to do so in 32-bit mode). gcc’s inline assembler is annoying, as it defaults to garbage AT&T syntax rather than superior Intel syntax (AT&T syntax is godawful for SIB addressing—seriously who could prefer subl -0x20(%ebx,%ecx,0x4),%eax to sub eax,[ebx+ecx*4h-20h]? Nobody, that’s who), but it lets us solve this kind of problem very easily:

asm volatile(
    "cpuid\n\t"
    "rdtsc\n\t"
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    : "=r" (start_hi), "=r" (start_lo) // output %0 is &start_hi, output %1 is &start_lo
    :                                  // no input
    : "%rax", "%rbx", "%rcx", "%rdx"   // clobbers rax, rbx, rcx, rdx
);
// code we want to time
asm volatile(
    "rdtscp\n\t"
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    "cpuid\n\t"
    : "=r" (end_hi), "=r" (end_lo)   // output %0 is &end_hi, output %1 is &end_lo
    :                                // no input
    : "%rax", "%rbx", "%rcx", "%rdx" // clobbers rax, rbx, rcx, rdx
);
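The two halves then just need stitching back together on the C++ side (a sketch, assuming start_hi, start_lo, end_hi, and end_lo are plain 32-bit unsigned variables):

#include <cstdint>

uint64_t start  = (uint64_t(start_hi) << 32) | start_lo;
uint64_t end    = (uint64_t(end_hi)   << 32) | end_lo;
uint64_t cycles = end - start;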

But regrettably, I’m not willing to give up Visual C++ just to get inline assembler.

Sometimes, inline assembler could be replaced by writing a function in MASM and linking that separately. But that’s not a good fit here for much the same reason that the intrinsics are an issue; we don’t want to have to capture the overhead of a function call and its return when timing a piece of code. We want there to be as little extra “stuff” between the rdtsc and rdtscp as we can possibly manage, and that means we want inline assembler.