I’ve used intrinsics to write some simple SIMD code for SSE2, and they’re pretty handy. They map pretty closely to the assembler output, and generally give enough control over the use of special instructions that inline assembler wouldn’t really be useful. If you need this kind of thing, intrinsics should be your first choice. But sometimes, I think inline assembler would be nice, because there are things that it does better than intrinsics, even when suitable intrinsics are available.
I’m writing some benchmarking code to measure some system properties. I’m trying to measure things like cache and memory latency, so I need a high precision timer to do this; we’re talking operations that take a few dozen processor cycles or so.
The natural choice for a timer is the processor timestamp counter. It runs at processor frequency and, usefully for these measurements, for processors that I care about the counter is synchronized across cores.
To read the counter, you issue the
rdtsc instruction. It copies the 64-bit counter into two registers; the low bits go in
eax, the high bits in
edx. I don’t understand why x64 doesn’t just place all 64 bits into
rax, but x86-64 is an extraordinarily lazy extension to x86 that is horrible in lots of places, so what difference does one more make, really?
So it should be simple; put a
rdtsc at the start and end of the block of code I want to measure, then compare the two values to see how many cycles it all took.
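As a rough sketch of that naive approach, the measurement might look like the following. The helper name naive_cycles is made up for illustration, and the #if is only there so that the same snippet builds with Visual C++ or with gcc/clang on x86.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>      // Visual C++: declares __rdtsc
#else
#include <x86intrin.h>   // gcc/clang: declares __rdtsc
#endif

// Hypothetical helper: returns the difference between two plain rdtsc reads
// taken around `work`. Nothing here stops the processor from reordering
// instructions across either read.
template <typename Work>
std::uint64_t naive_cycles(Work&& work) {
    const std::uint64_t start = __rdtsc();
    work();
    const std::uint64_t end = __rdtsc();
    assert(end >= start);  // assumes the counter did not go backwards
    return end - start;
}
```

Calling naive_cycles with a lambda that wraps the code under test returns the raw tick difference, with no protection against reordering.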
Unfortunately, for reasons that aren’t at all clear to me as they appear to undermine the instruction’s main use case,
rdtsc is not a serializing instruction. The processor’s out-of-order execution machinery can reorder instructions around
rdtsc, so instructions that occur after
rdtsc in program order can end up running before
rdtsc in execution order, and likewise, instructions that occur before
rdtsc in program order can end up running after
rdtsc in execution order.
So if you’re trying to measure a block of instructions, this means that the first rdtsc might run after the block has actually started, and similarly, the second rdtsc might run before the block has finished. Stupid.
The solution is to emit a serializing instruction before the
rdtsc. A serializing instruction is one that the out-of-order machinery can’t move other instructions across. There’s no explicit serialize instruction, however; it’s just a side-effect of certain other instructions. The big ones are those that change processor modes or mess with address spaces in one way or another; obviously speculative execution around these is going to be problematic, because they might change the meaning of speculatively decoded instructions or totally invalidate speculatively read memory.
Because these serializing instructions mostly do serious and important things to processor modes, they’re almost all restricted to ring 0. The kernel can mess with them, but regular user code cannot. As such, they’re useless for timing regular user mode routines.
There is, however, one exception: cpuid. The cpuid instruction is used to access all kinds of processor-specific information. Its main role is to list all the various instruction set extensions and capabilities that have been added over the years. You can use it to ask the processor for its name/branding, the company that built it, its cache topology, and a number of other things.
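To make that concrete, here is a small sketch that asks cpuid for the vendor string from leaf 0. The helper names cpuid_leaf and cpu_vendor are my own, and the gcc/clang branch assumes the __get_cpuid helper from <cpuid.h>; treat it as an illustration rather than production code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#if defined(_MSC_VER)
#include <intrin.h>   // Visual C++: __cpuid(int[4], int)
#else
#include <cpuid.h>    // gcc/clang: __get_cpuid(leaf, &eax, &ebx, &ecx, &edx)
#endif

// Hypothetical helper: runs cpuid for `leaf` and returns {eax, ebx, ecx, edx}.
inline std::array<std::uint32_t, 4> cpuid_leaf(std::uint32_t leaf) {
    std::array<std::uint32_t, 4> regs{};
#if defined(_MSC_VER)
    int raw[4] = {0, 0, 0, 0};
    __cpuid(raw, static_cast<int>(leaf));
    for (int i = 0; i < 4; ++i) regs[i] = static_cast<std::uint32_t>(raw[i]);
#else
    const int ok = __get_cpuid(leaf, &regs[0], &regs[1], &regs[2], &regs[3]);
    assert(ok == 1);  // assumes the requested leaf is supported
    (void)ok;
#endif
    return regs;
}

// Leaf 0 returns the 12-character vendor string in ebx, edx, ecx (in that order).
inline std::string cpu_vendor() {
    const std::array<std::uint32_t, 4> r = cpuid_leaf(0);
    char text[12];
    std::memcpy(text + 0, &r[1], 4);  // ebx
    std::memcpy(text + 4, &r[3], 4);  // edx
    std::memcpy(text + 8, &r[2], 4);  // ecx
    return std::string(text, 12);
}
```

Note that even this tiny query hands back all four registers, whether or not the caller wants them.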
I don’t really know why
cpuid is serializing. Perhaps Intel just decided “well, we need to have some way of serializing that user mode code can run, and this instruction isn’t performance critical anyway”, or something. I assume it’s microcoded, and it can take literally hundreds of cycles to run, so perhaps its sheer length is what makes it serializing—it simply runs so much code that it floods out-of-order buffers. I don’t know. If we were designing the ideal serializing instruction for use with
rdtsc, it wouldn’t be
cpuid; it writes to four of the general-purpose registers (eax, ebx, ecx, and edx). rdtsc overwrites two of those registers anyway, so setting them to the correct initial values will always have to happen within the timed portion of your code, but the other two have to be restored only because cpuid screwed up your initial state.
Technically, we might have a couple of other options, but they’re problematic. Intel in its instruction manual says that
lfence waits for all prior instructions to complete, and does not allow any subsequent instructions to start until the fence operation is complete. This ensures that all loads are completed and prevents any speculative loads from later instructions from occurring (though stores may still be in flight). This is generally good enough for rdtsc, and it avoids clobbering any registers, so Intel recommends the use of the sequence lfence; rdtsc.
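In C++, a fenced read could be sketched with the fence intrinsic placed directly before the counter read. The names fenced_rdtsc and fenced_cycles are hypothetical, and the trailing fence is my own addition to keep later instructions from starting before the counter has been read; the whole thing relies on Intel's documented lfence behaviour.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>      // Visual C++: __rdtsc, _mm_lfence
#else
#include <x86intrin.h>   // gcc/clang: __rdtsc, _mm_lfence
#endif

// Hypothetical helper: lfence, then rdtsc, then lfence again.
inline std::uint64_t fenced_rdtsc() {
    _mm_lfence();                       // wait for earlier instructions to complete
    const std::uint64_t ticks = __rdtsc();
    _mm_lfence();                       // hold back later instructions until the read is done
    return ticks;
}

// Hypothetical helper: ticks spent in `work`, measured with fenced reads.
template <typename Work>
std::uint64_t fenced_cycles(Work&& work) {
    const std::uint64_t start = fenced_rdtsc();
    work();
    const std::uint64_t end = fenced_rdtsc();
    assert(end >= start);  // assumes the counter did not go backwards
    return end - start;
}
```

The result still includes the cost of the fences themselves, so very short blocks will look slower than they are.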
AMD, however, makes no such promise for lfence. On AMD chips, lfence is not documented as waiting for in-flight instructions, nor as blocking execution of further instructions. AMD does, however, make this promise for mfence.
So, what to do? To cover both my bases, I could use back-to-back lfence; mfence and call it close enough, or I could do things the traditional way and stick with cpuid.
All of these instructions have intrinsics so that I can call them directly in C++; they’re _mm_lfence(), _mm_mfence(), __cpuid(__int32*, __int32), __cpuidex(__int32*, __int32, __int32), and __rdtsc().
But it’s here that Visual C++’s intrinsics are a little bit annoying. The intrinsics mostly map one to one with an actual instruction, but not quite. The
rdtsc intrinsic, for example, doesn’t just emit a
rdtsc. The code
unsigned __int64 counter = __rdtsc();
emits the sequence:
    // preserve values of rax and rdx, if necessary
    rdtsc
    shl rdx, 32
    or rax, rdx
    mov qword ptr &counter, rax
This usually makes sense; we want those values to be usable in regular C++, so of course they get written to C++ variables. The code for
__cpuid is similar; the values from the four 32-bit registers are copied to the array of four
__int32s passed as the first parameter:
    cpuid
    mov memory address of first integer in array, eax
    mov memory address of second integer in array, ebx
    mov memory address of third integer in array, ecx
    mov memory address of fourth integer in array, edx
Problem is, we don’t care about the value from
cpuid in this situation. We’re only interested in its side-effect as a serializing instruction. But the intrinsic means that we have to have those four move instructions no matter what.
What this means is that when we write:
    std::array<__int32, 4> unused;
    __cpuid(unused.data(), 0); // we only want this for serialization!
    unsigned __int64 counter = __rdtsc();
We get code like this:
    cpuid
    mov dword ptr &unused[0], eax
    mov dword ptr &unused[1], ebx
    mov dword ptr &unused[2], ecx
    mov dword ptr &unused[3], edx
    rdtsc
    shl rdx, 32
    or rax, rdx
    mov qword ptr &counter, rax
… and the processor is free to issue those four moves after the
rdtsc. Which is exactly the thing we were trying to avoid by using
cpuid in the first place. Now granted, two of those moves use values of registers that
rdtsc will overwrite, so I would imagine they’re less likely to get reordered. But the other two are fair game.
There is a second instruction that’s similar to rdtsc: rdtscp. rdtscp does everything rdtsc does, and more; it also writes the processor’s ID into the ecx register. Moreover, it’s semi-serializing. Instructions that come before the rdtscp in the instruction stream must be completed before the rdtscp executes. However, instructions that come after the rdtscp can still be moved ahead of it.
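As a sketch, an rdtscp read through the compiler intrinsic might be wrapped like this. TscSample, read_tscp and ticks_between are names I've invented; the snippet assumes a processor that actually supports rdtscp and a counter that is synchronized across cores.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>      // Visual C++: __rdtscp
#else
#include <x86intrin.h>   // gcc/clang: __rdtscp
#endif

// One rdtscp result: the 64-bit counter plus the value left in ecx.
struct TscSample {
    std::uint64_t ticks;
    std::uint32_t aux;   // processor ID, as written to ecx
};

// Hypothetical helper: reads the counter with rdtscp.
inline TscSample read_tscp() {
    unsigned int aux = 0;
    const std::uint64_t ticks = __rdtscp(&aux);  // the intrinsic stores ecx through the pointer
    return TscSample{ticks, static_cast<std::uint32_t>(aux)};
}

// Hypothetical helper: ticks between two samples.
inline std::uint64_t ticks_between(const TscSample& start, const TscSample& end) {
    assert(end.ticks >= start.ticks);  // assumes a synchronized, non-decreasing counter
    return end.ticks - start.ticks;
}
```

Comparing the aux fields of two samples is one way to notice that the thread moved to a different processor between the reads.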
As such, while it’s still useful to us, it doesn’t remove the need to use
cpuid. Moreover, the
rdtscp intrinsic has the same issue as the
cpuid intrinsic. The intrinsic is
__rdtscp(unsigned __int32* processor_id), and as you’d expect from
__cpuid(), it always emits a move instruction to store the value of
ecx in case we wanted it.
Overall, Intel’s recommended sequence of instructions is:
    cpuid                          // ensure that nothing earlier can be moved below the rdtsc
    rdtsc                          // immediately clobber the result of cpuid
    mov dword ptr &start_hi, edx
    mov dword ptr &start_lo, eax
    // code we want to time
    rdtscp
    mov dword ptr &end_hi, edx
    mov dword ptr &end_lo, eax
    cpuid                          // ensure that nothing later can be moved above the rdtscp
But our code using intrinsics:
    std::array<__int32, 4> unused;
    __cpuid(unused.data(), 0); // we only want this for serialization!
    unsigned __int64 start = __rdtsc();
    // code we want to time
    unsigned __int32 also_unused; // __rdtscp takes a pointer to an unsigned int
    unsigned __int64 end = __rdtscp(&also_unused);
    __cpuid(unused.data(), 0);
compiles to:
    cpuid
    mov dword ptr &unused[0], eax
    mov dword ptr &unused[1], ebx
    mov dword ptr &unused[2], ecx
    mov dword ptr &unused[3], edx
    rdtsc
    shl rdx, 32
    or rax, rdx
    mov qword ptr &start, rax
    // code we want to time
    rdtscp
    shl rdx, 32
    or rax, rdx
    mov qword ptr &end, rax
    mov dword ptr &also_unused, ecx
    cpuid
    mov dword ptr &unused[0], eax
    mov dword ptr &unused[1], ebx
    mov dword ptr &unused[2], ecx
    mov dword ptr &unused[3], edx
The last four moves don’t matter here, because they’re outside the timed section. The saving of the processor ID is also outside the timed section, but it’s why we don’t want to use
rdtscp for the start counter—if we did that, the save would then be inside the timed section.
What we really want is a way to simply ignore the result of the intrinsics; issue the
cpuid instruction but then completely ignore the values it writes to the registers. The intrinsics don’t do this, unfortunately, and the optimizer doesn’t notice that the writes are dead and can be eliminated (my guess is that the optimizer isn’t allowed to touch code generated by intrinsics, but maybe it’s not allowed to do that kind of dead write elimination anyway, I don’t know).
All of this is easy to do with inline assembler. Unfortunately, Visual C++ doesn’t support any inline assembler in 64-bit mode (though it continues to do so in 32-bit mode). gcc’s inline assembler is annoying, as it defaults to garbage AT&T syntax rather than superior Intel syntax (AT&T syntax is godawful for SIB addressing—seriously who could prefer
subl -0x20(%ebx,%ecx,0x4),%eax to
sub eax,[ebx+ecx*4h-20h]? Nobody, that’s who), but it lets us solve this kind of problem very easily:
    asm volatile(
        "cpuid\n\t"
        "rdtsc\n\t"
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        : "=r" (start_hi), "=r" (start_lo)  // output %0 is start_hi, output %1 is start_lo
        :                                   // no input
        : "%rax", "%rbx", "%rcx", "%rdx"    // clobbers rax, rbx, rcx, rdx
    );

    // code we want to time

    asm volatile(
        "rdtscp\n\t"
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        "cpuid\n\t"
        : "=r" (end_hi), "=r" (end_lo)      // output %0 is end_hi, output %1 is end_lo
        :                                   // no input
        : "%rax", "%rbx", "%rcx", "%rdx"    // clobbers rax, rbx, rcx, rdx
    );
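The two asm blocks leave each timestamp split into 32-bit halves, so the last step is to stitch them back together and subtract. A minimal sketch of that arithmetic, assuming start_hi, start_lo, end_hi and end_lo are 32-bit unsigned values (join_tsc and elapsed_ticks are hypothetical helper names):

```cpp
#include <cassert>
#include <cstdint>

// Joins the edx (high) and eax (low) halves of one counter read.
constexpr std::uint64_t join_tsc(std::uint32_t hi, std::uint32_t lo) {
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Hypothetical helper: ticks between the start and end reads.
inline std::uint64_t elapsed_ticks(std::uint32_t start_hi, std::uint32_t start_lo,
                                   std::uint32_t end_hi, std::uint32_t end_lo) {
    const std::uint64_t start = join_tsc(start_hi, start_lo);
    const std::uint64_t end = join_tsc(end_hi, end_lo);
    assert(end >= start);  // assumes the counter did not go backwards
    return end - start;
}
```

Doing the shift and or outside the timed region keeps the measured section down to the two mov instructions.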
But regrettably, I’m not willing to give up Visual C++ just to get inline assembler.
Sometimes, inline assembler could be replaced by writing a function in MASM and linking that separately. But that’s not a good fit here for much the same reason that the intrinsics are an issue; we don’t want to have to capture the overhead of a function call and its return when timing a piece of code. We want there to be as little extra “stuff” between the rdtsc and the rdtscp as we can possibly manage, and that means we want inline assembler.