Tag Archives: programming

Visual C++ intrinsics are not as good as inline assembler

I’ve used intrinsics to write some simple SIMD code for SSE2, and they’re pretty handy. They map pretty closely to the assembler output, and generally give enough control over the use of special instructions that inline assembler wouldn’t really be useful. If you need this kind of thing, intrinsics should be your first choice. But sometimes, I think inline assembler would be nice, because there are things that it does better than intrinsics, even when suitable intrinsics are available.

I’m writing some benchmarking code to measure some system properties. I’m trying to measure things like cache and memory latency, so I need a high precision timer to do this; we’re talking operations that take a few dozen processor cycles or so.

The natural choice for a timer is the processor timestamp counter. It runs at processor frequency and, usefully for these measurements, for processors that I care about the counter is synchronized across cores.

To read the counter, you issue the rdtsc instruction. It copies the 64-bit counter into two registers; the low bits go in eax, the high bits in edx. I don’t understand why x64 doesn’t just place all 64 bits into rax, but x86-64 is an extraordinarily lazy extension to x86 that is horrible in lots of places, so what difference does one more make, really?

So it should be simple; put a rdtsc at the start and end of the block of code I want to measure, then compare the two values to see how many cycles it all took.
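
In sketch form, using the __rdtsc() intrinsic from <intrin.h> that we’ll meet properly below, the naive version would be:

unsigned __int64 start = __rdtsc();
// the block of code we want to measure
unsigned __int64 end = __rdtsc();
unsigned __int64 cycles = end - start;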

Unfortunately, for reasons that aren’t at all clear to me as they appear to undermine the instruction’s main use case, rdtsc is not a serializing instruction. The processor’s out-of-order execution machinery can reorder instructions around rdtsc, so instructions that occur after rdtsc in program order can end up running before rdtsc in execution order, and likewise, instructions that occur before rdtsc in program order can end up running after rdtsc in execution order.

So if you’re trying to measure a block of instructions, this means that rdtsc might be run after the block of instructions I’m measuring has actually started, and similarly, rdtsc might be executed before the block has finished. Stupid.

The solution is to emit a serializing instruction before the rdtsc. A serializing instruction is one around which the out-of-order machinery can’t shuffle things: everything before it must complete before it runs, and nothing after it starts until it has finished. There’s no explicit serialize instruction, however; serialization is just a side-effect of certain other instructions. The big ones are those that change processor modes or mess with address spaces in one way or another; obviously speculative execution around these is going to be problematic, because they might change the meaning of speculatively decoded instructions or totally invalidate speculatively read memory.

Because these serializing instructions mostly do serious and important things to processor modes, they’re almost all restricted to ring 0. The kernel can mess with them, but regular user code cannot. As such, they’re useless for timing regular user mode routines.

There is, however, one exception: cpuid. The cpuid instruction is used to access all kinds of processor-specific information. cpuid’s main role is to list all the various instruction set extensions and capabilities that have been added over the years. You can also use it to ask the processor for its name/branding, the company that built it, a description of its cache topology, and a number of other things.

I don’t really know why cpuid is serializing. Perhaps Intel just decided “well, we need some serializing instruction that user mode code can run, and this one isn’t performance critical anyway”, or something. I assume it’s microcoded, and it can take literally hundreds of cycles to run, so perhaps its sheer length is what makes it serializing—it simply runs so much code that it floods the out-of-order buffers. I don’t know. If we were designing the ideal serializing instruction for use with rdtsc, it wouldn’t be cpuid; it writes to all four general-purpose 32-bit registers, eax, ebx, ecx, and edx. rdtsc overwrites eax and edx anyway, so if you needed particular values in those they would always have to be reloaded within the timed portion of your code; but ebx and ecx get trashed purely because cpuid screwed up your initial state.

Technically, we might have a couple of other options, but they’re problematic. Intel, in its instruction manual, says that lfence waits for all prior instructions to complete, and does not allow any subsequent instructions to start until the fence operation is complete. This ensures that all prior loads are completed and prevents speculative loads from later instructions (though stores may still be in flight). That’s generally good enough for rdtsc, and it avoids clobbering any registers, so Intel recommends the sequence lfence; rdtsc.

AMD, however, makes no such promise for lfence. On AMD chips, lfence is not documented as waiting for in-flight instructions, nor for blocking execution of further instructions. AMD does, however, make this promise for mfence.

So, what to do? To cover both vendors, I could use back-to-back lfence; mfence and call it close enough, or I could do things the traditional way and stick with cpuid.

All of these instructions have intrinsics so that I can call them directly in C++; they’re __cpuid(__int32[4], __int32)/__cpuidex(__int32[4], __int32, __int32), _mm_lfence(), and _mm_mfence().
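
As a sketch of the fence-based approach (a judgement call rather than a guarantee, for the reasons above), a counter read that tries to satisfy both vendors might look like this:

#include <intrin.h>

// lfence for Intel's documented behaviour, mfence for AMD's; unlike cpuid,
// neither fence clobbers any registers.
inline unsigned __int64 read_tsc_fenced()
{
    _mm_lfence();
    _mm_mfence();
    return __rdtsc();
}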

But it’s here that Visual C++’s intrinsics are a little bit annoying. The intrinsics mostly map one to one with an actual instruction, but not quite. The rdtsc intrinsic, for example, doesn’t just emit a rdtsc. The code

unsigned __int64 counter = __rdtsc();

emits the sequence:

preserve values of rax and rdx, if necessary
rdtsc
shl rdx, 32
or rax, rdx
mov qword ptr &counter, rax

This usually makes sense; we want those values to be usable in regular C++, so of course they get written to C++ variables. The code for __cpuid is similar; the values from the four 32-bit registers are copied to the array of four __int32s passed as the first parameter:

cpuid
mov memory address of first integer in array, eax
mov memory address of second integer in array, ebx
mov memory address of third integer in array, ecx
mov memory address of fourth integer in array, edx

Problem is, we don’t care about the value from cpuid in this situation. We’re only interested in its side-effect as a serializing instruction. But the intrinsic means that we have to have those four move instructions no matter what.

What this means is that when we write:

std::array<__int32, 4> unused;
__cpuid(unused.data(), 0); // we only want this for serialization!
unsigned __int64 counter = __rdtsc();

We get code like this:

cpuid
mov dword ptr &unused[0], eax
mov dword ptr &unused[1], ebx
mov dword ptr &unused[2], ecx
mov dword ptr &unused[3], edx
rdtsc
shl rdx, 32
or rax, rdx
mov counter, rax

… and the processor is free to issue those four moves after the rdtsc. Which is exactly the thing we were trying to avoid by using cpuid in the first place. Now granted, two of those moves use values of registers that rdtsc will overwrite, so I would imagine they’re less likely to get reordered. But the other two are fair game.

There is a second instruction that’s similar to rdtsc, called rdtscp. rdtscp does everything rdtsc does, and more; it also writes the processor’s ID into the ecx register. Moreover, it’s semi-serializing. Instructions that come before the rdtscp in the instruction stream must be completed before the rdtscp executes. However, instructions after the rdtscp can still be moved ahead of it.

As such, while it’s still useful to us, it doesn’t remove the need to use cpuid. Moreover, the rdtscp intrinsic has the same issue as the cpuid intrinsic. The intrinsic is __rdtscp(unsigned __int32* processor_id), and as you’d expect from __cpuid(), it always emits a move instruction to store the value of ecx in case we wanted it.

Overall, Intel’s recommended sequence of instructions is:

cpuid // ensure that nothing earlier can be moved below the rdtsc
rdtsc // immediately clobber the result of cpuid
mov dword ptr &start_hi, edx
mov dword ptr &start_lo, eax
code we want to time
rdtscp
mov dword ptr &end_hi, edx
mov dword ptr &end_lo, eax
cpuid // ensure that nothing later can be moved above the rdtscp

But our code using intrinsics:

std::array<__int32, 4> unused;
__cpuid(unused.data(), 0); // we only want this for serialization!
unsigned __int64 start = __rdtsc();
// code we want to time
unsigned __int32 also_unused;
unsigned __int64 end = __rdtscp(&also_unused);
__cpuid(unused.data(), 0);

generates:

cpuid
mov dword ptr &unused[0], eax
mov dword ptr &unused[1], ebx
mov dword ptr &unused[2], ecx
mov dword ptr &unused[3], edx
rdtsc
shl rdx, 32
or rax, rdx
mov qword ptr &start, rax
code we want to time
rdtscp
shl rdx, 32
or rax, rdx
mov qword ptr &end, rax
mov dword ptr &also_unused, ecx
cpuid
mov dword ptr &unused[0], eax
mov dword ptr &unused[1], ebx
mov dword ptr &unused[2], ecx
mov dword ptr &unused[3], edx

The last four moves don’t matter here, because they’re outside the timed section. The saving of the processor ID is also outside the timed section, but it’s why we don’t want to use rdtscp for the start counter—if we did that, the save would then be inside the timed section.

What we really want is a way to simply ignore the result of the intrinsics; issue the cpuid instruction but then completely ignore the values it writes to the registers. The intrinsics don’t do this, unfortunately, and the optimizer doesn’t notice that the writes are dead and can be eliminated (my guess is that the optimizer isn’t allowed to touch code generated by intrinsics, but maybe it’s not allowed to do that kind of dead write elimination anyway, I don’t know).

All of this is easy to do with inline assembler. Unfortunately, Visual C++ doesn’t support any inline assembler in 64-bit mode (though it continues to do so in 32-bit mode). gcc’s inline assembler is annoying, as it defaults to garbage AT&T syntax rather than superior Intel syntax (AT&T syntax is godawful for SIB addressing—seriously who could prefer subl -0x20(%ebx,%ecx,0x4),%eax to sub eax,[ebx+ecx*4h-20h]? Nobody, that’s who), but it lets us solve this kind of problem very easily:

asm volatile(
    "cpuid\n\t"
    "rdtsc\n\t"
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    : "=r" (start_hi), "=r" (start_lo) // output %0 is &start_hi, output %1 is &start_lo
    :                                  // no input
    : "%rax", "%rbx", "%rcx", "%rdx"   // clobbers rax, rbx, rcx, rdx
);
// code we want to time
asm volatile(
    "rdtscp\n\t"
    "mov %%edx, %0\n\t"
    "mov %%eax, %1\n\t"
    "cpuid\n\t"
    : "=r" (end_hi), "=r" (end_lo)   // output %0 is &end_hi, output %1 is &end_lo
    :                                // no input
    : "%rax", "%rbx", "%rcx", "%rdx" // clobbers rax, rbx, rcx, rdx
);
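
From there, reassembling the halves into 64-bit counts is straightforward (assuming start_hi, start_lo, end_hi, and end_lo are 32-bit unsigned integers):

unsigned long long start  = ((unsigned long long)start_hi << 32) | start_lo;
unsigned long long end    = ((unsigned long long)end_hi << 32) | end_lo;
unsigned long long cycles = end - start;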

But regrettably, I’m not willing to give up Visual C++ just to get inline assembler.

Sometimes, inline assembler could be replaced by writing a function in MASM and linking that separately. But that’s not a good fit here for much the same reason that the intrinsics are an issue; we don’t want to have to capture the overhead of a function call and its return when timing a piece of code. We want there to be as little extra “stuff” between the rdtsc and rdtscp as we can possibly manage, and that means we want inline assembler.

Some notes on reinstating shadow copies on Windows 8

Most of the time, the thing that people need from a backup system isn’t disaster recovery. It’s accident reversal. Yes, it’s important to have backups stored on separate disks, ideally at separate locations, to handle hardware failures. But that’s not the most common data loss scenario, just the most severe. Far more common is “whoops, I didn’t mean to save over that file”.

Shadow Copies are an excellent solution for this common problem. When Shadow Copies are enabled, Windows periodically creates snapshots of the disk using the Volume Snapshot Service (VSS). As long as the snapshot exists, you can get a view of the disk as it existed at the time of the snapshot. For Shadow Copies, the operating system reserves a chunk of disk space (which is configurable) for snapshot storage. Once this disk space is full, snapshots start getting purged, starting with the oldest.

How long a snapshot lasts depends on the amount of disk space you assign to the shadow storage and the amount of churn your disk experiences. Every modification made after the snapshot causes the snapshot delta file to grow larger, so disks with lots of writes will tend to burn through their snapshots quite quickly. Disks with less write activity will tend to retain a longer history.

Server versions of Windows, since Windows Server 2003, have had Shadow Copies as a supported feature with a user-visible interface. Shadow Copy configuration is done in Computer Management\System Tools\Shared Folders, where you can configure which volumes get snapshotted, how much storage is assigned to snapshots, and how often the snapshots are made. Shadow Copies are off by default; turn them on, and their schedule is organized around the working week, with two snapshots per day Monday to Friday, at 0700 hrs and 1200 hrs.

Browsing and restoring from Shadow Copies is done through Explorer. The Properties dialog of any folder protected by Shadow Copies has a Previous Versions tab, which allows you to pick an earlier version and restore from it.

Windows Vista and Windows 7 incorporate most of the machinery used to support Shadow Copies, and in these operating systems too, it’s a supported feature. They lack much of the configuration user interface, however. The operating system does make periodic snapshots, but these are primarily created for the purpose of System Restore; a new snapshot is made at midnight each night, by default.

Whether server or workstation, all versions of Windows have essentially the same underlying functionality. The behind-the-scenes components—the Volume Snapshot Service, or VSS, the filesystem drivers, the plugins from apps like SQL Server (so that they can make sure that their files are consistent before a snapshot is preserved)—are all the same.

This means that Windows 8 is perfectly capable of making periodic snapshots. What it’s lacking is some of the user interface components, both in the GUI and at the command-line. As with the underlying parts, the user interface parts are all there; it’s just that they’re deliberately crippled.

There are three important command-line tools for manipulating VSS: vssadmin.exe, diskshadow.exe, and vshadow.exe. vssadmin.exe is found on both server and desktop Windows. On the desktop, it can be used for querying VSS (enumerating disks with snapshot storage, stored snapshots, VSS plugins, and so on), deleting shadow copies, and changing the amount of storage dedicated to snapshot storage. On the server, it gains the all-important ability to create snapshots and revert the disk to a previous version (equivalent to System Restore).

diskshadow.exe is found only on server versions. It’s modelled after tools like diskpart.exe, providing a kind of interactive shell. It gives fairly extensive control over the creation of shadow copies, including the creation of multi-volume snapshots, exposing existing shadows as drive letters or mount points, and certain more advanced tasks. It’s quite useful to have, but it’s not required for basic shadow copy functionality.

vshadow.exe is found in the Windows SDK. It’s notionally a “sample” client for VSS to showcase its features. It is, in practice, a fairly fully-featured front-end to VSS.

Because it’s found in all versions of Windows already, vssadmin.exe is the best place to start with shadow copies. Not only is it found in all versions of Windows, it’s also the program used to actually create shadow copies on server versions of Windows; it’s simply executed as a scheduled task. Although using vshadow.exe would work, making vssadmin.exe do the right thing also means that we don’t need to install the SDK onto systems just to get working shadow copies.

There’s just one small problem with that. As mentioned, its reported capabilities are different on desktop and server. However, a quick check of the executable itself shows that the program is the same regardless of which operating system variant it’s being used with.

On desktop, we see the reduced feature set:

C:\>vssadmin
vssadmin 1.1 - Volume Shadow Copy Service administrative command-line tool
(C) Copyright 2001-2012 Microsoft Corp.

Error: Invalid command.

---- Commands Supported ----

Delete Shadows        - Delete volume shadow copies
List Providers        - List registered volume shadow copy providers
List Shadows          - List existing volume shadow copies
List ShadowStorage    - List volume shadow copy storage associations
List Volumes          - List volumes eligible for shadow copies
List Writers          - List subscribed volume shadow copy writers
Resize ShadowStorage  - Resize a volume shadow copy storage association

On a server:

C:\>vssadmin
vssadmin 1.1 - Volume Shadow Copy Service administrative command-line tool
(C) Copyright 2001-2012 Microsoft Corp.

Error: Invalid command.

---- Commands Supported ----

Add ShadowStorage     - Add a new volume shadow copy storage association
Create Shadow         - Create a new volume shadow copy
Delete Shadows        - Delete volume shadow copies
Delete ShadowStorage  - Delete volume shadow copy storage associations
List Providers        - List registered volume shadow copy providers
List Shadows          - List existing volume shadow copies
List ShadowStorage    - List volume shadow copy storage associations
List Volumes          - List volumes eligible for shadow copies
List Writers          - List subscribed volume shadow copy writers
Resize ShadowStorage  - Resize a volume shadow copy storage association
Revert Shadow         - Revert a volume to a shadow copy
Query Reverts         - Query the progress of in-progress revert operations.

With identical binaries on both systems, it’s clear that it must be doing some kind of runtime operating system detection to figure out which features to offer.

There are a few different ways for a Windows program to figure out what operating system it’s running on. To see which one vssadmin.exe uses, I turned to the very useful API Monitor by Rohitab Batra. Sure enough, early in the program’s execution was a call to GetVersionExW(), one of the APIs for figuring out what version of Windows is being used.

I then wanted to verify how the API was being used: which information was read from it, and what was done with that information. To do this, I used Visual Studio 2012. I created a new blank solution, then added an existing project, using the vssadmin.exe executable itself as the project file. This rather obscure operation lets you debug an existing binary. I also made sure to turn on symbol server support, so that public symbols for Windows programs would be downloaded from Microsoft’s servers.

Right-click the “project”, choose Debug… Start new instance, and Visual Studio loads the program and stops at the entry point. With the program suspended, we can then set a breakpoint on GetVersionExW() to see the context in which the API is called.

The call stack shows that the call is made from a function CVssSKU::Initialize(). This is a fairly short and simple function:

000007F60C71B9FC  push        rbx  
000007F60C71B9FE  sub         rsp,150h  
000007F60C71BA05  mov         rax,qword ptr [__security_cookie (07F60C723110h)]  
000007F60C71BA0C  xor         rax,rsp  
000007F60C71BA0F  mov         qword ptr [rsp+140h],rax  
000007F60C71BA17  xor         ebx,ebx  
000007F60C71BA19  cmp         dword ptr [CVssSKU::ms_bInitialized (07F60C723814h)],ebx  
000007F60C71BA1F  jne         CVssSKU::Initialize+8Dh (07F60C71BA89h)  
000007F60C71BA21  lea         rcx,[rsp+20h]  
000007F60C71BA26  mov         dword ptr [rsp+20h],11Ch  
000007F60C71BA2E  call        qword ptr [__imp_GetVersionExW (07F60C725788h)]  
000007F60C71BA34  mov         al,byte ptr [rsp+13Ah]  
000007F60C71BA3B  lea         r11d,[rbx+1]  
000007F60C71BA3F  cmp         al,2  
000007F60C71BA41  je          CVssSKU::Initialize+59h (07F60C71BA55h)  
000007F60C71BA43  cmp         al,3  
000007F60C71BA45  je          CVssSKU::Initialize+59h (07F60C71BA55h)  
000007F60C71BA47  cmp         al,r11b  
000007F60C71BA4A  sete        bl  
000007F60C71BA4D  mov         dword ptr [CVssSKU::ms_eSKU (07F60C723818h)],ebx  
000007F60C71BA53  jmp         CVssSKU::Initialize+86h (07F60C71BA82h)  
000007F60C71BA55  test        byte ptr [rsp+138h],40h  
000007F60C71BA5D  je          CVssSKU::Initialize+75h (07F60C71BA71h)  
000007F60C71BA5F  mov         dword ptr [CVssSKU::ms_eSKU (07F60C723818h)],4  
000007F60C71BA69  mov         dword ptr [CVssSKU::ms_bTransportableShadowsAllowed (07F60C72381Ch)],ebx  
000007F60C71BA6F  jmp         CVssSKU::Initialize+86h (07F60C71BA82h)  
000007F60C71BA71  mov         dword ptr [CVssSKU::ms_eSKU (07F60C723818h)],2  
000007F60C71BA7B  mov         dword ptr [CVssSKU::ms_bTransportableShadowsAllowed (07F60C72381Ch)],r11d  
000007F60C71BA82  mov         dword ptr [CVssSKU::ms_bInitialized (07F60C723814h)],r11d  
000007F60C71BA89  mov         rcx,qword ptr [rsp+140h]  
000007F60C71BA91  xor         rcx,rsp  
000007F60C71BA94  call        __security_check_cookie (07F60C71D5E0h)  
000007F60C71BA99  add         rsp,150h  
000007F60C71BAA0  pop         rbx  
000007F60C71BAA1  ret  

From here there are a few important details. First, the structure is on the stack, at an offset of 0x20 from rsp, and second, its field dwOSVersionInfoSize is initialized to 0x11c (284 bytes):

000007F60C71BA26  mov         dword ptr [rsp+20h],11Ch  

This means that the program is using the larger, more detailed OSVERSIONINFOEXW structure.

Third, after the call is made (with no check of the return value), one value is read from the structure:

000007F60C71BA34  mov         al,byte ptr [rsp+13Ah]  

This is the field at offset 0x11a within the structure itself, which is:

BYTE  wProductType;

There are three possible values for this field. If it contains 2 or 3, then the Windows version is Domain Controller or Server, respectively. If it contains 1, then it’s workstation/desktop Windows.
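
For reference, the same check written against the documented structure looks like this (a sketch; VER_NT_WORKSTATION, VER_NT_DOMAIN_CONTROLLER, and VER_NT_SERVER are the symbolic constants for 1, 2, and 3):

#include <Windows.h>

bool running_on_server()
{
    OSVERSIONINFOEXW info = { sizeof(OSVERSIONINFOEXW) };
    ::GetVersionExW(reinterpret_cast<OSVERSIONINFOW*>(&info));
    return info.wProductType == VER_NT_DOMAIN_CONTROLLER  // 2
        || info.wProductType == VER_NT_SERVER;            // 3
}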

The function treats 2 and 3 identically:

000007F60C71BA3F  cmp         al,2  
000007F60C71BA41  je          CVssSKU::Initialize+59h (07F60C71BA55h)  
000007F60C71BA43  cmp         al,3  
000007F60C71BA45  je          CVssSKU::Initialize+59h (07F60C71BA55h)

If neither test branch is taken, it falls through to the workstation case. We’ll consider that case first. There may be some nuance to this that I don’t understand (I find reading assembly troublesome at the best of times). The line

000007F60C71BA17  xor         ebx,ebx  

zeroes out the ebx register (and, since writing to a 32-bit register clears the upper half on x64, the whole of rbx). The line

000007F60C71BA3B  lea         r11d,[rbx+1]  

sets the bottom 32 bits of the r11 register to the value of rbx + 1 (i.e. 1). Why it does it this way I don’t understand. rbx is a preserved register, so its value survives the GetVersionExW() call, and it was zeroed just before that call; r11d must therefore always end up as a constant 1, so why not just set it to a constant 1 directly? Indeed, why set r11 to anything at all—why not continue to use rbx?

Anyway, the bottom 8 bits of r11 are then compared with the wProductType:

000007F60C71BA47  cmp         al,r11b  

If they’re equal, the bottom 8 bits of rbx are set to 1; otherwise they’re set to 0:

000007F60C71BA4A  sete        bl  

This value is then stored for subsequent lookups:

000007F60C71BA4D  mov         dword ptr [CVssSKU::ms_eSKU (07F60C723818h)],ebx  

Note how the public symbols are helpful here; it is fairly clear that CVssSKU::ms_eSKU is used as a record of which Windows SKU is being used.

The code then jumps to the common tail of the function

000007F60C71BA53  jmp         CVssSKU::Initialize+86h (07F60C71BA82h)  

which sets a flag to indicate that the initialization has been called

000007F60C71BA82  mov         dword ptr [CVssSKU::ms_bInitialized (07F60C723814h)],r11d  

and then does the stack cookie check and exits

000007F60C71BA89  mov         rcx,qword ptr [rsp+140h]  
000007F60C71BA91  xor         rcx,rsp  
000007F60C71BA94  call        __security_check_cookie (07F60C71D5E0h)  
000007F60C71BA99  add         rsp,150h  
000007F60C71BAA0  pop         rbx  
000007F60C71BAA1  ret  

So that’s what’s happening on desktop Windows. How about the servers?

In that case, a second test is performed.

000007F60C71BA55  test        byte ptr [rsp+138h],40h  

This examines the byte at offset 0x118 in the structure and does a bitwise AND with 0x40. This is the first byte of

  WORD  wSuiteMask;

Because x64 is a little-endian architecture, the first byte of the 16-bit word is the bottom 8 bits. For some reason I cannot fathom, this appears to be a test of whether VER_SUITE_EMBEDDEDNT is specified.

If the result of the bitwise AND is zero then the ZF flag is set, and the je jump is taken. If the branch is not taken (i.e. if the operating system reports itself as Embedded NT) then

000007F60C71BA5F  mov         dword ptr [CVssSKU::ms_eSKU (07F60C723818h)],4  
000007F60C71BA69  mov         dword ptr [CVssSKU::ms_bTransportableShadowsAllowed (07F60C72381Ch)],ebx  

The SKU is set to 4, and the value 0 (remembering that in this branch the rbx register is never modified after being initially zeroed) is stored in the CVssSKU::ms_bTransportableShadowsAllowed variable to denote that transportable shadow copies aren’t permitted.

Transportable shadow copies are shadow copies with extra metadata preserved in Windows Cabinet (.cab) files so that they can be mounted on different systems.

It then jumps to the common tail, as before

000007F60C71BA6F  jmp         CVssSKU::Initialize+86h (07F60C71BA82h)  

Otherwise, for non-embedded NT, we have

000007F60C71BA71  mov         dword ptr [CVssSKU::ms_eSKU (07F60C723818h)],2  
000007F60C71BA7B  mov         dword ptr [CVssSKU::ms_bTransportableShadowsAllowed (07F60C72381Ch)],r11d  

The SKU is set to 2, and transportable shadow copies are permitted.

These various stored values (CVssSKU::ms_eSKU and CVssSKU::ms_bTransportableShadowsAllowed) are then examined at various places in vssadmin.exe. It doesn’t appear to perform any other interrogation or examination of the underlying operating system or its capabilities; this check is it.

To trick vssadmin.exe into giving us the full set of server features, we need only change the value of wProductType, replacing a 1 with a 2 or a 3. 3 makes the most sense, as obviously Windows workstations are not domain controllers. If we change the value in the debugger, lo and behold, the full set of options and features are shown in the help text, and the extra, hidden commands spring into life.

Obviously, stopping the program in the debugger and changing values in memory is not particularly convenient. What we really want to do is to hook the call to GetVersionExW() and replace the value automatically.

The standard technique for doing this on Windows is to spawn the process, inject a DLL into the process, and use that DLL to modify functions appropriately. It’s a time-honoured and trusted technique, it’s robust and well-understood, and so it’s what we’ll do here. First things first: DLL injection.

This is actually pretty simple. The outline is:

  1. Start the process in the suspended state.
  2. Allocate some memory within the process to store parameters, with VirtualAllocEx().
  3. Copy the parameters to the memory, with WriteProcessMemory().
  4. Use CreateRemoteThread() to start a thread within the process, with LoadLibrary() as the thread’s function, and the allocated memory as the thread’s parameter.
  5. Do the actual work from the DLL’s DllMain().

We can do this because LoadLibrary() essentially has a type compatible with ThreadProc() (the type of thread functions). The types are

HMODULE WINAPI LoadLibraryW(const wchar_t* lpFileName);

and

DWORD WINAPI ThreadProc(void* lpParameter);

There is one minor difference—the width of the return value (pointer-sized for LoadLibrary(), 32-bit for ThreadProc())—but it’s immaterial, as return values are passed in the rax register anyway, so there’s no practical difference between 32- and 64-bit integers.

The code to do this is pretty simple. It looks something like this:

#define _CRT_SECURE_NO_WARNINGS 1
#define NOMINMAX
#define STRICT

#include <SDKDDKVer.h>
#include <Windows.h>

#include <memory>
#include <cstring>

int main() {
    const wchar_t* vssadmin_path =  L"%systemroot%\\system32\\vssadmin.exe";
    DWORD buffer_size = ::ExpandEnvironmentStringsW(vssadmin_path, nullptr, 0);
    std::unique_ptr<wchar_t[]> vssadmin(new wchar_t[buffer_size]);
    ::ExpandEnvironmentStringsW(vssadmin_path, vssadmin.get(), buffer_size);

    ::STARTUPINFOW si = { sizeof(STARTUPINFOW) };
    ::PROCESS_INFORMATION pi = { 0 };
    ::CreateProcessW(vssadmin.get(), ::GetCommandLineW(), nullptr, nullptr, FALSE, CREATE_SUSPENDED | CREATE_PRESERVE_CODE_AUTHZ_LEVEL | INHERIT_PARENT_AFFINITY, nullptr, nullptr, &si, &pi);

    // we want to load the DLL from the directory with the wrapper program in it, not from vssadmin.exe's directory
    // so we need to provide the full path to the DLL
    const wchar_t* dll_base_name = L"patcher.dll";
    wchar_t dll_name[MAX_PATH] = {0}; // HATE
    ::GetModuleFileNameW(nullptr, dll_name, sizeof(dll_name) / sizeof(*dll_name));
    std::wcscpy(std::wcsrchr(dll_name, L'\\') + 1, dll_base_name);

    void* target_memory = ::VirtualAllocEx(pi.hProcess, nullptr, sizeof(dll_name), MEM_COMMIT, PAGE_READWRITE);
    SIZE_T bytes_written = 0;
    ::WriteProcessMemory(pi.hProcess, target_memory, dll_name, sizeof(dll_name), &bytes_written);
    HANDLE remote_thread = ::CreateRemoteThread(pi.hProcess, nullptr, 0, reinterpret_cast<PTHREAD_START_ROUTINE>(&LoadLibraryW), target_memory, 0, nullptr);
    ::WaitForSingleObject(remote_thread, INFINITE);
    ::CloseHandle(remote_thread);
    ::ResumeThread(pi.hThread);
    ::WaitForSingleObject(pi.hProcess, INFINITE);
    DWORD exit_code = 0;
    ::GetExitCodeProcess(pi.hProcess, &exit_code);
    ::CloseHandle(pi.hThread);
    ::CloseHandle(pi.hProcess);
    return exit_code;
}

That lets us get code running inside the vssadmin.exe process. But what should that code do?

There are two ways of using DLLs in windows. Although all DLLs are “dynamically linked”, there’s dynamic and there’s dynamic. In the really dynamic case, you look up the name of the function you want to use with GetProcAddress() at runtime. This allows for things like loading DLLs supplied by a user (such as plugins) or tentative loading of functions that may or may not be present.

However, most Windows system DLLs aren’t used this way at all. They use load-time dynamic linking. Each executable embeds a data structure called the import address table (IAT). This table lists the names of all the DLLs that the executable uses, the names of all the functions in each DLL that the executable uses, and the address of each of those functions. The executable itself just contains null pointers for the addresses. When it loads the executable, Windows replaces all those null pointers with the actual addresses required.

vssadmin.exe’s use of GetVersionExW() uses this load-time dynamic linking. The vssadmin.exe executable has an IAT, and that IAT includes the import of the function GetVersionExW() from kernel32.dll.

To hook the function, what our DLL needs to do is to examine vssadmin.exe’s IAT, find the specific entry we’re interested in, and replace the address with an address of a function that we control.

We don’t need to go into the finer details of the PE format. The format is documented by Microsoft, and there’s lots of code out there to do this kind of thing. Here’s mine:

#define _CRT_SECURE_NO_WARNINGS 1
#define NOMINMAX
#define STRICT

#include <SDKDDKVer.h>
#include <Windows.h>

#define MakePtr(cast, base, offset) reinterpret_cast<cast>(reinterpret_cast<size_t>(base) + static_cast<size_t>(offset))

PROC hook_iat(HMODULE importing_module, const char* exporting_module, PCSTR function_name, PROC hooking_proc)
{
    if(!importing_module) {
        return nullptr;
    }

    PROC original_proc = ::GetProcAddress(::GetModuleHandleA(exporting_module), function_name);
    if(!original_proc) {
        return nullptr;
    }

    if(::IsBadCodePtr(hooking_proc)) {
        return nullptr;
    }

    IMAGE_DOS_HEADER* dos_header = reinterpret_cast<IMAGE_DOS_HEADER*>(importing_module);
    if(::IsBadReadPtr(dos_header, sizeof(IMAGE_DOS_HEADER)) || dos_header->e_magic != IMAGE_DOS_SIGNATURE) {
        return nullptr;
    }

    IMAGE_NT_HEADERS* pe_header = MakePtr(IMAGE_NT_HEADERS*, dos_header, dos_header->e_lfanew);
    if(::IsBadReadPtr(pe_header, sizeof(IMAGE_NT_HEADERS)) || pe_header->Signature != IMAGE_NT_SIGNATURE) {
        return nullptr;
    }

    if(pe_header->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT].VirtualAddress == 0) {
        return nullptr;
    }

    for(IMAGE_IMPORT_DESCRIPTOR* import_descriptor = MakePtr(IMAGE_IMPORT_DESCRIPTOR*, dos_header, pe_header->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT].VirtualAddress);
        import_descriptor->Name;
        ++import_descriptor) {
        if(_stricmp(MakePtr(const char*, dos_header, import_descriptor->Name), exporting_module) == 0) {
            for(IMAGE_THUNK_DATA* thunk = MakePtr(IMAGE_THUNK_DATA*, dos_header, import_descriptor->FirstThunk);
                thunk->u1.Function;
                ++thunk) {

                if(thunk->u1.Function == reinterpret_cast<size_t>(original_proc)) {
                    DWORD original_protection = 0;
                    ::VirtualProtect(&thunk->u1.Function, sizeof(&thunk->u1.Function), PAGE_READWRITE, &original_protection);
                    thunk->u1.Function = reinterpret_cast<size_t>(hooking_proc);
                    DWORD ignored = 0;
                    ::VirtualProtect(&thunk->u1.Function, sizeof(&thunk->u1.Function), original_protection, &ignored);
                    return original_proc;
                }
            }
        }
    }
    return nullptr; // Function not found
}

typedef BOOL (WINAPI* gve)(OSVERSIONINFOW*);

gve GetVersionExOriginal;

BOOL WINAPI GetVersionExForcedServer(OSVERSIONINFOW* version_info) {
    BOOL return_value = GetVersionExOriginal(version_info);
    switch(version_info->dwOSVersionInfoSize) {
    case sizeof(OSVERSIONINFOW):
        break;
    case sizeof(OSVERSIONINFOEXW): {
            OSVERSIONINFOEXW* version_info_ex = reinterpret_cast<OSVERSIONINFOEXW*>(version_info);
            if(version_info_ex->wProductType == VER_NT_WORKSTATION) {
                version_info_ex->wProductType = VER_NT_SERVER;
            }
        }
        break;
    }
    return return_value;
}

BOOL APIENTRY DllMain(HMODULE module, DWORD reason, void* reserved) {
    switch(reason) {
    case DLL_PROCESS_ATTACH: {
            GetVersionExOriginal = reinterpret_cast<gve>(hook_iat(::GetModuleHandleW(nullptr), "kernel32.dll", "GetVersionExW", reinterpret_cast<PROC>(&GetVersionExForcedServer)));
        }
        break;
    case DLL_THREAD_ATTACH:
    case DLL_THREAD_DETACH:
    case DLL_PROCESS_DETACH: {
            
        }
        break;
    }
    return TRUE;
}

Compile all of that and run the resulting program from an elevated command prompt. It’ll give you the full set of vssadmin.exe options, including the important create shadow command used to actually create shadow copy snapshots.

From here it’s just a matter of creating a scheduled task to run the wrapper program periodically.

Grand Central Dispatch for Win32: things still to do

So, the libdispatch port I’ve been working on is currently quite rough and ready. The major parts all seem to work (though I still need to migrate all the tests), but there’s one significant piece missing: the main queue.

Cocoa is, for the most part, single-threaded; updates to a window must be performed on the thread that owns that window. The same is true of WPF, Win32, and others. However, Cocoa takes things a little further than Win32. In Win32 all threads are essentially created equal, and any thread is allowed to create windows and pump messages. It’s an M:N system: many threads, each potentially owning many windows. Windows still have thread affinity—any given window must only be updated by the thread running its message loop—but there can be multiple loops on multiple threads, each with their own set of windows.

Not so in Cocoa. The main thread, that is, the one that literally runs main(), is special. All windows must have their message loops run on this thread, and all updates must be funnelled through this thread.

As I wrote in the post outlining why I want Grand Central Dispatch, the ability for secondary (worker) threads to run code from the window’s owning thread is highly desirable. In Cocoa, that means running code on the main thread, and so that’s what libdispatch enables.

Corresponding to the main thread, libdispatch creates a main queue. Since there’s only ever one main thread (used for every window), there’s only ever one main queue. Any callbacks placed on the main queue will eventually be executed by the main thread.

Creating a serial queue and enqueuing messages is easy enough; the hard part is responding to those messages, and it’s here that libdispatch gets a bit tricky. In truth, I’m a little hazy on some of the details, because not all the plumbing is found in libdispatch; there’s also a Cocoa-side integration that I don’t think is public (if it is, I don’t know where the source is).

libdispatch has two different ways of draining the main queue: a last-ditch automatic mechanism that ensures the right thing happens even when Cocoa isn’t actually involved, and a better way that integrates properly with Cocoa. The automatic mechanism leverages pthreads’ TLS destructors. A particular TLS key has a destructor that, when invoked from the main thread, will drain the main queue. Drop off the end of the main thread, either just by returning from main() or by calling dispatch_main() (which in turn calls pthread_exit()), and the destructor will be called, draining the queue.

It’s to replicate this mechanism that I investigated the feasibility of implementing TLS destructors in Win32. The implementation kinda works, but annoyingly the TLS destructor is called so late in the thread’s tear-down that it’s basically not safe to do anything, especially not make arbitrary function calls in user-supplied callbacks. Unless I can find some way of resolving this, I’ll need to find some other approach. I think the DLL notifications happen at a better time, but I really want the convenience of a static library.

This is a little annoying. Though a single main thread/main queue isn’t a natural fit for Windows, I could have created a queue per thread and used the same “drain this thread’s queue when the thread is torn down” approach. One workaround that may be effective is to give up on automatic queue draining when returning from main() and instead require dispatch_main() to be called explicitly. This would probably be good enough.

The second mechanism, which is much better as it doesn’t require ending the main thread, is the one I’m a bit less clear about. The key function here is _dispatch_main_queue_callback_4CF(). This function gets called from Cocoa’s message loop, and it drains any messages placed on the main queue, before returning control back to the message loop.

This approach should be much easier to integrate, since it doesn’t depend on any special behaviour of threads or destructors or anything; it’s just a regular function call. Every time something is put onto the main queue it alerts Cocoa (_dispatch_queue_wakeup_main()), and Cocoa then drains the queue. All easy enough to translate into Win32.

However, it’s not quite as simple as that, because of the threading model Windows uses. There is no longer a single “main” queue. Any thread with a message loop will have to have its own queue, and the special alerting behaviour will need to take this into account. It will also have to ensure that it alerts the right thread. This will mean altering the queue objects to include an indication of whether they’re a “special” thread queue—that is, one drained from a user thread rather than a pthread_workqueue thread—and, if so, which thread they actually belong to, so that the right thread can be alerted.

There will also need to be some way of accessing these special queues (so that callbacks can be placed on them), so some kind of HWND-to-queue and possibly thread ID or HANDLE-to-queue lookup functions will be necessary.
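
As a very rough sketch of how the per-thread integration might look on the Win32 side (the message constant and the dispatch_drain_thread_queue() function are names I’ve invented here; they’re not part of libdispatch or the port):

#include <Windows.h>

// Hypothetical plumbing: enqueuing onto a thread's queue would post this message to
// the owning thread, and that thread's message loop drains the queue in response.
extern "C" void dispatch_drain_thread_queue(void);
const UINT WM_DISPATCH_WAKEUP = WM_APP + 1;

int run_message_loop()
{
    MSG msg = {};
    while(::GetMessageW(&msg, nullptr, 0, 0) > 0) {
        if(msg.hwnd == nullptr && msg.message == WM_DISPATCH_WAKEUP) {
            dispatch_drain_thread_queue();
            continue;
        }
        ::TranslateMessage(&msg);
        ::DispatchMessageW(&msg);
    }
    return static_cast<int>(msg.wParam);
}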

As luck would have it, the libdispatch test cases all depend on the main queue anyway, so before I can readily port the tests, I’m going to have to put something together to address this need.

Grand Central Dispatch for Win32: the source code

Having explained why I want to port Grand Central Dispatch to Windows and outlined some of the issues in doing so, it’s probably a good time to show some source code!

I’ve put the code into github. I’m not sure I’m entirely enamoured of github, or git in general, and I’m not even sure that I’ve pulled Apple’s source in the best possible way (I’m using the subtree merge approach instead of submodules, but am vague on the pros and cons of each mechanism).

github will tell you about all the modifications if you’re interested, but it’s probably worth mentioning a few things explicitly. I’ve attempted to change as little as possible, with the proviso that the thing has gotta build and at least give the appearance of working. The only file with wholesale changes is queue_kevent.c, which effectively has two wholly independent implementations, one using kevent(), the other using I/O completion ports.

The most disruptive modification to the source tree was the creation of the /platform hierarchy. This is where I put the Win32 stubs for UNIX headers, including the pthread_workqueue implementation. The implementation is fairly straightforward. I’ve implemented more than is strictly necessary for libdispatch—but not everything. Some concepts, such as workqueue suspension and resumption, have no obvious parallels in Win32.

I should note that I used new-style Win32 threadpools, available in Windows Vista and up. This means the code won’t work on Windows XP. The reasons for picking the new API are multiple:

  1. It can be used robustly, whereas the old one cannot; the old one provides no way of properly handling out-of-resource situations.
  2. It allows multiple pools per process, which allows libdispatch’s pools to be relatively isolated from any others that the application might create. This seems to reduce the possibility of surprises.
  3. The old threadpool API lacks any effective way of tidying up; in particular, callbacks cannot safely perform tasks such as unloading the DLL they are running from, and there is no way to ensure that every queued callback is either executed or safely discarded.
  4. There did not seem any obvious way to implement e.g. pthread_workqueue_removeitem_np using old-style threadpools.
  5. Timer queue timers have no leeway facility.

Honestly, in this day and age, nobody should be using Windows XP. Windows Vista and Windows 7, which support the new API, are both substantial improvements on that operating system.

One of the most pervasive changes (annoyingly so, it’ll clutter up any diffs) was the insertion of the function as_do() (as in, interpret this object “as a DO” (dispatch object)). This is because Visual C++ doesn’t support gcc’s transparent_union attribute. transparent_union seems to allow a pointer to any of a union’s member types to be implicitly converted to a pointer to the union type itself, when the union type is used as a function parameter. In C++, of course, the solution would be to make the members publicly inherit a base class and use that to allow implicit upcasting.
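
A rough sketch of the shape of the problem, with an abbreviated member list that shouldn’t be mistaken for the real libdispatch definition:

// With gcc, dispatch_object_t is a union carrying __attribute__((transparent_union)),
// so a dispatch_queue_s* can be passed directly wherever a dispatch_object_t parameter
// is expected. Visual C++ has no equivalent, so call sites get wrapped instead.
union dispatch_object_t {
    struct dispatch_object_s* _do;
    struct dispatch_queue_s*  _dq;
    struct dispatch_source_s* _ds;
    // ...the real union has more members
};

inline dispatch_object_t as_do(struct dispatch_queue_s* dq)
{
    dispatch_object_t o;
    o._dq = dq;
    return o;
}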

Also in the source tree is the imaginatively-named libdispatch++. This is a very thin C++ wrapper around the C API. Normally I wouldn’t bother, except for one thing: C++0x includes lambdas. Here’s the thing: I can’t add block support to Visual C++; at best it would require a custom source-source translator to preprocess the code, at worst it’d require intrusive compiler changes. The former is more work than it’s worth; the latter I simply can’t do. With C++0x’s lambdas, the case for blocks is considerably less compelling anyway. The lambdas support essentially the same range of things that blocks do, and they’re standard to boot. They should allow usage something along the lines of:

    void main_loop(int event, void* data)
    {
       switch(event)
       {
       case OMG_SOMEONE_CLICKED_THE_BUTTON:
          gcd::dispatch_queue::get_global_queue(0, 0).async([data]()->void {
             doSomethingLengthyToTheData(data);
             gcd::dispatch_queue::get_main_queue().async([]()->void {
                updateMainWindow();
             });
          });
       }
    }

(though get_main_queue() is not actually implemented yet for technical reasons). I think this compares reasonably well with the block versions. I might rename the classes to get rid of the wordy “dispatch_”.

Still to be ported are the test cases. This is obviously important, but they’re currently all written using blocks, and all are designed to be standalone executables (i.e. each test has its own main()). I don’t want to create one Visual C++ project per test case, but can’t immediately think of any good way to aggregate them without breaking anything.

Grand Central Dispatch for Win32: the port

Having established that Grand Central Dispatch would be a good thing to have on Windows, the task was to begin porting it.

The good news is that the actual implementation of Grand Central Dispatch, named libdispatch, is open source, released under the Apache license.

The bad news is that it basically won’t work with anything that isn’t Mac OS X. libdispatch depends on two particular technologies that aren’t widely available: pthread_workqueues and kevent().

pthread_workqueues are an Apple invention; they’re kernel-supported thread pools. Although some of the Mac OS X source code is open source—including that of pthread_workqueues—there’s no real documentation available.

kevent() is found on FreeBSD; in fact, that’s where Mac OS X gets it from. It’s designed as a high-performance alternative to select(), intended in particular to address two major flaws with that API. One, select() has no memory; on every single call, the entire set of descriptors of interest must be passed in to the function, even if they’re the same every time. Busy servers can waste a considerable amount of time just copying the array of descriptors into and back from kernel mode.

Two, the function requires O(n) scans. When select() returns to the caller, it indicates only whether any descriptors became ready or not. The caller then has to scan the entire descriptor array looking for any that are ready to operate on. On a busy server with thousands of concurrent connections, this scanning takes a prohibitive amount of time.

kevent() fixes this. Instead of passing the descriptors each time, a persistent kernel queue object is created by calling kqueue(). File descriptors of interest are then registered in this queue with the kevent() call. The application can then wait for the queue to signal activity, again by calling kevent(). When the function returns, it provides an array of results with activity, ending the need to perform performance-sapping O(n) scans.
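
In outline (a sketch from memory, not production code), kevent()-based readiness handling looks like this:

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>

void watch(int fd)
{
    int kq = kqueue();                                    // persistent kernel queue

    struct kevent change;
    EV_SET(&change, fd, EVFILT_READ, EV_ADD, 0, 0, nullptr);
    kevent(kq, &change, 1, nullptr, 0, nullptr);          // register interest once

    struct kevent events[64];
    for(;;) {
        int n = kevent(kq, nullptr, 0, events, 64, nullptr); // wait for activity
        for(int i = 0; i < n; ++i) {
            // events[i].ident is a descriptor that's ready to be read
        }
    }
}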

A third, less significant, issue is the use of Apple’s lambda-like blocks. Fortunately, libdispatch does not use these internally, and every block-using function has an equivalent that uses a regular function pointer/void* context pointer pair. Behind the scenes, these are called by the block versions of the functions.

Other ports

The first group to work on porting libdispatch to a non-Apple platform was FreeBSD. FreeBSD was in the strongest position, of course; Mac OS X is already based on FreeBSD, which is why libdispatch uses kevent() in the first place, and similarly the modifications made by pthread_workqueues were made to a FreeBSD-derived system. The port was completed relatively quickly, and is now claimed to be stable.

Less easy are ports to other platforms. There are efforts to bring libdispatch to Linux and Solaris. User-mode implementations of pthread_workqueues are feasible (if not optimal), and effort has been put into creating mimics for kevent() that leverage the alternative facilities within those operating systems.

Still, even on those platforms, the work was reasonably simple. They already use pthreads, and their I/O models are similar in capability, just gratuitously different.

Windows

For good or ill, Windows is like nothing else on earth. It doesn’t use pthreads, and it has a very different I/O model. Neither of these are bad things as such—in fact, the reasons behind both decisions are very sound—but it means that porting Unix-oriented software like libdispatch is more work than might otherwise be the case.

While it wasn’t until Snow Leopard’s release that Mac OS X had a thread pool API, pthread_workqueues, Windows has had one since Windows 2000. In Windows Vista, a new, rather more robust thread pool API was added. For the most part, this thread pool API maps 1:1 with Apple’s pthread_workqueues. Wrapping the former to provide an API equivalent to the latter is not a major undertaking, and it works pretty well.
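
As an illustration of how thin the wrapping can be, here’s a sketch (not the port’s actual code, which creates its own pools and callback environments) of submitting a function pointer/context pair to the default Vista-style pool:

#include <Windows.h>

struct work_item {
    void (*function)(void*);
    void* context;
};

static void CALLBACK work_adapter(PTP_CALLBACK_INSTANCE, void* parameter)
{
    work_item* item = static_cast<work_item*>(parameter);
    item->function(item->context);
    delete item;
}

// Roughly the shape of a pthread_workqueue_additem_np-style submission.
bool submit(void (*function)(void*), void* context)
{
    work_item* item = new work_item;
    item->function = function;
    item->context = context;
    if(!::TrySubmitThreadpoolCallback(&work_adapter, item, nullptr)) {
        delete item;
        return false;
    }
    return true;
}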

kevent() and I/O in general are a more difficult issue. The preferred model in Unix is that of readiness notifications. select() and kevent() both return when a file descriptor is readable without blocking; that is, when it has data available. You call the API, wait for readiness, and then do the actual read or write operation. The actual read or write in these situations is a regular synchronous blocking operation—it’s just that you know it won’t block because of the readiness notification.

Windows works on a model of completion notifications. You tell Windows to read or write to a file or socket, and it wakes up your program when that action has actually taken place. In this model, the read or write operations are non-blocking and asynchronous—the API calls return immediately. Microsoft calls it overlapped I/O.

There are a couple of ways to use overlapped I/O. The most basic is to wait for an event to be triggered when the operation has completed; do the read or write operation in one thread, wait for the event in a second thread, and use that second thread to actually respond to the operation. This works, but as might be expected, is lousy for scalability.

The better way is to use I/O completion ports. With these, one or more HANDLEs are bound to an object called a completion port. Whenever an overlapped I/O operation on any of those HANDLEs is completed, the completion port is signalled and passed the results of the operation. This means that instead of having to wait on a whole bunch of events, it’s possible to create just a few threads to respond to completion notifications, and they can process completion events for hundreds or thousands of HANDLEs.
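
A minimal sketch of that shape (error handling omitted): create a port, associate HANDLEs with it, and have a few threads pull completions from it:

#include <Windows.h>

// Create a standalone completion port; handles are then associated with it as they're opened.
HANDLE create_port()
{
    return ::CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, 0);
}

void bind_handle(HANDLE port, HANDLE file_or_socket, ULONG_PTR key)
{
    ::CreateIoCompletionPort(file_or_socket, port, key, 0);
}

// A handful of threads run this loop; each dequeued completion carries the byte count,
// the key the HANDLE was bound with, and the OVERLAPPED identifying the operation.
void completion_loop(HANDLE port)
{
    for(;;) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;
        OVERLAPPED* overlapped = nullptr;
        if(::GetQueuedCompletionStatus(port, &bytes, &key, &overlapped, INFINITE)) {
            // hand the completed operation back to whoever started it
        }
    }
}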

In practice, this is a great model, although quite confusingly documented. In fact, it’s the model you probably want. Readiness notifications combined with a mix of blocking and non-blocking calls are really kind of hokey. Especially when coupled with a particularly annoying Unix trait that disk files are always deemed readable and writeable. Even though operations on disk files will actually block, select() will claim they’re ready at all times.

The downside to this is that the dispatch sources in libdispatch are designed for readiness notifications. Without using select(), with all its problems, there’s no good way to do that on Windows. This leaves two options. Cobble something together that (a) won’t be as good and (b) doesn’t fit in with the natural Win32 I/O paradigm, or say screw it: instead of aiming for exact 1:1 compatibility with the Mac OS X dispatch sources, create a new overlapped I/O source that enqueues callbacks whenever an overlapped operation has completed.

It is this second route that I have taken.

Minor issues

The libdispatch source code is more or less C99 with various gcc extensions. I want to be able to use Visual C++, which only supports C89 and essentially no gcc extensions (though there are one or two non-standard features common to both compilers). After all, there’s little point in producing a Win32-native version of the library if it’s going to force people to use the MinGW or Cygwin toolchains.

To that end, C99-style named initializers need to be replaced with C89-style aggregate initializers, some weird gcc implicit casts need to be replaced with explicit casts, and a few other bits and pieces need to be changed around.

libdispatch also, unsurprisingly, depends on the existence of various POSIX APIs. Stub libraries fill out the missing API calls; mainly simple stuff like clock_gettime().

One area that needed a little more attention is pthreads. Or rather, pthreads’ TLS. pthreads’ TLS includes a destructor feature that ensures it tidies up TLS values if a thread exits. Win32’s TLS has no equivalent capability. There is a way to work around this, but it has some issues that I have not yet resolved, and might not be able to resolve.

The good news is that this capability might be unnecessary. The use of pthread TLS destructors is to handle the special “main” thread in Cocoa. As a special compatibility feature, if the main thread exits whilst callbacks are queued on the main queue, the main thread will execute those callbacks. The preferred way to handle this, even on Mac OS X, is to explicitly call dispatch_main() to drain the main queue. Using this approach, the pthread TLS destructors aren’t needed anyway, and this style is probably more amenable to Win32, which has no main thread.

Grand Central Dispatch for Win32: why I want it

When Apple introduced Grand Central Dispatch in Mac OS X Snow Leopard the response from the fanboy crowd was predictably lunatic, with many of Apple’s fans proclaiming that Snow Leopard would, somehow, enable software to exploit the full potential of parallel processors, with little or no developer effort necessary; a belief that Apple did little or nothing to disabuse them of.

In reality, of course, it was nothing of the sort. Though it does not live up to these lofty goals—and, unsurprisingly, does not even attempt to live up to them—it is nonetheless a very interesting library indeed.

At first glance, it doesn’t look too special. It looks a bit like thread pooling, and there’s nothing fancy or innovative about thread pooling. Thread pooling is how Wikipedia describes GCD, and even Apple’s description is almost wholly fixated with threads and thread management.

It’s true that GCD does use thread pools in the background, but to concentrate on this is to miss the point. GCD’s value lies not in thread pooling, but in queuing.

Queues are at the heart of GCD. Everything happens with queues. Tasks/callbacks are placed on queues and, when dequeued, get executed. There are both parallel queues—which is where the thread pool nature of GCD becomes apparent—and serial queues—which execute strictly in order. Queues can target other queues allowing complex queue graphs to be constructed, in turn allowing simple construction of software pipelines and other complex code flows.
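
A sketch of that in the C API (using the function-pointer variants; the block forms are introduced below):

#include <dispatch/dispatch.h>

void stage_one(void* context);   // ordinary functions taking a context pointer
void stage_two(void* context);

void build_pipeline(void* context)
{
    // A serial queue that executes its callbacks strictly in order, draining onto
    // the default global (parallel) queue.
    dispatch_queue_t serial = dispatch_queue_create("com.example.pipeline", NULL);
    dispatch_set_target_queue(serial, dispatch_get_global_queue(0, 0));

    dispatch_async_f(serial, context, stage_one);   // runs first
    dispatch_async_f(serial, context, stage_two);   // runs strictly after stage_one
}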

In tandem with GCD, Apple also introduced a new C extension that they have called “blocks”. Blocks are more or less lambdas (anonymous functions) grafted onto C, and there is also a block storage scope, the purpose of which I cannot remember at this time. Unfortunately, they have syntax that’s incompatible with C++ lambdas. Blocks and GCD are separate—you can have either one without the other—but mesh nicely together.

Don’t block the main thread

Grand Central Dispatch is appealing for many reasons. In common with many graphical/windowing toolkits, Cocoa windows are essentially single-threaded in nature. Traditionally, they were tightly restricted: any command that updated the window in any way needed to be executed on the thread owning that window. Those restrictions have been relaxed a little in 10.6, but to avoid problems it’s still a sound model to follow. However, it causes an issue in an all-too-common scenario:

  1. User clicks a button to trigger an action
  2. Action is executed on the window’s thread
  3. The action takes a long time to execute
  4. Because the window is no longer processing messages, it beachballs (Mac OS X) or ghosts (Windows)

This situation is notoriously common, and it makes applications horrible to use, as it makes their interfaces slow and unresponsive. The reason it’s so common is that generally, there’s just no good alternative. Naively, performing the action in a separate thread seems to be the way to go—you might even use a thread pool to do it—and for some actions that’s even true, but in general there’s a problem with this approach.

Typically, when the action is finished, you want to update the window somehow. And to safely update a window, the updates need to be made on the window’s thread. But if your action was spun off into its own thread, it has no good way of getting back onto the window thread. Awkward.

In Windows Presentation Foundation, which has a similar single-threaded approach, you would use Dispatcher.Invoke to make an update back on the main thread. WinForms, the older .NET toolkit that wraps Win32, had a similar mechanism. But pure Win32 and Cocoa alike both lacked any effective way to do this.

Enter Grand Central Dispatch. Grand Central Dispatch on Cocoa provides a special queue that is plumbed in to the main thread (that is, the thread on which windowing operations should be performed). With this special queue, it’s all of a sudden very easy to perform our slow action on a background thread:

    void main_loop(int event, void* data)
    {
       switch(event)
       {
       case OMG_SOMEONE_CLICKED_THE_BUTTON:
          dispatch_async(dispatch_get_global_queue(0, 0), ^{
             doSomethingLengthyToTheData(data);
             dispatch_async(dispatch_get_main_queue(), ^{
                updateMainWindow();
             });
          });
       }
    }

Notice the use of blocks. Blocks are introduced with a caret, ^, followed by an optional argument list (omitted here, because neither block has parameters anyway), followed by the block body inside braces. ^ is also used as a declarator (equivalent to * for pointers), to allow block types to be declared, using syntax equivalent to that of function pointers. This alone is a killer feature.

Asynchronous programming made easy

The second highly desirable capability is provided by what GCD calls “sources”. These are objects that add tasks to a queue autonomously. One obvious type is timer sources. Create a timer source, specify a block/callback, and set its schedule. Then, whenever the timer fires, it’ll put the callback onto whichever queue the source is targeting, and the callback will get run as soon as it is dequeued.
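
A sketch of a timer source, using the function-pointer flavour of the API:

#include <dispatch/dispatch.h>

void on_tick(void* context);   // called on the target queue each time the timer fires

dispatch_source_t make_timer(void* context)
{
    dispatch_source_t timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0,
                                                     dispatch_get_global_queue(0, 0));
    dispatch_set_context(timer, context);
    dispatch_source_set_event_handler_f(timer, on_tick);

    // Fire once per second, starting now, with 10ms of leeway.
    dispatch_source_set_timer(timer, dispatch_time(DISPATCH_TIME_NOW, 0),
                              NSEC_PER_SEC, 10 * NSEC_PER_MSEC);

    dispatch_resume(timer);   // sources are created suspended
    return timer;
}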

The most exciting source types are the read and write sources. These sources are bound to file descriptors, and trigger the callbacks whenever the descriptors change to become ready for reading or writing, respectively. This provides a very nice building block for applications built around non-blocking I/O. Instead of dealing with the whole mess of select() (or other, better-performing alternatives), you simply create a source, and tell it which code you want it to execute whenever it’s readable or writeable. This is so much nicer it’s not funny.

GCD can, of course, be used as a parallel programming library. It has a basic data-parallel operation, dispatch_apply, that will queue the specified callback n times. This can be used to process arrays in parallel quite easily. This is nice enough for simple tasks—“encode every file in this array as an MP3”, say—but it’s quite rudimentary. I wouldn’t want to use it for array-based number crunching, for example: OpenMP provides a better set of primitives for this kind of task.
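
For what it’s worth, the function-pointer form of that operation looks like this (a sketch):

#include <dispatch/dispatch.h>
#include <stddef.h>

void encode_one(void* context, size_t index)
{
    // context points at the shared input; index selects the element for this iteration
}

void encode_all(void* files, size_t count)
{
    // Queues 'count' callbacks on the global queue and returns once they've all run.
    dispatch_apply_f(count, dispatch_get_global_queue(0, 0), files, encode_one);
}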

But for asynchronous development and software pipelining, Grand Central Dispatch is really hard to beat. The API is quite small and simple, making it easy to learn, its programming model is natural, making it easy to integrate into existing codebases, and the functionality it provides is highly desirable.

The big problem is that it’s only available under Mac OS X, with a port also available on FreeBSD. Most of my time these days is spent using and programming Windows—but I want Grand Central Dispatch, and want it badly.

So that’s why I’ve ported it.

Grand Central Dispatch for Win32

So, I’m trying to put together a Grand Central Dispatch workalike for Win32, using the Vista/7 threadpool system behind the scenes.

The essence of GCD is pretty simple, but there are enough wrinkles (target queues and suspension) that a naive implementation won’t work. I can’t just dump things into the pool and forget about them, so instead I have to maintain my own queues of work items and plop them into the pool myself.

Apple’s implementation has the same issue; behind the scenes it uses Apple’s proprietary “pthread_workqueues”, but it layers its own queuing on top.

It’s an interesting experience. One thing I’ve found is that for all the noise that people made about GCD when Snow Leopard was in beta, there’s vanishingly little content about it on the Web. Either nobody is actually using it (which is a pity, because it enables quite a nice coding style for GUI applications), their uses are all extremely simple (such that they only use a relatively small subset of its features), or they’re just not talking about it.

The latter seems a bit strange, because some of its features aren’t entirely obvious. The utility of the ability to set target queues, in particular, is not at all clear.

I wonder if perhaps it’s not useful as such, but rather a feature they added for their NSOperationQueue stuff, which layers on top of GCD to provide some richer capabilities. NSOperationQueue is used to run NSOperation objects, and those objects can have dependencies on other NSOperations. This allows a DAG of NSOperations to be constructed. My speculation is that perhaps it constructs an equivalent DAG of GCD queues, and this is why GCD has the targeting capability. But I don’t really know for sure.

The big sticking point with any implementation will be dispatch sources. Windows has no direct equivalent (its asynchronous/overlapped I/O mechanism works in a fundamentally different way), so it’s going to require some thinking. It would be nice to be able to support dispatch sources for file/socket I/O, and something similar should be feasible. But that will have to wait until I have the fundamentals done.