Monday, April 25, 2011

Grand Central Dispatch for Win32: the port

Having established that Grand Central Dispatch would be a good thing to have on Windows, the task was to begin porting it.

The good news is that the actual implementation of Grand Central Dispatch, named libdispatch, is open source, released under the Apache license.

The bad news is that it basically won’t work with anything that isn’t Mac OS X. libdispatch depends on two particular technologies that aren’t widely available: pthread_workqueues and kevent().

pthread_workqueues are an Apple invention; they’re kernel-supported thread pools. Although some of the Mac OS X source code is open source—including that of pthread_workqueues—there’s no real documentation available.

kevent() is found on FreeBSD; in fact, that’s where Mac OS X gets it from. It’s designed as a high-performance alternative to select(), designed in particular to address two major flaws with that API. One, select() has no memory; on every single call, the entire set of descriptors of interest must be passed in to the function, even if they’re the same every time. Busy servers can waste a considerable amount of time just copy the array of descriptors into and back from kernel mode.

Two, the function requires O(n) scans. When select() returns to the caller, it indicates only whether any descriptors became ready or not. The caller than has to scan the entire descriptor array looking for any that are ready to operate on. On a busy server with thousands of concurrent connections, this scanning takes a prohibitive amount of time.

kevent() fixes this. Instead of passing the descriptors each time, a persistent kernel queue object is created by calling kqueue(). File descriptors of interest are then registered in this queue with the kevent() call. The application can then wait for the queue to signal activity, again by calling kevent(). When the function returns, it provides an array of results with activity, ending the need to perform performance-sapping O(n) scans.

A third, less significant, issue is the use of Apple’s lambda-like blocks. Fortunately, libdispatch does not use these internally, and every block-using function has an equivalent that uses a regular function pointer/void* context pointer pair. Behind the scenes, these are called by the block versions of the functions.

Other ports

The first group to work on porting libdispatch to a non-Apple platform was FreeBSD. FreeBSD was in the strongest position, of course; Mac OS X is already based on FreeBSD, which is why libdispatch uses kevent() in the first place, and similarly the modifications made by pthread_workqueues were made to a FreeBSD-derived system. The port was completed relatively quickly, and is now claimed to be stable.

Less easy are ports to other platforms. There are efforts to bring libdispatch to Linux and Solaris. User-mode implementations of pthread_workqueues are feasible (if not optimal), and effort has been put into creating mimics for kevent() that leverage the alternative facilities within those operating systems.

Still, even on those platforms, the work was reasonably simple. They already use pthreads, and their I/O models are similar in capability, just gratuitously different.

Windows

For good or ill, Windows is like nothing else on earth. It doesn’t use pthreads, and it has a very different I/O model. Neither of these are bad things as such—in fact, the reasons behind both decisions are very sound—but it means that porting Unix-oriented software like libdispatch is more work than might otherwise be the case.

While it wasn’t until Snow Leopard’s release that Mac OS X had a thread pool API, pthread_workqueues, Windows has had one since Windows 2000. In Windows Vista, a new, rather more robust thread pool API was added. For the most part, this thread pool API maps 1:1 with Apple’s pthread_workqueues. Wrapping the former to provide an API equivalent to the latter is not a major undertaking, and it works pretty well.

kevent() and I/O in general are a more difficult issue. The preferred model in Unix is that of readiness notifications. select() and kevent() both return when a file descriptor is readable without blocking; that is, when they have data available. You call the API, wait for readiness, and then do the actual read or write operation. The actual read or write in these situations is a regular synchronous blocking operation—it’s just that you know it won’t block because of the readiness notification.

Windows works on a model of completion notifications. You tell Windows to read or write to a file or socket, and it wakes up your program when that action has actually taken place. In this model, the read or write operations are non-blocking and asynchronous—the API calls return immediately. Microsoft calls it overlapped I/O.

There are a couple of ways to use overlapped I/O. The most basic is to wait for an event to be triggered when the operation has completed; do the read or write operation in one thread, wait for the event in a second thread, and use that second thread to actually respond to the operation. This works, but as might be expected, is lousy for scalability.

The better way is to use I/O completion ports. With these, one or more HANDLEs are bound to an object called a completion port. Whenever an overlapped I/O operation on any of those HANDLEs is completed, the completion port is signalled and passed the results of the operation. This means that instead of having to wait on a whole bunch of events, it’s possible to create just a few threads to respond to completion notifications, and they can process completion events for hundreds or thousands of HANDLEs.

In practice, this is a great model, although quite confusingly documented. In fact, it’s the model you probably want. Readiness notifications combined with a mix of blocking and non-blocking calls are really kind of hokey. Especially when coupled with a particularly annoying Unix trait that disk files are always deemed readable and writeable. Even though operations on disk files will actually block, select() will claim they’re ready at all times.

The downside to this is that the dispatch sources in libdispatch are designed for readiness notifications. Without using select(), with all its problems, there’s no good way to do that on Windows. This leaves two options. Cobble something together that (a) won’t be as good (b) doesn’t fit in with the natural Win32 I/O paradigm, or say screw it: instead of aiming for exact 1:1 compatibility with the Mac OS X dispatch sources, create a new overlapped I/O source that enqueues callbacks whenever an overlapped operation has completed.

It is this second route that I have taken.

Minor issues

The libdispatch source code is more or less C99 with various gcc extensions. I want to be able to use Visual C++, which only supports C89 and essentially no gcc extensions (though there are one or two non-standard features in common to both compilers). After all, there’s little point in producing a Win32-native version of the library if it’s going to force people to use the MinGW or Cygwin toolchains.

To that end, C99-style named initializers need to be replaced with C89-style aggregate initializers, some weird gcc implicit casts need to be replaced with explicit casts, and a few other bits and pieces need to be changed around.

libdispatch also, unsurprisingly, depends on the existence of various POSIX APIs. Stub libraries fill out the missing API calls; mainly simple stuff like clock_gettime().

One area that needed a little more attention is pthreads. Or rather, pthreads’ TLS. pthreads’ TLS includes a destructor feature that ensures it tidies up TLS values if a thread exits. Win32’s TLS has no equivalent capability. There is a way to work around this, but it has some issues that I have not yet resolved, and might not be able to resolve.

The good news is that this capability might be unnecessary. The use of pthread TLS destructors is to handle the special “main” thread in Cocoa. As a special compatibility feature, if the main thread exits whilst callbacks are queued on the main queue, the main thread will execute those callbacks. The preferred way to handle this, even on Mac OS X, is to explicitly call dispatch_main() to drain the main queue. Using this approach, the pthread TLS destructors aren’t needed anyway, and this style is probably more amenable to Win32, which has no main thread.