Monthly Archives: April 2011

Grand Central Dispatch for Win32: things still to do

So, the libdispatch port I’ve been working on is currently quite rough and ready. The major parts all seem to work, though I need to migrate all the tests, but there’s one significant piece missing: the main queue.

Cocoa is, for the most part, single-threaded; updates to a window must be performed on the the thread that owns that window. The same is true of WPF, Win32, and others. However, Cocoa takes things a little further than Win32. In Win32 all threads are essentially created equal, and any thread is allowed to create windows and pump messages. It’s an M:N system. Windows still have thread affinity—any given window must only be updated by the thread running its message loop—but there can be multiple loops on multiple threads, each with their own set of windows.

Not so in Cocoa. The main thread, that is, the one that literally runs main() is special. All windows must have their message loops run on this thread, and all updates must be funnelled through this thread.

As I wrote in the post outlining why I want Grand Central Dispatch, the ability for secondary (worker) threads to run code from the window’s owning thread is highly desirable. In Cocoa, that means running code on the main thread, and so that’s what libdispatch enables.

Corresponding to the main thread, libdispatch creates a main queue. Since there’s only ever one main thread (used for every window), there’s only ever one main queue. Any callbacks placed on the main queue will eventually be executed by the main thread.

Creating a serial queue and enqueuing messages is easy enough; the tricky part is responding to those messages, and it’s here that libdispatch gets a bit tricky. In truth, I’m a little hazy on some of the details, because not all the plumbing is found in libdispatch; there’s also a Cocoa-side integration that I don’t think is public (if it is, I don’t know where the source is).

libdispatch has two different ways of draining the main queue. A last-ditch automatic mechanism used to ensure the right thing happens even when Cocoa isn’t actually called, and to ensure that things work properly, and a good way that integrates properly with Cocoa. The automatic mechanism leverages pthreads’ TLS destructors. A particular TLS property has a destructor that, when invoked from the main thread, will drain the main queue. Drop off the end of the main thread, either just by returning from main() or by calling dispatch_main() (which in turn calls pthread_exit()), and the destructor will be called, draining the queue.

It’s to replicate this mechanism that I investigated the feasibility of implementing TLS destructors in Win32. The implementation kinda works, but annoyingly the TLS destructor is called so late in the thread’s tear-down that it’s basically not safe to do anything, especially not make arbitrary function calls in user-supplied callbacks. Unless I can find some way of resolving this, I’ll need to find some other approach. I think the DLL notifications happen at a better time, but I really want the convenience of a static library.

This is a little annoying. Though a single main thread/main queue isn’t a natural fit for Windows, I could have created a queue per thread and used the same “drain this thread’s queue when the thread is torn down” approach. One workaround that may be effective is to give up on automatic queue draining when returning from main() and instead require dispatch_main() to be called explicitly. This would probably be good enough.

The second mechanism, which is much better as it doesn’t require ending the main thread, is the one I’m a bit less clear about. The key function here is _dispatch_main_queue_callback_4CF(). This function gets called from Cocoa’s message loop, and it drains any messages placed on the main queue, before returning control back to the message loop.

This approach should be much easier to integrate, since it doesn’t depend on any special behaviour of threads or destructors or anything; it’s just a regular function call. Every time something is put onto the main queue it alerts Cocoa (_dispatch_queue_wakeup_main()), and Cocoa then drains the queue. All easy enough to translate into Win32.

However, it’s not quite as simple as that, because of the threading model Windows uses. There isn’t any long a single “main” queue. Any thread with a message loop will have to have its own queue, and the special alerting behaviour will need to take this into account. It will also have to ensure that it alerts the right thread. This will mean altering the queue objects to include an indication of whether they’re a “special” thread queue—that is, one drained from a user thread rather than a pthread_workqueue thread—and, if so, which thread they actually belong to, so that the right thread can be alerted.

There will also need to be some way of accessing these special queues (so that callbacks can be placed on them), so some kind of HWND-to-queue and possibly thread ID or HANDLE-to-queue lookup functions will be necessary.

As luck would have it, the libdispatch test cases all depend on the main queue anyway, so before I can readily port the tests, I’m going to have to put something together to address this need.

Grand Central Dispatch for Win32: the source code

Having explained why I want to port Grand Central Dispatch to Windows and outlined some of the issues in doing so, it’s probably a good time to show some source code!

I’ve put the code into github. I’m not sure I’m entirely enamoured of github, or git in general, and I’m not even sure that I’ve pulled Apple’s source in the best possible way (I’m using the subtree merge approach instead of submodules, but am vague on the pros and cons of each mechanism).

github will tell you about all the modifications if you’re interested, but it’s probably worth mentioning a few things explicitly. I’ve attempted to change as little as possible, with the proviso that the thing has gotta build and at least give the appearance of working. The only file with wholesale changes is queue_kevent.c, which effectively has two wholly independent implementations, one using kevent(), the other using I/O completion ports.

The most disruptive modification to the source tree was the creation of the /platform hierarchy. This is where I put the Win32 stubs for UNIX headers, including the pthread_workqueue implementation. The implementation is fairly straightforward. I’ve implemented more than is strictly necessary for libdispatch—but not everything. Some concepts, such as workqueue suspension and resumption, have no obvious parallels in Win32.

I should note that I used new-style Win32 threadpools, available in Windows Vista and up. This means the code won’t work on Windows XP. The reasons for picking the new API are multiple:

  1. It can be used robustly, whereas the old one cannot; the old one provides no way of properly handling out-of-resource situations.
  2. It allows multiple pools per process, which allows libdispatch’s pools to be relatively isolated from any others that the application might create. This seems to reduce the possibility of surprises.
  3. The old threadpool API lacks any effective way of tidying up, in particular preventing callbacks from safely performing such tasks as unloading the DLL they are running from, and having no way to ensure that every callback is safely executed or deallocated.
  4. There did not seem any obvious way to implement e.g. pthread_workqueue_removeitem_np using old-style threadpools.
  5. Timer queue timers have no leeway facility.

Honestly, in this day and age, nobody should be using Windows XP. Windows Vista and Windows 7, which support the new API, are both substantial improvements on that operating system.

One of the most pervasive changes (annoyingly so, it’ll clutter up any diffs) was the insertion of the function as_do() (as in, interpret this object “as a DO” (dispatch object)). This is because Visual C++ doesn’t support gcc’s transparent_union attribute. transparent_union seems to allow a pointer to any of a union’s member types to be implicitly converted to a pointer to the union type itself, when the union type is used as a function parameter. In C++, of course, the solution would be to make the members publicly inherit a base class and use that to allow implicit upcasting.

Also in the source tree is the imaginatively-named libdispatch++. This is a very thin C++ wrapper around the C API. Normally I wouldn’t bother, except for one thing: C++0x includes lambdas. Here’s the thing: I can’t add block support to Visual C++; at best it would require a custom source-source translator to preprocess the code, at worst it’d require intrusive compiler changes. The former is more work than it’s worth; the latter I simply can’t do. With C++0x’s lambdas, the case for blocks is considerably less compelling anyway. The lambdas support essentially the same range of things that blocks do, and they’re standard to boot. They should allow usage something along the lines of:

    void main_loop(int event, void* data)
    {
       switch(event)
       {
       case OMG_SOMEONE_CLICKED_THE_BUTTON:
          gcd::dispatch_queue::get_global_queue(0, 0).async([]()->void {
             doSomethingLengthyToTheData(data);
             gcd::dispatch_queue::get_main_queue().async([]()->void {
                updateMainWindow();
             });
          });
       }
    }

(though get_main_queue() is not actually implemented yet for technical reasons). I think this compares reasonably well with the block versions. I might rename the classes to get rid of the wordy “dispatch_”.

Still to be ported are the test cases. This is obviously important, but they’re currently all written using blocks, and all are designed to be standalone executables (i.e. each test has its own main()). I don’t want to create one Visual C++ project per test case, but can’t immediately think of any good way to aggregate them without breaking anything.

Grand Central Dispatch for Win32: the port

Having established that Grand Central Dispatch would be a good thing to have on Windows, the task was to begin porting it.

The good news is that the actual implementation of Grand Central Dispatch, named libdispatch, is open source, released under the Apache license.

The bad news is that it basically won’t work with anything that isn’t Mac OS X. libdispatch depends on two particular technologies that aren’t widely available: pthread_workqueues and kevent().

pthread_workqueues are an Apple invention; they’re kernel-supported thread pools. Although some of the Mac OS X source code is open source—including that of pthread_workqueues—there’s no real documentation available.

kevent() is found on FreeBSD; in fact, that’s where Mac OS X gets it from. It’s designed as a high-performance alternative to select(), designed in particular to address two major flaws with that API. One, select() has no memory; on every single call, the entire set of descriptors of interest must be passed in to the function, even if they’re the same every time. Busy servers can waste a considerable amount of time just copy the array of descriptors into and back from kernel mode.

Two, the function requires O(n) scans. When select() returns to the caller, it indicates only whether any descriptors became ready or not. The caller than has to scan the entire descriptor array looking for any that are ready to operate on. On a busy server with thousands of concurrent connections, this scanning takes a prohibitive amount of time.

kevent() fixes this. Instead of passing the descriptors each time, a persistent kernel queue object is created by calling kqueue(). File descriptors of interest are then registered in this queue with the kevent() call. The application can then wait for the queue to signal activity, again by calling kevent(). When the function returns, it provides an array of results with activity, ending the need to perform performance-sapping O(n) scans.

A third, less significant, issue is the use of Apple’s lambda-like blocks. Fortunately, libdispatch does not use these internally, and every block-using function has an equivalent that uses a regular function pointer/void* context pointer pair. Behind the scenes, these are called by the block versions of the functions.

Other ports

The first group to work on porting libdispatch to a non-Apple platform was FreeBSD. FreeBSD was in the strongest position, of course; Mac OS X is already based on FreeBSD, which is why libdispatch uses kevent() in the first place, and similarly the modifications made by pthread_workqueues were made to a FreeBSD-derived system. The port was completed relatively quickly, and is now claimed to be stable.

Less easy are ports to other platforms. There are efforts to bring libdispatch to Linux and Solaris. User-mode implementations of pthread_workqueues are feasible (if not optimal), and effort has been put into creating mimics for kevent() that leverage the alternative facilities within those operating systems.

Still, even on those platforms, the work was reasonably simple. They already use pthreads, and their I/O models are similar in capability, just gratuitously different.

Windows

For good or ill, Windows is like nothing else on earth. It doesn’t use pthreads, and it has a very different I/O model. Neither of these are bad things as such—in fact, the reasons behind both decisions are very sound—but it means that porting Unix-oriented software like libdispatch is more work than might otherwise be the case.

While it wasn’t until Snow Leopard’s release that Mac OS X had a thread pool API, pthread_workqueues, Windows has had one since Windows 2000. In Windows Vista, a new, rather more robust thread pool API was added. For the most part, this thread pool API maps 1:1 with Apple’s pthread_workqueues. Wrapping the former to provide an API equivalent to the latter is not a major undertaking, and it works pretty well.

kevent() and I/O in general are a more difficult issue. The preferred model in Unix is that of readiness notifications. select() and kevent() both return when a file descriptor is readable without blocking; that is, when they have data available. You call the API, wait for readiness, and then do the actual read or write operation. The actual read or write in these situations is a regular synchronous blocking operation—it’s just that you know it won’t block because of the readiness notification.

Windows works on a model of completion notifications. You tell Windows to read or write to a file or socket, and it wakes up your program when that action has actually taken place. In this model, the read or write operations are non-blocking and asynchronous—the API calls return immediately. Microsoft calls it overlapped I/O.

There are a couple of ways to use overlapped I/O. The most basic is to wait for an event to be triggered when the operation has completed; do the read or write operation in one thread, wait for the event in a second thread, and use that second thread to actually respond to the operation. This works, but as might be expected, is lousy for scalability.

The better way is to use I/O completion ports. With these, one or more HANDLEs are bound to an object called a completion port. Whenever an overlapped I/O operation on any of those HANDLEs is completed, the completion port is signalled and passed the results of the operation. This means that instead of having to wait on a whole bunch of events, it’s possible to create just a few threads to respond to completion notifications, and they can process completion events for hundreds or thousands of HANDLEs.

In practice, this is a great model, although quite confusingly documented. In fact, it’s the model you probably want. Readiness notifications combined with a mix of blocking and non-blocking calls are really kind of hokey. Especially when coupled with a particularly annoying Unix trait that disk files are always deemed readable and writeable. Even though operations on disk files will actually block, select() will claim they’re ready at all times.

The downside to this is that the dispatch sources in libdispatch are designed for readiness notifications. Without using select(), with all its problems, there’s no good way to do that on Windows. This leaves two options. Cobble something together that (a) won’t be as good (b) doesn’t fit in with the natural Win32 I/O paradigm, or say screw it: instead of aiming for exact 1:1 compatibility with the Mac OS X dispatch sources, create a new overlapped I/O source that enqueues callbacks whenever an overlapped operation has completed.

It is this second route that I have taken.

Minor issues

The libdispatch source code is more or less C99 with various gcc extensions. I want to be able to use Visual C++, which only supports C89 and essentially no gcc extensions (though there are one or two non-standard features in common to both compilers). After all, there’s little point in producing a Win32-native version of the library if it’s going to force people to use the MinGW or Cygwin toolchains.

To that end, C99-style named initializers need to be replaced with C89-style aggregate initializers, some weird gcc implicit casts need to be replaced with explicit casts, and a few other bits and pieces need to be changed around.

libdispatch also, unsurprisingly, depends on the existence of various POSIX APIs. Stub libraries fill out the missing API calls; mainly simple stuff like clock_gettime().

One area that needed a little more attention is pthreads. Or rather, pthreads’ TLS. pthreads’ TLS includes a destructor feature that ensures it tidies up TLS values if a thread exits. Win32’s TLS has no equivalent capability. There is a way to work around this, but it has some issues that I have not yet resolved, and might not be able to resolve.

The good news is that this capability might be unnecessary. The use of pthread TLS destructors is to handle the special “main” thread in Cocoa. As a special compatibility feature, if the main thread exits whilst callbacks are queued on the main queue, the main thread will execute those callbacks. The preferred way to handle this, even on Mac OS X, is to explicitly call dispatch_main() to drain the main queue. Using this approach, the pthread TLS destructors aren’t needed anyway, and this style is probably more amenable to Win32, which has no main thread.

Grand Central Dispatch for Win32: why I want it

When Apple introduced Grand Central Dispatch in Mac OS X Snow Leopard the response from the fanboy crowd was predictably lunatic, with many of Apple’s fans proclaiming that Snow Leopard would, somehow, enable software to exploit the full potential of parallel processors, with little or no developer effort necessary; a belief that Apple did little or nothing to disabuse them of.

In reality, of course, it was nothing of the sort. Though it does not live up to these lofty goals—and, unsurprisingly, does not even attempt to live up to them—it is nonetheless a very interesting library indeed.

At first glance, it doesn’t look too special. It looks a bit like thread pooling, and there’s nothing fancy or innovative about thread pooling. Thread pooling is how Wikipedia describes GCD, and even Apple’s description is almost wholly fixated with threads and thread management.

It’s true that GCD does use thread pools in the background, but to concentrate on this is to miss the point. GCD’s value lies not in thread pooling, but in queuing.

Queues are at the heart of GCD. Everything happens with queues. Tasks/callbacks are placed on queues and, when dequeued, get executed. There are both parallel queues—which is where the thread pool nature of GCD becomes apparent—and serial queues—which execute strictly in order. Queues can target other queues allowing complex queue graphs to be constructed, in turn allowing simple construction of software pipelines and other complex code flows.

In tandem with GCD, Apple also introduced a new C extension that they have called “blocks”. Blocks are more or less lambdas (anonymous functions) grafted onto C, and there is also a block storage scope, the purpose of which I cannot remember at this time. Unfortunately, they have syntax that’s incompatible with C++ lambdas. Blocks and GCD are separate—you can have either one without the other—but mesh nicely together.

Don’t block the main thread

Grand Central Dispatch is appealing for many reasons. In common with many graphical/windowing toolkits, Cocoa windows are essentially single-threaded in nature. Traditionally, they were tightly restricted: any command that updates the window in any way needs to be executed on the thread owning that window. Those restrictions have been relaxed a little in 10.6, but to avoid problems it’s still a sound model to follow. However, it causes an issue in an all-too-common scenario:

  1. User clicks a button to trigger an action
  2. Action is executed on the window’s thread
  3. The action takes a long time to execute
  4. Because the window is no longer processing messages, it beachballs (Mac OS X) or ghosts (Windows)

This situation is notoriously common, and it makes applications horrible to use, as it makes their interfaces slow and unresponsive. The reason it’s so common is that generally, there’s just no good alternative. Naively, performing the action in a separate thread seems to be the way to go—you might even use a thread pool to do it—and for some actions that’s even true, but in general there’s a problem with this approach.

Typically, when the action is finished, you want to update the window somehow. And to safely update a window, the updates need to be made on the window’s thread. But if your action was spun off into its own thread, it has no good way of getting back onto the window thread. Awkward.

In Windows Presentation Foundation, which has a similar single-threaded approach, you would use Dispatcher.Invoke to make an update back on the main thread. WinForms, the older .NET toolkit that wraps Win32, had a similar mechanism. But pure Win32 and Cocoa alike both lacked any effective way to do this.

Enter Grand Central Dispatch. Grand Central Dispatch on Cocoa provides a special queue that is plumbed in to the main thread (that is, the thread on which windowing operations should be performed). With this special queue, it’s all of a sudden very easy to perform our slow action on a background thread:

    void main_loop(int event, void* data)
    {
       switch(event)
       {
       case OMG_SOMEONE_CLICKED_THE_BUTTON:
          dispatch_async(dispatch_get_global_queue(0, 0), ^{
             doSomethingLengthyToTheData(data);
             dispatch_async(dispatch_get_main_queue(), ^{
                updateMainWindow();
             });
          });
       }
    }

Notice the use of blocks. Blocks are introduced with a caret, ^, followed by an optional argument list (omitted here, because neither block has parameters anyway), followed by the block body inside braces. ^ is also used as a declarator (equivalent to * for pointers), to allow block types to be declared, using syntax equivalent to that of function pointers. This alone is a killer feature.

Asynchronous programming made easy

The second highly desirable capability is provided by what GCD calls “sources”. These are objects that add tasks to a queue autonomously. One obvious type is timer sources. Create a timer source, specify a block/callback, and set its schedule. Then, whenever the timer fires, it’ll put the callback onto whichever queue the source is targeting, and the callback will get run as soon as it is dequeued.

The most exciting source types are the read and write sources. These sources are bound to file descriptors, and trigger the callbacks whenever the descriptors change to become ready for reading or writing, respectively. This provides a very nice building block for applications built around non-blocking I/O. Instead of dealing with the whole mess of select() (or other, better-performing alternatives), you simply create a source, and tell it which code you want it to execute whenever it’s readable or writeable. This is so much nicer it’s not funny.

GCD can, of course, be used as a parallel programming library. It has a basic data-parallel operation, dispatch_apply, that will queue the specified callback n times. This can be used to process arrays in parallel quite easily. Though this is nice enough for simple tasks—“encode every file in this array as an MP3”, say—but it’s quite rudimentary. I wouldn’t want to use it for array-based number crunching, for example: OpenMP provides a better set of primitives for this kind of task.

But for asynchronous development and software pipelining, Grand Central Dispatch is really hard to beat. The API is quite small and simple, making it easy to learn, its programming model is natural, making it easy to integrate into existing codebases, and the functionality it provides is highly desirable.

The big problem is that it’s only available under Mac OS X, with a port also available on FreeBSD. Most of my time these days is spent using and programming Windows—but I want Grand Central Dispatch, and want it badly.

So that’s why I’ve ported it.