CUDA 5.0 adds a new feature called “stream callbacks,” a mechanism for CPU/GPU synchronization. Previously, CPU/GPU synchronization was accomplished by calling functions like cuStreamSynchronize(), which returns when all preceding commands in the stream have completed, or cuEventSynchronize(), which waits until the specified event has been recorded by the GPU.
I spent some time investigating how stream callbacks are implemented on Windows 7. cudaStreamAddCallback() specifies a callback that will be called once, when all preceding operations in the stream have completed. CUDA makes no representation about when this callback will be performed, or the order in which the callbacks will be called, except that callbacks in a given stream will be called in the same order they were specified to that stream.
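For reference, the basic usage pattern looks something like this (a minimal sketch, with error checking and the actual work submitted to the stream elided):

```cpp
#include <stdio.h>
#include <cuda_runtime.h>

// Invoked on a driver-managed CPU thread once all work previously
// submitted to the stream has completed.
static void CUDART_CB myCallback(cudaStream_t stream, cudaError_t status,
                                 void *userData)
{
    printf("stream callback fired: status=%d userData=%p\n",
           (int) status, userData);
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // ... enqueue kernels and/or async memcpys on the stream here ...

    // Fires after everything enqueued above has finished. The flags
    // parameter is reserved and must be 0.
    cudaStreamAddCallback(stream, myCallback, NULL, 0);

    cudaStreamSynchronize(stream);  // also waits for the callback to return
    cudaStreamDestroy(stream);
    return 0;
}
```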
It is hard to imagine how NVIDIA implemented stream callbacks without creating at least one extra CPU thread, something we never did while I was working on CUDA. We always knew we could do useful things by creating CPU threads without the developer’s knowledge or involvement, like asynchronous memcpy of pageable memory. But we were reluctant to create threads without an opt-in or some other sign that the developer wanted extra threads running around in their process. Also, making the driver fully thread-safe was a very difficult problem, not solved until CUDA 4.0.
So I wrote an application, using code from this MSDN page, to count the active threads in the process. (Note that in a multithreaded operating system, “the number of active threads” is always subject to change; in Windows, you take a “snapshot” of the process, presumably implemented using copy-on-write semantics like fork() in old-time UNIX, and count the active threads in the snapshot.)
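The thread-counting routine is essentially the Toolhelp32 walk from that MSDN sample, wrapped in a helper (the function name is mine):

```cpp
#include <windows.h>
#include <tlhelp32.h>

// Count the threads belonging to the current process by walking a
// Toolhelp32 snapshot, per the MSDN Tool Help sample.
unsigned CountActiveThreads(void)
{
    unsigned count = 0;
    DWORD pid = GetCurrentProcessId();
    HANDLE hSnapshot = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
    if (hSnapshot == INVALID_HANDLE_VALUE)
        return 0;
    THREADENTRY32 te;
    te.dwSize = sizeof(te);
    if (Thread32First(hSnapshot, &te)) {
        do {
            // The snapshot is system-wide; keep only our process's threads.
            if (te.th32OwnerProcessID == pid)
                count++;
        } while (Thread32Next(hSnapshot, &te));
    }
    CloseHandle(hSnapshot);
    return count;
}
```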
I learned some interesting things. First of all, the CUDA runtime creates two threads at initialization time: my app reports three active threads after cudaFree(0) has been called. (I had to add a one-second sleep to detect this, since snapshotting and counting the active threads immediately after calling cudaFree(0) still yielded a thread count of 1, the application thread itself.) Using the CUDA driver API, I confirmed that the CUDA driver does not create any extra threads at initialization time (cuInit()), but does create them along with the context (cuCtxCreate()).
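Here is a sketch of that driver API probe, reusing the CountActiveThreads() helper above (error checking elided):

```cpp
#include <windows.h>
#include <stdio.h>
#include <cuda.h>

unsigned CountActiveThreads(void);  // the Toolhelp32 helper sketched earlier

int main()
{
    printf("at startup:          %u thread(s)\n", CountActiveThreads());

    cuInit(0);
    Sleep(1000);  // give any lazily created threads time to appear
    printf("after cuInit():      %u thread(s)\n", CountActiveThreads());

    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    Sleep(1000);
    printf("after cuCtxCreate(): %u thread(s)\n", CountActiveThreads());

    cuCtxDestroy(ctx);
    return 0;
}
```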
Anyway, it seems that the first time cudaStreamAddCallback() is called, yet another CPU thread is created for the purpose of calling into stream callbacks. I instrumented my app to detect and report any change in the ID of the calling thread, and confirmed that only one thread is ever used for this purpose. (But it would seem unwise to rely on that behavior!)
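The instrumentation amounted to something along these lines (a simplified sketch, not my exact test code):

```cpp
#include <windows.h>
#include <stdio.h>
#include <cuda_runtime.h>

// Record the ID of the thread the driver uses to invoke callbacks, and
// flag any change from one invocation to the next. No locking is needed:
// callbacks in a single stream are serialized.
static void CUDART_CB reportThreadId(cudaStream_t stream, cudaError_t status,
                                     void *userData)
{
    static DWORD lastId = 0;
    DWORD id = GetCurrentThreadId();
    if (lastId != 0 && id != lastId)
        printf("callback thread changed: %lu -> %lu\n", lastId, id);
    lastId = id;
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int i = 0; i < 1000; i++)
        cudaStreamAddCallback(stream, reportThreadId, NULL, 0);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```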
It’s important not to make any assumptions about the threading model for stream callbacks. Make sure your code is thread-safe. The verbiage in the CUDA C Programming Guide doesn’t exactly lend confidence, either:
[C]allbacks must not use any CUDA APIs. They can, however, make blocking calls but should not create a transitive dependency on CUDA APIs that are not guaranteed to be ready before the callback. For example, waiting for a thread that is waiting on cudaStreamSynchronize() is as bad as calling it directly. Callbacks carry the same scheduling restrictions as other commands issued on streams.
They, in a given stream, execute in the order in which they are added. However, false dependencies between commands issued to different streams (refer to Streams) can introduce false dependencies between callbacks from different streams and between kernels and callbacks in different streams. Further, as the execution order of callbacks is not guaranteed, it is not safe to have synchronization in user code between callbacks.
“Blocking” callbacks are ones that suspend execution of the CUDA stream until they’ve returned.
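To make the “transitive dependency” warning concrete, here is a sketch of the deadlock it describes (a deliberately broken example; do not do this):

```cpp
// ANTI-PATTERN: the callback blocks the stream, and cudaStreamSynchronize()
// waits on the stream, so waiting (even indirectly) on the synchronizing
// thread from inside the callback deadlocks.
#include <windows.h>
#include <cuda_runtime.h>

static HANDLE g_done;  // the main thread signals this after synchronizing

static void CUDART_CB badCallback(cudaStream_t stream, cudaError_t status,
                                  void *userData)
{
    // Transitive dependency on cudaStreamSynchronize(): never signaled.
    WaitForSingleObject(g_done, INFINITE);
}

int main()
{
    g_done = CreateEvent(NULL, TRUE, FALSE, NULL);
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaStreamAddCallback(stream, badCallback, NULL, 0);
    cudaStreamSynchronize(stream);  // blocks forever waiting for badCallback
    SetEvent(g_done);               // never reached
    return 0;
}
```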
So what are stream callbacks good for? Frankly, I am not yet sure. I don’t think they enable anything that wasn’t possible before; they seem to be implemented in terms of previously available CUDA abstractions (a combination of blocking events and having the CUDA driver’s CPU thread alternate between waiting for a CUDA event and calling into the application’s callback). Maybe they’re intended to enable something for applications that use nested parallelism.
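To illustrate what I mean, here is a sketch of how one might emulate a stream callback with those pre-5.0 primitives. This is my speculation about the mechanism, not NVIDIA’s actual implementation:

```cpp
#include <thread>
#include <cuda_runtime.h>

typedef void (*HostCallback)(void *userData);

// Emulate a stream callback: record a blocking-sync event after the
// stream's pending work, then have a dedicated CPU thread wait on the
// event and invoke the callback.
static void emulateStreamCallback(cudaStream_t stream, HostCallback fn,
                                  void *userData)
{
    cudaEvent_t ev;
    // cudaEventBlockingSync makes cudaEventSynchronize() yield the CPU
    // instead of spin-waiting.
    cudaEventCreateWithFlags(&ev, cudaEventBlockingSync);
    cudaEventRecord(ev, stream);
    std::thread([=] {
        cudaEventSynchronize(ev);  // all prior work in the stream is done
        fn(userData);
        cudaEventDestroy(ev);
    }).detach();
}
```

One visible difference: unlike cudaStreamAddCallback(), this sketch does not stall subsequent work in the stream while the callback runs; presumably the driver also enqueues a synchronization point that its callback thread releases when the callback returns.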
In the meantime, if stream callbacks provide a more convenient mechanism to implement your application, no one will fault you for using them.