The first multi-GPU implementation of N-body is done, and it’s old school: it reuses the thread delegation code from the multithreaded CPU implementation, spawning a separate CPU thread for each GPU.
For workloads like N-body, that’s probably not necessary – N-body is so GPU-bound that the CPU is more of a traffic cop than an active contributor to the computation – but in a world where 4-core CPUs are common and 16-core CPUs are available, it seems terribly wasteful to drive multiple GPUs from a single core. The single-threaded multi-GPU implementation is forthcoming, so we’ll know soon whether it’s a win on this particular app.
The key element of multi-GPU programming is portable pinned memory. We added this feature in CUDA 2.2, along with a host of other pinned-memory features (mainly mapped pinned memory, but also the ability to allocate write-combining pinned memory). In previous versions of CUDA, only the context that called cudaMallocHost() treated the allocation as pinned; in CUDA 2.2, portable host memory allocations are available to all contexts. As a result, the application realizes the benefits of pinned memory (for example, that it can be passed to cudaMemcpyAsync()) for every CUDA context – and every GPU – in the system.
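A minimal sketch of the idea, with the buffer size, function names, and stream handling invented for illustration: one portable allocation is shared by per-GPU worker threads, and each thread’s context sees it as pinned.

    // Sketch: one portable pinned allocation shared by per-GPU worker threads.
    // Error checking, sizes, and stream setup are illustrative only.
    #include <cuda_runtime.h>

    float *g_hostBuf = 0;   // pinned and portable - usable from every context

    void allocSharedBuffer( size_t N )
    {
        // cudaHostAllocPortable makes the pinning visible to all contexts,
        // not just the one that performed the allocation.
        cudaHostAlloc( (void **) &g_hostBuf, N*sizeof(float),
                       cudaHostAllocPortable );
    }

    // Called by the worker thread that owns device 'dev':
    void workerCopy( int dev, float *devBuf, size_t N, cudaStream_t stream )
    {
        cudaSetDevice( dev );
        // Because the allocation is portable, this context also sees the
        // memory as pinned, so the copy can be a true asynchronous DMA.
        cudaMemcpyAsync( devBuf, g_hostBuf, N*sizeof(float),
                         cudaMemcpyHostToDevice, stream );
    }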
Adding portable pinned memory and mapped pinned memory in the same release forced us to resolve some API design problems that might have been intractable if we had shipped one feature and not the other. cudaHostAlloc() can perform a portable allocation and, optionally, also map that memory into the CUDA address space(s) of the GPU(s) in the system. There was a strong temptation to pass back both the host and device pointers in that one API call, but at the time we could not guarantee that every GPU in a multi-GPU system would map the allocation at the same GPU address. So we separated the request that host memory be mapped (cudaHostAlloc()) from the query to obtain the device pointer corresponding to a given host allocation (cudaHostGetDevicePointer()).
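Here is a rough sketch of how that pair of calls fits together. The function name and loop are mine, the loop uses the post-CUDA 4.0 ability to switch devices from one thread, and it assumes that on pre-UVA systems cudaSetDeviceFlags(cudaDeviceMapHost) was applied to each device before its context was created.

    // Sketch: mapped + portable allocation, with the device pointer
    // queried separately for each GPU (the CUDA 2.2-era API contract).
    #include <cuda_runtime.h>

    void mapForAllGPUs( size_t N )
    {
        float *hostPtr;
        int deviceCount;

        // One allocation: pinned, mapped, and visible to every context.
        cudaHostAlloc( (void **) &hostPtr, N*sizeof(float),
                       cudaHostAllocPortable | cudaHostAllocMapped );

        cudaGetDeviceCount( &deviceCount );
        for ( int dev = 0; dev < deviceCount; dev++ ) {
            float *devPtr;
            cudaSetDevice( dev );
            // Before UVA, each GPU could map the allocation at a different
            // address, so the query has to be made per device.
            cudaHostGetDevicePointer( (void **) &devPtr, hostPtr, 0 );
            // devPtr may now be passed to kernels launched on this GPU.
        }
    }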
With CUDA 4.0, unified virtual addressing does guarantee that the device and host pointers are all the same – but when CUDA 2.2 was being designed, the GPUs did not support 64-bit addressing and, in any case, even today not all operating systems support UVA!
CUDA 4.0 also simplified multi-GPU programming in another way: it enabled CUDA runtime applications to drive multiple GPUs from a single CPU thread*. Before CUDA 4.0, the only way to drive multiple GPUs was to spawn a thread per GPU. cudaSetDevice() had to be called before any CUDA initialization had occurred and, once it was called, that CPU thread henceforth could operate that device and only that device. This awkward API semantic – reminiscent of OpenGL and its “current context” – was definitely not the end product we wanted, but it was extremely challenging to make the CUDA driver thread-safe without incurring performance penalties due to excessive contention on the needed mutexes.
But after the driver was made thread-safe, it became possible for the cudaSetDevice() call to implement the semantic that developers tended to expect: after a cudaSetDevice() call, the corresponding CUDA context is made current to that thread and subsequent CUDA API calls and kernel launches are performed on the GPU corresponding to that context.
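As a rough sketch of what that enables (the kernel name, launch configuration, and data layout below are placeholders, not the book’s actual N-body code), a single CPU thread can now fan work out to every GPU just by switching devices:

    // Sketch of the post-CUDA 4.0 idiom: one CPU thread drives every GPU
    // by switching the current device with cudaSetDevice().
    #include <cuda_runtime.h>

    // Placeholder kernel, defined elsewhere.
    __global__ void ComputeNBody( float4 *pos, float4 *force, int n );

    void launchOnAllGPUs( float4 **d_pos, float4 **d_force, int nPerGPU )
    {
        int deviceCount;
        cudaGetDeviceCount( &deviceCount );

        // Kernel launches are asynchronous, so this loop queues work on
        // every GPU before any of them has to finish.
        // (Assumes nPerGPU is a multiple of 256 for simplicity.)
        for ( int dev = 0; dev < deviceCount; dev++ ) {
            cudaSetDevice( dev );   // make this GPU's context current
            ComputeNBody<<< nPerGPU/256, 256 >>>( d_pos[dev], d_force[dev],
                                                  nPerGPU );
        }

        // Wait for all GPUs to finish.
        for ( int dev = 0; dev < deviceCount; dev++ ) {
            cudaSetDevice( dev );
            cudaDeviceSynchronize();
        }
    }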
It will not take long to get the single-threaded multi-GPU implementation of N-body going, and that will conclude the multi-GPU (Chapter 9) coverage in the book. (UPDATE: the single-threaded implementation that uses cudaSetDevice() is done, and performance testing confirms that the performance difference on this workload is negligible.)
* CUDA 2.2 added the “current context” stack that could be manipulated with the driver API’s cuCtxPopCurrent()/cuCtxPushCurrent(), but few developers use the driver API.