
N-Body Update: Multi-GPU

The first multi-GPU implementation of N-body is done, and it’s old school: using the same thread delegation code as the multithreaded CPU implementation, it uses a separate CPU thread for each GPU.

For workloads like N-body, that’s probably not necessary – N-body is so GPU-bound that the CPU is more of a traffic cop than an active contributor to the computation – but in a world where 4-core CPUs are common and 16-core CPUs available, it seems terribly wasteful to drive multiple GPUs from a single core. The single-threaded multi-GPU implementation is forthcoming, so we’ll know soon whether it’s a win on this particular app.

The key element of multi-GPU programming is portable pinned memory. We added this feature in CUDA 2.2, along with a host of other pinned memory-related features (mainly mapped pinned memory, but also the ability to allocate write-combining pinned memory). Unlike previous versions of CUDA, where only the context that called cudaMallocHost() could remember that the allocation was pinned, in CUDA 2.2, portable host memory allocations are available to all contexts. As a result, the application realizes the benefits of pinned memory (say, that it can be specified to cudaMemcpyAsync()) for every CUDA context – and every GPU – in the system.
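
Here is a minimal sketch of the idea (the buffer size and loop structure are illustrative assumptions, not code from the book, and it uses the CUDA 4.0-style cudaSetDevice() semantics described below): one portable pinned allocation being staged to every GPU with cudaMemcpyAsync().

    // Sketch: one portable pinned host buffer serves asynchronous copies to
    // every GPU in the system. Buffer size is arbitrary; error checking omitted.
    #include <cuda_runtime.h>

    int main()
    {
        const size_t cBytes = 1 << 20;
        void *hostBuf = 0;
        int deviceCount = 0;

        cudaGetDeviceCount( &deviceCount );

        // cudaHostAllocPortable makes the pinned allocation visible to every
        // CUDA context, so any GPU can treat it as pinned memory.
        cudaHostAlloc( &hostBuf, cBytes, cudaHostAllocPortable );

        for ( int dev = 0; dev < deviceCount; dev++ ) {
            void *devBuf = 0;
            cudaSetDevice( dev );
            cudaMalloc( &devBuf, cBytes );
            // Because hostBuf is portable, this copy is eligible to be truly
            // asynchronous on every device, not just the one whose context
            // performed the allocation.
            cudaMemcpyAsync( devBuf, hostBuf, cBytes, cudaMemcpyHostToDevice, 0 );
            cudaDeviceSynchronize();
            cudaFree( devBuf );
        }
        cudaFreeHost( hostBuf );
        return 0;
    }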

Adding portable pinned memory and mapped pinned memory in the same release forced us to resolve some API design problems that might have been difficult, if we had shipped one feature and not the other. cudaHostAlloc() can perform a portable allocation, and optionally also map that memory into the CUDA address space(s) of the GPU(s) in the system. There was a strong temptation to pass back both the host and device pointer in that one API call, but at the time we could not guarantee that every GPU in a multi-GPU system would get the same GPU address. So we separated the request that host memory be mapped (cudaHostAlloc()) from the query to obtain the device pointer corresponding to a given host allocation (cudaHostGetDevicePointer()).
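
A sketch of that two-step idiom (the allocation size, the per-device flag setup, and the loop structure are illustrative assumptions, not the book's code):

    // Sketch: allocate pinned memory that is both portable and mapped, then
    // query each device's pointer for it separately. Error checking omitted.
    #include <cuda_runtime.h>

    int main()
    {
        const size_t cBytes = 1 << 20;
        void *hostPtr = 0;
        int deviceCount = 0;

        cudaGetDeviceCount( &deviceCount );

        // Pre-UVA, mapping requires the cudaDeviceMapHost flag to be set on
        // each device before its context is initialized.
        for ( int dev = 0; dev < deviceCount; dev++ ) {
            cudaSetDevice( dev );
            cudaSetDeviceFlags( cudaDeviceMapHost );
        }

        // Request a pinned allocation that is portable and mapped...
        cudaHostAlloc( &hostPtr, cBytes, cudaHostAllocPortable | cudaHostAllocMapped );

        // ...then ask each device for its view of that allocation. Before UVA,
        // these pointers were not guaranteed to be identical across devices.
        for ( int dev = 0; dev < deviceCount; dev++ ) {
            void *devPtr = 0;
            cudaSetDevice( dev );
            cudaHostGetDevicePointer( &devPtr, hostPtr, 0 );
            // devPtr may now be passed to kernels launched on this device.
        }
        cudaFreeHost( hostPtr );
        return 0;
    }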

With CUDA 4.0, unified virtual addressing does guarantee that the device and host pointers are all the same – but when CUDA 2.2 was being designed, the GPUs did not support 64-bit addressing and, in any case, even today not all operating systems support UVA!
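
If you need to check at runtime, the device properties report it; a quick sketch (the printout is just for illustration):

    // Sketch: report whether each device is operating under unified virtual addressing.
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main()
    {
        int deviceCount = 0;
        cudaGetDeviceCount( &deviceCount );
        for ( int dev = 0; dev < deviceCount; dev++ ) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties( &prop, dev );
            printf( "Device %d (%s): UVA %s\n", dev, prop.name,
                    prop.unifiedAddressing ? "yes" : "no" );
        }
        return 0;
    }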

CUDA 4.0 also simplified multi-GPU programming in another way: it enabled CUDA runtime applications to drive multiple GPUs from a single CPU thread*. Before CUDA 4.0, the only way to drive multiple GPUs was to spawn a thread per GPU. cudaSetDevice() had to be called before any CUDA initialization had occurred and, once it was called, that CPU thread could henceforth operate on that device, and only that device. This awkward API semantic – reminiscent of OpenGL and its “current context” – was definitely not the end product we wanted, but it was extremely challenging to make the CUDA driver thread-safe without incurring performance penalties due to excessive contention on the needed mutexes.

But after the driver was made thread-safe, it became possible for the cudaSetDevice() call to implement the semantic that developers tended to expect: after a cudaSetDevice() call, the corresponding CUDA context is made current to that thread and subsequent CUDA API calls and kernel launches are performed on the GPU corresponding to that context.
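
In outline, single-threaded multi-GPU code under the CUDA 4.0 semantics looks something like the following sketch; the kernel, launch configuration, and allocation size are placeholders, not the book's N-body code.

    // Sketch: drive every GPU from one CPU thread by switching the current
    // device with cudaSetDevice(). Kernel and sizes are placeholders.
    #include <cuda_runtime.h>

    __global__ void DoWork( float *p, size_t N ) { /* placeholder kernel */ }

    int main()
    {
        const size_t N = 1 << 20;
        float *dptr[16] = {0};      // assume at most 16 devices for this sketch
        int deviceCount = 0;

        cudaGetDeviceCount( &deviceCount );

        // Launch asynchronously on each device in turn...
        for ( int dev = 0; dev < deviceCount && dev < 16; dev++ ) {
            cudaSetDevice( dev );
            cudaMalloc( &dptr[dev], N*sizeof(float) );
            DoWork<<<32, 128>>>( dptr[dev], N );    // runs on the current device
        }
        // ...then come back and synchronize with (and clean up after) each one.
        for ( int dev = 0; dev < deviceCount && dev < 16; dev++ ) {
            cudaSetDevice( dev );
            cudaDeviceSynchronize();
            cudaFree( dptr[dev] );
        }
        return 0;
    }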

It will not take long to get the single-threaded multi-GPU implementation of N-body going, and that will conclude the multi-GPU (Chapter 9) coverage in the book. (UPDATE: the single-threaded implementation that uses cudaSetDevice() is done, and performance testing confirms that the performance difference on this workload is negligible.)

* CUDA 2.2 added the “current context” stack that could be manipulated with the driver API’s cuCtxPopCurrent()/cuCtxPushCurrent(), but few developers use the driver API.
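
For the curious, that driver API pattern looks roughly like this (error checking omitted; the hand-off to a second thread is only indicated in comments):

    // Sketch: detach a context from one CPU thread and attach it to another
    // using the driver API's current-context stack.
    #include <cuda.h>

    int main()
    {
        CUdevice dev;
        CUcontext ctx;

        cuInit( 0 );
        cuDeviceGet( &dev, 0 );
        cuCtxCreate( &ctx, 0, dev );    // ctx becomes current to this thread

        // Pop the context so this thread no longer owns it...
        cuCtxPopCurrent( &ctx );

        // ...and in another thread, push it to make it current there.
        cuCtxPushCurrent( ctx );

        // Do work in the context, then release it.
        cuCtxPopCurrent( &ctx );
        cuCtxDestroy( ctx );
        return 0;
    }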

N-Body: Multithreaded SSE

The multithreaded variant of SSE N-body is complete, and I’ve had the opportunity to gather some timing information.

Three variants of N-body were timed: single-threaded SSE, multithreaded SSE, and the shared memory (fastest so far) GPU formulation.

Three platforms were tested, two of them on Amazon EC2:

  • cg1.4xlarge (2× Xeon X5570 “Nehalem” with 2× Tesla M2050 GPUs)
  • cc2.8xlarge (2× Xeon E5-2670 “Sandy Bridge”)
  • a local machine with a GeForce GTX 680 (GK104)

If I thought operating system mattered, I wouldn’t put these timings next to one another – the EC2 instances are running Amazon Linux but my GeForce GTX 680 is running Windows 7. With that caveat, the timings for N=4096, N=8192, N=16384 are as follows:

    N      cg1 (SSE)   cc2 (SSE)   cg1 (MT)   cc2 (MT)   cg1 (GPU)   GK104
 4096          36.24       40.31       4.78       2.79        1.35     0.66
 8192         144.50      161.32      20.13       9.45        3.83     1.60
16384         576.75      649.83      72.78      39.46       12.66     6.05

Times are in milliseconds. The only difference between the SSE and MT timings is that the MT timings were done with multiple threads.

The first thing that struck me about the timings is that on the single-threaded SSE workload, cc2.8xlarge is slower than cg1.4xlarge! This result is surprising, since Intel generally takes performance compatibility very seriously – they have a business incentive to deliver higher performance on old workloads, since that removes the need for customers to update their software to benefit from newer hardware. But it can be explained if Sandy Bridge delivers the same single-threaded performance clock-for-clock: the clock rate of the CPUs in cc2.8xlarge is about 10% lower.

So the bulk of the benefit of cc2.8xlarge over cg1.4xlarge (or its GPU-less equivalent, cc1.4xlarge) is from an increased core count*. And on that front, Sandy Bridge delivers: scaling is excellent in the multithreaded case, with speedups of 14.4-17x on cc2.8xlarge versus 7.2-7.9x on cg1.4xlarge. We could double the core count again and probably still see excellent scaling.

No surprise that Intel expects us to port to AVX to get the full benefit of Sandy Bridge. But even if that delivers the expected doubling of performance, the multithreaded version would still be about 3.26x slower than GK104 on the 16K-body case (39.46 ms / 2 ≈ 19.7 ms, versus 6.05 ms). And NVIDIA has their own doubling of performance in the offing, in the form of GK110.

And, I haven’t pushed the multi-GPU version yet. That will scale nicely in the number of GPUs.

* The number of threads selected is based on the number of cores available – in Linux, it is implemented in chThread.h as follows:

#include <unistd.h>   // for sysconf()

inline unsigned int
processorCount()
{
    // number of processors currently online
    return sysconf( _SC_NPROCESSORS_ONLN );
}

The values are 16 and 8 for cc2.8xlarge and cg1.4xlarge, respectively.
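
That thread count feeds a straightforward partitioning of the N bodies across worker threads. Here is a sketch of the general shape, using std::thread and a placeholder computeRange() worker rather than the book's chThread.h classes:

    // Sketch: split the N bodies evenly across one worker thread per core.
    // computeRange() stands in for the per-thread N-body inner loops.
    #include <algorithm>
    #include <thread>
    #include <vector>
    #include <unistd.h>

    inline unsigned int processorCount() { return sysconf( _SC_NPROCESSORS_ONLN ); }

    void computeRange( size_t iStart, size_t iEnd ) { /* placeholder worker */ }

    void computeAllBodies( size_t N )
    {
        unsigned int numThreads = processorCount();
        size_t bodiesPerThread = (N + numThreads - 1) / numThreads;
        std::vector<std::thread> workers;

        for ( unsigned int t = 0; t < numThreads; t++ ) {
            size_t iStart = t * bodiesPerThread;
            size_t iEnd = std::min( N, iStart + bodiesPerThread );
            if ( iStart >= iEnd ) break;
            workers.emplace_back( computeRange, iStart, iEnd );
        }
        for ( auto& w : workers ) {
            w.join();
        }
    }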

N-Body: FP Atomics v. Recomputation

CUDA developers who are recent refugees from the land of CPU programming often have to learn the hard way that sensible CPU optimizations don’t always work well on GPUs. Lookup tables, for example, are to be avoided. (Okay, not the best example since it is also true on modern CPUs.) Here’s a better one: registers are so precious that it’s often better to recompute results than to store them. GPUs have brought brute force back into style!

But even experienced GPU coders need the occasional reminder that the GPU’s sheer computational power often can overwhelm the benefits of small-scale optimizations. This week, as I continued developing the N-body chapter, I wanted to try taking advantage of the fact that gravitational forces are reciprocal in character: for any two bodies i and j, f_ji = −f_ij. By applying this optimization, only half as many of the expensive inner-loop body-body computations (estimated at 20 FLOPS) need to be performed.

The original N-body paper dismisses the idea in a footnote: “…this optimization has an adverse effect on parallel evaluation strategies (especially with small N), so it is not employed in our implementation.”

At first blush, subtracting the 3D force from the other body’s force sum (3 FLOPS) seems a lot cheaper than the computation needed for a body-body interaction: taking a 3D vector difference, computing a dot product and reciprocal square root, and doing a number of multiply-adds. But there is a problem: another thread is concurrently computing the acceleration sum from which the body-body gravitational force values must be subtracted. Some form of synchronization is needed.
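
Spelled out, the term being computed and its mirror are the standard softened gravitational interaction (with ε² playing the role of softeningSquared in the code):

    \mathbf{f}_{ij} = \frac{G\, m_i m_j (\mathbf{r}_j - \mathbf{r}_i)}{\left(\lVert \mathbf{r}_j - \mathbf{r}_i \rVert^2 + \varepsilon^2\right)^{3/2}},
    \qquad
    \mathbf{f}_{ji} = -\mathbf{f}_{ij}

Computing f_ij once therefore yields the contribution to both bodies’ sums, which is where the factor-of-two savings comes from.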

To address that problem, the Kepler architecture gives us a tool that wasn’t available when the original N-body paper was written: fast atomics. (Fermi-class hardware added the ability to do atomic floating point add, but it was prohibitively slow until Kepler.) Multiple threads can use atomic add on the same memory locations, and the hardware will enforce mutual exclusion to prevent race conditions from corrupting the result. With atomic add, a few small modifications enable the inner loop to perform half as many body-body interactions and to add the negative force to the other body’s sum:

 <        for ( int j = 0; j < N; j++ ) {
 >        for ( int j = 0; j < i; j++ ) {           // visit only pairs with j < i
             float4 body = ((float4 *) posMass)[j];

             T ax, ay, az;
             bodyBodyInteraction( ax, ay, az, myX, myY, myZ, body.x, body.y, body.z, body.w, softeningSquared);

             acc[0] += ax; acc[1] += ay; acc[2] += az;

 >            // apply the equal-and-opposite contribution to body j's sum
 >            float *f = &force[3*j+0];
 >            atomicAdd( f+0, -ax );
 >            atomicAdd( f+1, -ay );
 >            atomicAdd( f+2, -az );

         }

By the way, the address arithmetic is written that way because the compiler generates much better code than if you write the equivalent (and more readable):

             atomicAdd( &force[3*j+0], -ax );
             atomicAdd( &force[3*j+1], -ay );
             atomicAdd( &force[3*j+2], -az );

A subtle benefit of this approach is that the atomics hardware complements the floating point hardware in the Streaming Multiprocessors. Unfortunately though, I can report that the atomics-based implementation isn’t any faster on GK104: to process 8192 bodies (about 67M interactions), the three implementations (atomics, naive, shared) perform as follows:

Atomics: 90 ms
Naive: 62 ms
Shared: 55 ms

This negative result can be explained if you cast it in the light of GPUs’ predilection for brute force: given a choice between 3 FLOPS and 3 atomic adds to disparate memory locations, or 20 FLOPS done on uniformly-referenced, read-only memory, the brute force 20-FLOP solution still wins.

Lesson learned.

It is possible that the global memory references performed by this kernel are not ideal (especially if the traffic is camping on a subset of the hardware that implements atomics), and a formulation that conserves external bandwidth – perhaps taking inspiration from a tiled matrix transpose implementation – might be faster; but that optimized implementation would still have to outperform a shared memory implementation that is almost twice as fast.

As with all CUDA handbook source code, this kernel is open source and available on github.