N-Body: FP Atomics v. Recomputation

CUDA developers who are recent refugees from the land of CPU programming often have to learn the hard way that sensible CPU optimizations don’t always work well on GPUs. Lookup tables, for example, are to be avoided. (Okay, not the best example since it is also true on modern CPUs.) Here’s a better one: registers are so precious that it’s often better to recompute results than to store them. GPUs have brought brute force back into style!

But even experienced GPU coders need the occasional reminder that their sheer computational power often can overwhelm the benefits of small-scale optimizations. This week, as I continued developing the N-body chapter, I wanted to try out the idea of taking advantage of the fact that gravitational forces are reciprocal in character: for any two bodies i and jfji = − fij. By applying this optimization, only half as many of the expensive inner-loop body-body computations (estimated at 20 FLOPS) need to be performed.

The original N-body paper dismisses the idea in a footnote: “…this optimization has an adverse effect on parallel evaluation strategies (especially with small N), so it is not employed in our implementation.”

At first blush, subtracting the 3D force from the correct force vector (3 FLOPS) seems a lot cheaper than the computation needed for a body-body interaction: taking a 3D vector difference, computing a dot product and reciprocal square root, and doing a number of multiply-adds. But there is a problem: another thread is concurrently computing the acceleration sum from which the body-body gravitational force values must be subtracted. Some form of synchronization is needed.

To address that problem, the Kepler architecture gives us a tool that wasn’t available when the original N-body paper was written: fast atomics. (Fermi-class hardware added the ability to do atomic floating point add, but it was prohibitively slow until Kepler.) Multiple threads can using atomic add on the same memory locations and the hardware will enforce mutual exclusion to prevent race conditions from corrupting the result. Using atomic add, a few small modifications enable the inner loop to perform half as many body-body interactions and add the negative force to the correct sum:

 <        for ( int j = 0; j < N; j++ ) {
 >        for ( int j = 0; j < i; j++ ) {
             float4 body = ((float4 *) posMass)[j];
 
             T ax, ay, az;
             bodyBodyInteraction( ax, ay, az, myX, myY, myZ, body.x, body.y, body.z, body.w, softeningSquared);
 
             acc[0] += ax; acc[1] += ay; acc[2] += az;
 
 >            float *f = &force[3*j+0];
 >            atomicAdd( f+0, -ax );
 >            atomicAdd( f+1, -ay );
 >            atomicAdd( f+2, -az );
 
         }

By the way, the address arithmetic is written that way because the compiler generates much better code than if you write the equivalent (and more readable):

             atomicAdd( &force[3*j+0], -ax );
             atomicAdd( &force[3*j+1], -ay );
             atomicAdd( &force[3*j+2], -az );

A subtle benefit of this approach is that the atomics hardware complements the floating point hardware in the Streaming Multiprocessors. Unfortunately though, I can report that the atomics-based implementation isn’t any faster on GK104: to process 8192 bodies (about 67M interactions), the three implementations (atomics, naive, shared) perform as follows:

Atomics: 90 ms
Naive: 62 ms
Shared: 55 ms

This negative result can be explained if you cast it in the light of GPUs’ predilection for brute force: given a choice between 3 FLOPS and 3 atomic adds to disparate memory locations, or 20 FLOPS done on uniformly-referenced, read-only memory, the brute force 20-FLOP solution still wins.

Lesson learned.

It is possible that the global memory references by this kernel are not ideal (especially if it’s camping on a subset of the hardware that implements atomics), and it might be faster to conserve external bandwidth, perhaps taking inspiration from a tiled matrix transpose implementation – but the optimized implementation would have to outperform a shared memory implementation that’s almost twice as fast.

As with all CUDA handbook source code, this kernel is open source and available on github.

Multithreaded SSE Timings

The N-body code has been posted to github; it has not yet been ported to Linux, but that will not be difficult to do. It does not use graphics, it is just a Win32 console application that responds to keyboard input (ironically, other than porting the threading code, the keyboard support is the hard part about the Linux port of this app).

So far there are 8 formulations:

1) CPU_AOS (array-of-structures implementation – the gold standard to which we compare other implementations),
2) CPU_SOA (structure-of-arrays implementation) – a stepping stone to the SSE implementation (and surprisingly faster than the AOS version),
3) CPU_SSE (SSE implementation) – I blogged about this a few weeks ago,
4) CPU_SSE_threaded( multithreaded SSE implementation) – multithreaded SSE implementation that uses N threads on an N-core processor,
5) GPU_AOS – a straight port of the CPU_AOS code,
6) GPU_Shared –  GPU_AOS, optimized to use shared memory per the original CUDA N-body paper,
7) GPU_Atomic – a Kepler-specific version that uses atomics to do half as many body-body interaction computations, (but isn’t faster – I plan to blog about that later),
8) GPU_Shuffle – a Kepler-specific version that uses the new warp shuffle instruction instead of shared memory.

Here are some initial timing results for 16K bodies:

CPU_AOS – 4494 ms (60.0 Minteractions/s)
CPU_SOA – 3210 ms (83.6 Minteractions/s)
CPU_SSE – 612.6 ms (438 Minteractions/s)
CPU_SSE_threaded – 192 ms (1400 Minteractions/s)
GPU_AOS – 62 ms (4300 Minteractions/s)
GPU_Atomic – 93 ms (2886 Minteractions/s)
GPU_Shared – 55 ms (4881 Minteractions/s)
GPU_Shuffle – 58.7 ms (4573 Minteractions/s)

The fastest GPU performance is about 3.5 faster than the multithreaded SSE implementation. Another way to look at it: the SSE implementation will run on any CPU built since 2001 or so, and it’s only a factor of 3.5 slower. Then again, the code is super-gnarly compared to the CUDA code, especially the threading code. (Don’t take my word for it – take a look for yourself.) So here again, CUDA “wins” as much for programmability as for the performance win.

The interesting thing is that this data is gathered on a rather old CPU (a 2.8 GHz Intel i7 “Nehalem”), and the GPU is a relatively recent GK104. Once I get the code running on a contemporary CPU (say, a Sandy Bridge) I’ll report timings on that platform.

I did take a look at an AVX port, but that seemed even more off-topic than the SSE port, which was a stretch. That project will have to wait until the book’s done.

Next up: a Linux port and multi-GPU support.

Rough Cuts

I have updated the Web site’s Sample Chapters section to point to The CUDA Handbook‘s page in Safari Books Online’s Rough Cuts. Rough Cuts gives early access to books while they are still in progress – 3-6 months from publication, and includes tools for readers to give feedback to the author while there’s still time to make changes.

Several chapters are still missing, but they will be uploaded soon.

I’m still interested in reviewers who would be willing to take a look at sample chapters and submit feedback. Write me at nicholas@archaeasoftware.com.

N-Body: CUDA Versus Old-School SSE

This post will discuss SIMD instruction sets, peak floating point performance for CPUs, programming models, and ease-of-use, as applied to the flagship N-body application for CUDA.

My involvement in SIMD instruction sets dates back to the mid-1990s, when x86 vendors were adding MMX (all vendors), SSE (Intel) and 3DNow! (AMD, Cyrix et al.) to the x86 architecture. These SIMD instruction sets reflect a trend that had been developing in CPUs for some time: that because most of the die area in a CPU is dedicated to cache, building more-capable execution units incurs design and validation costs, but does not greatly increase manufacturing costs.

I was the development lead for Direct3D 6.0, and one of our goals was not only to optimize the geometry pipeline, but to incorporate SSE and 3DNow! optimizations from Intel and AMD, respectively. By closely collaborating with the CPU vendors, we gave developers a strong incentive to use Direct3D’s built-in geometry pipeline, and the CPU vendors got an easy way to deliver the benefits of their new instructions to end users.

SSE (Streaming SIMD Extensions) and 3DNow! are SIMD (single instruction, multiple data) instruction sets that can perform multiple floating point operations in a single instruction. SSE added a new set of 128-bit registers, plus instructions that could perform 4×32-bit floating point operations per instruction. (I am not going to further describe 3DNow! because all x86 vendors implemented SSE many years ago.)

Programming languages do not contain native language support for this type of packed data, so when they designed SSE, the x86 vendors were relying on compiler developers to build vectorizing compilers that could automatically identify parallelism opportunities in scalar code and compile it into high-quality, SIMD-optimized code. As a stopgap, they worked with compiler vendors to meet developers halfway and identify special data types and “intrinsics” (special functions that correspond to machine instructions) to operate on them. Developers using intrinsics must be intimately familiar with the instruction set, but the tasks of register allocation or instruction scheduling are offloaded onto the compiler. I don’t think intrinsics were ever intended to be anything other than a stopgap, but unfortunately, almost 15 years later, it looks like they are still the easiest way to write high-performance floating point code.

When I decided to focus on gravitational simulation for my N-body chapter, I also decided to see how fast a modern CPU can perform this computation. Many CUDA-versus-CPU performance comparisons are heavily biased toward CUDA, by dint of the CPU code being poorly optimized.

NVIDIA’s CPU implementation of N-body, used as a reference for correctness, is no exception. Unless a smart vectorizing compiler can translate the C code, the CPU reference implementation makes no use of the SSE or AVX instruction sets. That’s about a 4x (for SSE) or 8x (for AVX) performance disadvantage.

The body-body interaction code (which NVIDIA estimates to be 20 FLOPS) is as follows:

template <typename T>
__host__ __device__ void bodyBodyInteraction(
    T accel[3],
    T x0, T y0, T z0, T mass0,
    T x1, T y1, T z1, T mass1,
    T softeningSquared)
{
    T dx = x1 - x0;
    T dy = y1 - y0;
    T dz = z1 - z0;
    T distSqr = dx*dx + dy*dy + dz*dz;
    distSqr += softeningSquared;
    T invDist = (T)1.0 / (T)sqrt((double)distSqr);
    T invDistCube =  invDist * invDist * invDist;
    T s = mass1 * invDistCube;
    accel[0] += dx * s;
    accel[1] += dy * s;
    accel[2] += dz * s;
}

This function (templated to provide for both float and double precision) is invoked between every pair of bodies. For each body, the gravitational influence of every other body is added up, then these forces are integrated to change the bodies’ positions and velocities for the next timestep.

Now, if you think a bit about how to accelerate this function using a SIMD instruction set, you realize that what you really want to do is execute it four times in parallel: instead of

T dx = x1 - x0;

you want to write:

__m128 dx = _mm_sub_ps( x1, x0 );

where four bodies’ X positions are subtracted from your body’s X position, and so on. But the memory layout used by NVIDIA’s sample isn’t amenable to this formulation: it must be modified to use an SOA (structure of arrays) representation instead of the default AOS (array of structures) representation. In the case of N-body, where each body has a 3D position (x, y, z) and a mass, the N bodies can be represented as:

typedef struct _body {
    float x, y, z;
    float mass;
} body;

An array of structures (AOS):

struct _body bodies[N];

then contains the input data. The structure of arrays (SOA) representation splits each member of the structure into its own array:

float *bodiesX;
float *bodiesY;
float *bodiesZ;
float *bodiesMass;

It’s easy enough to transform between the two representations such that these identities hold true:

bodies[i].x == bodiesX[i]
bodies[i].y == bodiesY[i]
bodies[i].z == bodiesZ[i]
bodies[i].mass == bodiesMass[i]

Since programmers think of the X/Y/Z/Mass tuples together, the AOS representation seems more intuitive. But for SSE and AVX, the advantage of the SOA representation quickly becomes obvious; since instructions such as SUBPS (subtract packed single-precision float) operate on corresponding floats in the 128-bit registers, it works much better if the input data is streaming into the CPU in the form of packed X’s, Y’s and Z’s as opposed to structures that have to be rearranged in registers. Intel, in fact, was actively evangelizing SOA representations in the early days of SSE, since it enabled 4×4 matrix multiplication of vertices (needed for the geometry pipeline) without rearranging them. As an additional bonus, if the vertices were 3D, there was no need to deal with the inconvenient 12-byte vertex size. Truly general-purpose applications require extra work due to SSE’s alignment constraints (it is fastest to load and store on 16-byte boundaries), and its predilection for doing pretty much 4 of anything at a time necessitates careful handling of boundary conditions.

Once the data is rearranged as SOA, you can reformulate bodybodyInteraction() using SSE intrinsics to operate on four partial sums (gravitational force contributions to a given body) as follows:

inline void
bodyBodyInteraction(
    __m128& a0, __m128& a1, __m128& a2,
    const __m128& x0, const __m128& y0, const __m128& z0,
    const __m128& x1, const __m128& y1, const __m128& z1, const __m128& mass1,
    const __m128& softeningSquared)
{
    __m128 dx = _mm_sub_ps( x1, x0 );
    __m128 dy = _mm_sub_ps( y1, y0 );
    __m128 dz = _mm_sub_ps( z1, z0 );
    __m128 distSq = _mm_add_ps( _mm_add_ps( _mm_mul_ps( dx, dx ), _mm_mul_ps( dy, dy ) ), _mm_mul_ps( dz, dz ) );
    distSq = _mm_add_ps( distSq, softeningSquared );
    __m128 invDist = rcp_sqrt_nr_ps( distSq );
    __m128 invDistCube = _mm_mul_ps( invDist, _mm_mul_ps( invDist, invDist ) );
    __m128 s = _mm_mul_ps( mass1, invDistCube );
    a0 = _mm_add_ps( a0, _mm_mul_ps( dx, s ) );
    a1 = _mm_add_ps( a1, _mm_mul_ps( dy, s ) );
    a2 = _mm_add_ps( a2, _mm_mul_ps( dz, s ) );
}

This routine uses the <xmmintrin.h> header file (available both on Linux for gcc and Windows via MSVC), which declares the __m128 data type and the intrinsic functions that correspond to the SSE instruction set.

This routine uses two subroutines, both of which I was disappointed not to find in <xmmintrin.h> (in fact, <xmmintrin.h> seems not to have changed since 1998 or so when it was introduced): a subroutine to compute the sum of the four 32-bit floats in an SSE register, and a subroutine to compute the reciprocal square root. Because the RSQRTPS instruction only returns a 12-bit approximation, the rcp_sqrt_nr_ps() function, adapted from here, performs one Newton-Raphson iteration to achieve almost-full single precision on the four 32-bit floats in the SSE register. NOTE TO CPU ARCHITECTS: the SFU (“special function unit) in CUDA hardware implements a much higher-precision (22-23 bits) approximation to reciprocal square root that often can be used directly. The instruction sequence generated by rcp_sqrt_nr_ps() is 6 FLOPS but as many as 10-12 instructions (depending on how the constants are handled).

static inline __m128
rcp_sqrt_nr_ps(const __m128 x)
{
    const __m128
        nr      = _mm_rsqrt_ps(x),
        muls    = _mm_mul_ps(_mm_mul_ps(nr, nr), x),
        beta    = _mm_mul_ps(_mm_set_ps1(0.5f), nr),
        gamma   = _mm_sub_ps(_mm_set_ps1(3.0f), muls);
    return _mm_mul_ps(beta, gamma);
}

The other subroutine, adapted from a stackoverflow answer, computes the “horizontal” sum of the four 32-bit floats in an SSE register:

static inline __m128
horizontal_sum_ps( const __m128 x )
{
    const __m128 t = _mm_add_ps(x, _mm_movehl_ps(x, x));
    return _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));
}

I don’t know about you guys, but I don’t think this intrinsics-based SSE code comes close to approaching the readability of the CUDA code, which “looks” scalar and is templatized to support double precision.

It does run fast, however; the resulting N-Body implementation, which is still single-threaded, takes 54 milliseconds to process 4096 bodies (40962=16777216 interactions) as opposed to 583 milliseconds for the naïve (presumably x87) implementation. I haven’t yet analyzed why it is 10x faster instead of 4x faster; I suspect the SSE reciprocal square root sequence is quite a bit faster than the scalar equivalent. Note, however, that recompiling the application with SSE2 optimizations enabled does not improve performance noticeably (not with Visual Studio, anyway).

Further performance opportunities exist in multithreading (it should scale almost-linearly in the number of cores) and porting to AVX, which should almost-double performance.

The not-yet-optimized CUDA implementation takes about 4.5 ms to do the same computation on a GK104: not 100x faster, but 10x faster and the code is much more readable. A side-by-side comparison of the two bodyBodyInteraction() routines illustrates the point made by Dr. Vincent Natoli, in his Kudos for CUDA article in hpcwire:

In a recent project we reduced 3,500 lines of highly-optimized C code to a CUDA kernel of about 800 lines. The optimized C was peppered with inline assembly, SSE macros, unrolled loops and special cases, making it difficult to read, extract algorithmic meaning and extend in the future. By comparison the CUDA code was cleaner and more readable. Ultimately it will be easier to maintain.

Which would you rather write?

T distSqr = dx*dx + dy*dy + dz*dz;

or

__m128 distSq = _mm_add_ps( _mm_add_ps( _mm_mul_ps( dx, dx ), _mm_mul_ps( dy, dy ) ), _mm_mul_ps( dz, dz ) );

?

The story’s still out on exactly how much of a performance advantage CUDA will have over a truly optimized CPU implementation of this application, but until vectorized compilers go mainstream, for this type of code, CUDA will have a big advantage in programmability.

Stream Callbacks

CUDA 5.0 adds a new feature called “stream callbacks,” a new mechanism for CPU/GPU synchronization. Previously, CPU/GPU synchronization was accomplished by calling functions like cuStreamSynchronize(), which returns when all preceding commands in the stream have been completed, or cuEventSynchronize(), which waits until the specified event has been recorded by the GPU.

I spent some time investigating how stream callbacks are implemented on Windows 7. cudaStreamAddCallback() specifies a callback that will be called once, when all preceding operations in the stream have completed. CUDA makes no representation about when this callback will be performed, or the order in which the callbacks will be called, except that callbacks in a given stream will be called in the same order they were specified to that stream.

It is hard to imagine how NVIDIA implemented stream callbacks without creating at least one extra CPU thread, something that we never did while I was working on CUDA. We always knew that we could do useful things by creating CPU threads without the developer’s knowledge or involvement, like asynchronous memcpy of pageable memory. But, we were reluctant to create threads without an opt-in or some other sign that the developer wanted extra threads running around in their process. Also, making the driver fully thread-safe was a very difficult problem, not solved until CUDA 4.0.

So I wrote an application to look at the number of active threads in the system (and note, “the number of active threads” is always subject to change in a multithreaded operating system – in Windows, you take a “snapshot” of the process, presumably implemented using copy-on-write semantics like fork() in old-time UNIX, and count the active threads in the snapshot) using code from this MSDN page.

I learned some interesting things. First of all, the CUDA runtime creates 2 threads at initialization time – my app reports 3 active threads after cudaFree(0) has been called. (I had to add a one-second sleep to my app to detect this, since snapshotting and counting the active threads immediately after calling cudaFree() still yielded a thread count of 1 – the application thread). Using the CUDA driver API, I confirmed that the CUDA driver does not create any extra threads at initialization time (cuInit()) but does create them along with the context (cuCtxCreate()).

Anyway, it seems the first time cudaStreamAddCallback() is called, yet another CPU thread is created for purposes of calling into stream callbacks. I instrumented my app to detect and report a change in the ID of the calling thread, and confirmed that only one thread is ever used for this purpose. (But it would seem unwise to rely on that assumption!)

It’s important not to make any assumptions about the threading model for stream callbacks. Make sure your code is thread-safe. The verbiage in the CUDA C Programming Guide doesn’t exactly lend confidence, either:

[C]allbacks must not use any CUDA APIs. They can, however, make blocking calls but should not create a transitive dependency on CUDA APIs that are not guaranteed to be ready before the callback. For example, waiting for a thread that is waiting on cudaStreamSynchronize() is as bad as calling it directly. Callbacks carry the same scheduling restrictions as other commands issued on streams.

They, in a given stream, execute in the order in which they are added. However, false dependencies between commands issued to different streams (refer to Streams) can introduce false dependencies between callbacks from different streams and between kernels and callbacks in different streams. Further, as the execution order of callbacks is not guaranteed, it is not safe to have synchronization in user code between callbacks.

“Blocking” callbacks are ones that suspend execution of the CUDA stream until they’ve returned.

So what are stream callbacks good for? Frankly, I am not yet sure. I don’t think they enable anything that wasn’t possible before – they seem to be implemented in terms of previously-available CUDA abstractions (a combination of blocking events and having the CUDA driver’s CPU thread alternate between waiting for a CUDA event and calling into the application’s callback). Maybe they’re intended to enable something on applications that use nested parallelism.

In the meantime, if stream callbacks provide a more convenient mechanism to implement your application, no one will fault you for using them.

Streaming Multiprocessors – updated!

A new version of the Streaming Multiprocessors chapter has been uploaded, this one with merged coverage of the math library for float and double, plus improved coverage of shared memory (especially shared memory atomics) and conditional code.

The code emitted by the compiler when performing shared memory atomics turns out to be the perfect illustration of how CUDA hardware handles conditional code. For the SM 2.0 architecture, a shared atomic add compiles to the following microcode (excerpted from Listing 8-2):

     /*0040*/     SSY 0x80;
     /*0048*/     BAR.RED.POPC RZ, RZ;
     /*0050*/     LD R0, [R0];
     /*0058*/     LDSLK P0, R2, [R3];
     /*0060*/     @P0 IADD R2, R2, R0;
     /*0068*/     @P0 STSUL [R3], R2;
     /*0070*/     @!P0 BRA 0x58;
     /*0078*/     NOP.S CC.T;

The SSY/NOP.S instructions bracket the divergence and convergence of the loop, which iterates until all threads in the warp have performed their atomic operation. Inside the loop, the LDSLK instruction attempts to lock the bank in shared memory. The actual atomic operation is predicated on whether the lock was acquired; conversely, the branch to reattempt the atomic operation is predicated so as to be taken if the lock was not acquired.

Enjoy the new version of the chapter, and stay tuned as we turn our attention to the other topics that the book will cover!