All posts by admin

Floating point: CPU and GPU differences

Some time ago, I wrote this in response to a StackOverflow question, but wanted to share here on the blog.

The question basically asked how you could make a floating point operation the same between the CPU and the GPU, and here is an updated version of the answer:

There are many reasons why it is not realistic to expect the same results from floating point computations run on the CPU and GPU. It’s much stronger than that: you can’t assume that FP results will be the same when the same source code is compiled against a different target architecture (e.g. x86 or x64) or with different optimization levels, either.

In fact, if your code is multithreaded and the FP operations are performed in different orders from one run to the next, then the EXACT SAME EXECUTABLE running on the EXACT SAME SYSTEM may produce slightly different results from one run to the next.

Some of the reasons include, but are not limited to:

  • floating point operations are not associative, so seemingly-benign reorderings (such as the race conditions from multithreading mentioned above) can change results;
  • different architectures support different levels of precision and rounding under different conditions (i.e. compiler flags, control word versus per instruction);
  • different compilers interpret the language standards differently, and
  • some architectures support FMAD (fused multiply-add) and some do not.

Note that for purposes of this discussion, the JIT compilers for CUDA (the magic that enables PTX code to be future-proof to GPU architectures that are not yet available) certainly should be expected to perturb FP results.

You have to write FP code that is robust despite the foregoing.

As I write this today, I believe that CUDA GPUs have a much better-designed architecture for floating point arithmetic than any contemporary CPU. GPUs include native IEEE standard (c. 2008) support for 16-bit floats and FMAD, have full-speed support for denormals, and enable rounding control on a per-instruction basis rather than control words whose settings have side effects on all FP instructions and are expensive to change.

In contrast, CPUs have an excess of per-thread state and poor performance except when using SIMD instructions, which mainstream compilers are terrible at exploiting for performance (since vectorizing scalar C code to take advantage of such instruction sets is much more difficult than building a compiler for a pseudo-scalar architecture such as CUDA). And if the wikipedia History page is to be believed, Intel and AMD appear to have completely botched the addition of FMAD support in a way that defies description.

You can find an excellent discussion of floating point precision and IEEE support in NVIDIA GPUs here.

Final Manuscript Submitted

I submitted the final manuscript last Friday, and thought I would reflect briefly on how it aligns with my original goals.

I’ve wanted write a book on CUDA for years. Until I left NVIDIA, I was just too busy building CUDA to work on it. So when it came time to write up a proposal, I’d been thinking about the subject matter and the organization for some time.

One of the exercises that authors undertake (and that editors demand as part of a proposal) is a competitive analysis: What books already exist that address the topic? How will yours be different? Why would someone buy your book instead of one of those other books? I knew I wanted my book to be comprehensive, covering topics like the driver API, all CUDA hardware, and all CUDA-capable platforms. I wanted it to cover both software abstractions (like streams and events) and how to write CUDA kernels. And as I weighed those aspirations and put together outlines and looked at the competitive landscape, it became clear to me: Writing a book on CUDA is hard.

When I said that to Ian Buck, he got an expression like a whipped puppy. “But why?” he asked. Hearing that anything about CUDA is hard upsets Ian because he believes the key to CUDA’s success has been simplicity. I told him, “Because in order to cover a topic correctly, I wind up explaining the same thing more than once, from different perspectives.” That is true, and is reflected in my book; but even books that cover only the CUDA runtime and the latest CUDA hardware can’t get away from the fundamental difficulty of CUDA programming: performance optimization is hard, and no one uses CUDA because it is cute and cuddly. People only ever use CUDA because it can run their applications faster. The reason CUDA programmers worry about performance bottlenecks and implementation details is because those considerations are weighed by developers engaged in low-level optimization, whether on GPUs or CPUs.

In looking at the landscape, I saw a lot of books that presented a lot of material on CUDA, but hadn’t imposed much of an organizational structure. So my book is organized into three parts. Part I gives overviews of the hardware, the software, and the operating environment; Part II (“Details”) gives in-depth descriptions of various CUDA abstractions, like memory, streams and events, and texturing; Part III (“Select Applications”) covers the full gamut of classes of CUDA application: streaming workloads (where PCI Express transfer overhead figures prominently), key parallel algorithms Reduction and Scan, an illustrative compute-intensive workload (N-body, the anti-streaming workload), and an image processing workload that combines texturing and shared memory for performance.

The source code accompanying Part II consists almost entirely of microdemos and microbenchmarks, while the source code accompanying Part III consists of several implementations of the same exact operation, with different tradeoffs in performance and complexity. Some of the microbenchmarks in Part II are intended to be reused (plug your own kernel into a makework kernel, for example), while reuse of code from Part III may be trickier. I did what I could, but reuse of application-specific code is intrinsically more difficult.

The code in the Streaming Workloads chapter can definitely intended be reused – just replace the SAXPY kernel with a different kernel, the more compute-intensive the better. Even calculations as complex as Black-Scholes options computation are transfer-bound on CUDA hardware*.

With Scan, one problem that hinders reuse is that NVIDIA has added numerous primitives that make Scan more efficient (like syncthreads_count() and warp shuffle). For a book that aspires to cover all CUDA hardware (all the way back to the seminal SM 1.x, “Tesla” hardware), that presents a challenge. For another, Scan has so many variants (inclusive/exclusive, segmented scan, predicate scan) that covering all permutations just of the basic primitive requires a lot of space, without covering any applications like Radix Sort or stream compaction. And if you try to cover all those permutations in the source code, you wind up with code that bears a disturbingly close resemblance to Thrust, except that Thrust was written by NVIDIA employees with a much more intimate understanding of modern C++ programming idioms than yours truly.

I will say this: The Scan chapter does a better job of covering the topic than any other book I have seen. I could have covered more, but then again, you could devote 100+ pages to the topic without covering everything.

For N-body, the fastest implementation just stages tiles into shared memory to make the data available to the SMs with lower latency; but after reviewing the literature on computationally-dense workloads, I saw an opportunity to broaden coverage beyond that. For one thing, some applications need shared memory for other purposes than a read-only, software-managed cache. The Direct Coulomb Summation code, described by Stone et al. and that is covered in detail in Kirk & Hwu’s Programming Massively Parallel Processors, stages read-only atom descriptions through constant memory, working around the 64K constant memory limit by doing the computation on 4,000 16-byte atoms at a time. So my book has an implementation that mirrors that strategy, even though it is slightly slower than the shared memory formulation.

A colleague pointed out that some N-body computations – in particular, the ones in the AMBER molecular modeling code – exploit symmetry of forces. So I spent some time exploring ways to take advantage of the symmetry of gravitational forces, without much success. One problem is that the gravitational computation is so lightweight that doing twice as many is faster than saving the work! AMBER does its calculations on 32×32 tiles (to correspond to CUDA’s warp size), and the source code on github includes an implementation that mirrors that strategy. It is sufficiently slower that I don’t even cover the source code in the book; as a compromise, I describe the strategy in the beginning of the chapter that gives an overview of the computation.

I’m cautiously optimistic that some method of exploiting symmetric forces will bear fruit, even for fairly lightweight calculations, but I wasn’t going to let that perfect be the enemy of the book’s done.

There are a couple of self-indulgent topics in the book: float->half conversion, normalized correlation, an SSE-optimized N-body implementation. But those were all included for good reasons. I still think the float->half conversion (which may seem out of place in the Streaming Multiprocessors chapter) is one of the best ways for developers to learn about floating point precision and rounding. Normalized correlation is the only application in the book that combines texturing and shared memory. And the SSE-optimized N-body implementation gives a stark illustration of how much easier CUDA is to program, while still yielding better performance. Even when biasing the results in favor of the CPU, N-body is still 8x faster on GK104 then on a dual-socket Xeon E2670 machine. (Porting to AVX might double performance on the CPU side, but by the same token, running on a GK110 might double performance on the GPU side, and using a faster GPU is a whole lot less software work than porting to AVX.) The reported result may not make NVIDIA happy (since I am reporting an 8x improvement, not 400x) and may not make Intel happy (since it underscores the difficulty of SIMD coding), but I think it puts CUDA in a positive light.

The book can be preordered on

* Black-Scholes was the workload that I used to prove out mapped pinned memory when we added it for the chipset team in CUDA 2.2, and we were pleasantly surprised to discover that GT200 also got a big benefit. (The architect who’d invested a lot of effort into making sure GT200 would be good at system memory rendering was gratified, but not surprised.)

On the Home Stretch…25 kLOC?

Wow. Has it really been two months since my last post? The calendar says yes.

I’ve been putting finishing touches on the manuscript, and as part of that exercise, I did a line count to see how much code we can say accompanies the book.

I was surprised to see it’s slightly more than 25,000 lines!

As a reminder, the source code resides in this Git repository.

The code frequently offers multiple versions of the same algorithm, but with this book, that’s the whole point. In some cases (e.g. the Reduction chapter), readers are walked through different versions of functionally-identical code, as part of the pedagogical exercise; in others (e.g. the N-body chapter), readers are encouraged to pick through the different versions and select the one that’s the closest fit with their application.

Line counts aren’t the best way to measure code complexity, but let no one say this book doesn’t offer much sample code for its readers.

N-Body Update: Multi-GPU

The first multi-GPU implementation of N-body is done, and it’s old school: using the same thread delegation code as the multithreaded CPU implementation, it uses a separate CPU thread for each GPU.

For workloads like N-body, that’s probably not necessary – N-body is so GPU-bound that the CPU is more of a traffic cop than an active contributor to the computation – but in a world where 4-core CPUs are common and 16-core CPUs available, it seems terribly wasteful to drive multiple GPUs from a single core. The single-threaded multi-GPU implementation is forthcoming, so we’ll know soon whether it’s a win on this particular app.

The key element of multi-GPU programming is portable pinned memory. We added this feature in CUDA 2.2, along with a host of other pinned memory-related features (mainly mapped pinned memory, but also the ability to allocate write-combining pinned memory). Unlike previous versions of CUDA, where only the context that called cudaMallocHost() could remember that the allocation was pinned, in CUDA 2.2, portable host memory allocations are available to all contexts. As a result, the application realizes the benefits of pinned memory (say, that it can be specified to cudaMemcpyAsync()) for every CUDA context – and every GPU – in the system.

Adding portable pinned memory and mapped pinned memory in the same release forced us to resolve some API design problems that might have been difficult, if we had shipped one feature and not the other. cudaHostAlloc() can perform a portable allocation, and optionally also map that memory into the CUDA address space(s) of the GPU(s) in the system. There was a strong temptation to pass back both the host and device pointer in that one API call, but at the time we could not guarantee that every GPU in a multi-GPU system would get the same GPU address. So we separated the request that host memory be mapped (cudaHostAlloc()) from the query to obtain the device pointer corresponding to a given host allocation (cudaHostGetDevicePointer()).

With CUDA 4.0, unified virtual addressing does guarantee that the device and host pointers are all the same – but when CUDA 2.2 was being designed, the GPUs did not support 64-bit addressing and, in any case, even today not all operating systems support UVA!

CUDA 4.0 also simplified multi-GPU programming in another way: it enabled CUDA runtime applications to drive multiple GPUs from a single CPU thread*. Before CUDA 4.0, the only way to drive multiple GPUs was to spawn a thread per GPU. cudaSetDevice() had to be called before any CUDA initialization had occurred and, once it was called, that CPU thread henceforce could operate that device and only that device. This awkward API semantic – reminiscent of OpenGL and its “current context” – was definitely not the end product we wanted, but it was extremely challenging to make the CUDA driver thread-safe without incurring performance penalties due to excessive contention on the needed mutexes.

But after the driver was made thread-safe, it became possible for the cudaSetDevice() call to implement the semantic that developers tended to expect: after a cudaSetDevice() call, the corresponding CUDA context is made current to that thread and subsequent CUDA API calls and kernel launches are performed on the GPU corresponding to that context.

It will not take long to get the single-threaded multi-GPU implementation of N-body going, and that will conclude the multi-GPU (Chapter 9) coverage in the book. (UPDATE: the single-threaded implementation that uses cudaSetDevice() is done, and performance testing confirms that the performance difference on this workload is negligible.)

* CUDA 2.2 added the “current context” stack that could be manipulated with the driver API’s cuCtxPopCurrent()/cuCtxPushCurrent(), but few developers use the driver API.

N-Body: Multithreaded SSE

The multithreaded variant of SSE N-body is complete, and I’ve had the opportunity gather some timing information.

Three variants of N-body were timed: single-threaded SSE, multithreaded SSE, and the shared memory (fastest so far) GPU formulation.

Three platforms were tested, two of them on Amazon EC2:

  • cg1.4xlarge (2xXeon 5570 “Nehalem” with 2xTesla M2050 GPUs)
  • cc2.8xlarge (2xXeon 2670 “Sandy Bridge”)
  • GeForce GTX 680.

If I thought operating system mattered, I wouldn’t put these timings next to one another – the EC2 instances are running Amazon Linux but my GeForce GTX 680 is running Windows 7. With that caveat, the timings for N=4096, N=8192, N=16384 are as follows:

N cg1 (SSE) cc2 (SSE) cg1 (MT) cc2 (MT) cg1 (GPU) GK104
4096 36.24 40.31 4.78 2.79 1.35 0.66
8192 144.50 161.32 20.13 9.45 3.83 1.60
16384 576.75 649.83 72.78 39.46 12.66 6.05

Times are in milliseconds. The only difference between the SSE and MT timings is that the MT timing are done with multiple threads.

The first thing that struck me about the timing is that on a single-threaded SSE workload, cc2.8xlarge is slower than cg1.4xlarge! This result is surprising since Intel generally takes performance compatibility very seriously – they have a business incentive to deliver higher performance on old workloads, since it removes the need for customers to update their software to benefit from newer hardware – but can be explained if Intel delivered the same single-threaded performance clock-for-clock: the clock rate of the CPUs in cc2.8xlarge is 10% slower.

So the bulk of the benefit of cc2.8xlarge over cg1.4xlarge (or its GPU-less equivalent, cc1.4xlarge) is from an increased core count*. And on that front, Sandy Bridge delivers: scaling is excellent in the multithreaded case, with speedups of 14.4-17x on cc2.8xlarge versus 7.2-7.9x on cg1.4xlarge. We could double the core count again and probably still see excellent scaling.

No surprise that Intel expects us to port to AVX to get the full benefit of Sandy Bridge. But even if that delivers the expected doubling of performance, the multithreaded version would only be within 3.26x of the 16K bodies case. NVIDIA has their own doubling of performance in the offing, in the form of GK110.

And, I haven’t pushed the multi-GPU version yet. That will scale nicely in the number of GPUs.

* The number of threads selected is based on the number of cores available – in Linux, it is implemented in chThread.h as follows:

inline unsigned int
    return sysconf( _SC_NPROCESSORS_ONLN );

The values are 16 and 8 for cc2.8xlarge and cg1.4xlarge, respectively.

N-Body: FP Atomics v. Recomputation

CUDA developers who are recent refugees from the land of CPU programming often have to learn the hard way that sensible CPU optimizations don’t always work well on GPUs. Lookup tables, for example, are to be avoided. (Okay, not the best example since it is also true on modern CPUs.) Here’s a better one: registers are so precious that it’s often better to recompute results than to store them. GPUs have brought brute force back into style!

But even experienced GPU coders need the occasional reminder that their sheer computational power often can overwhelm the benefits of small-scale optimizations. This week, as I continued developing the N-body chapter, I wanted to try out the idea of taking advantage of the fact that gravitational forces are reciprocal in character: for any two bodies i and jfji = − fij. By applying this optimization, only half as many of the expensive inner-loop body-body computations (estimated at 20 FLOPS) need to be performed.

The original N-body paper dismisses the idea in a footnote: “…this optimization has an adverse effect on parallel evaluation strategies (especially with small N), so it is not employed in our implementation.”

At first blush, subtracting the 3D force from the correct force vector (3 FLOPS) seems a lot cheaper than the computation needed for a body-body interaction: taking a 3D vector difference, computing a dot product and reciprocal square root, and doing a number of multiply-adds. But there is a problem: another thread is concurrently computing the acceleration sum from which the body-body gravitational force values must be subtracted. Some form of synchronization is needed.

To address that problem, the Kepler architecture gives us a tool that wasn’t available when the original N-body paper was written: fast atomics. (Fermi-class hardware added the ability to do atomic floating point add, but it was prohibitively slow until Kepler.) Multiple threads can using atomic add on the same memory locations and the hardware will enforce mutual exclusion to prevent race conditions from corrupting the result. Using atomic add, a few small modifications enable the inner loop to perform half as many body-body interactions and add the negative force to the correct sum:

 <        for ( int j = 0; j < N; j++ ) {
 >        for ( int j = 0; j < i; j++ ) {
             float4 body = ((float4 *) posMass)[j];
             T ax, ay, az;
             bodyBodyInteraction( ax, ay, az, myX, myY, myZ, body.x, body.y, body.z, body.w, softeningSquared);
             acc[0] += ax; acc[1] += ay; acc[2] += az;
 >            float *f = &force[3*j+0];
 >            atomicAdd( f+0, -ax );
 >            atomicAdd( f+1, -ay );
 >            atomicAdd( f+2, -az );

By the way, the address arithmetic is written that way because the compiler generates much better code than if you write the equivalent (and more readable):

             atomicAdd( &force[3*j+0], -ax );
             atomicAdd( &force[3*j+1], -ay );
             atomicAdd( &force[3*j+2], -az );

A subtle benefit of this approach is that the atomics hardware complements the floating point hardware in the Streaming Multiprocessors. Unfortunately though, I can report that the atomics-based implementation isn’t any faster on GK104: to process 8192 bodies (about 67M interactions), the three implementations (atomics, naive, shared) perform as follows:

Atomics: 90 ms
Naive: 62 ms
Shared: 55 ms

This negative result can be explained if you cast it in the light of GPUs’ predilection for brute force: given a choice between 3 FLOPS and 3 atomic adds to disparate memory locations, or 20 FLOPS done on uniformly-referenced, read-only memory, the brute force 20-FLOP solution still wins.

Lesson learned.

It is possible that the global memory references by this kernel are not ideal (especially if it’s camping on a subset of the hardware that implements atomics), and it might be faster to conserve external bandwidth, perhaps taking inspiration from a tiled matrix transpose implementation – but the optimized implementation would have to outperform a shared memory implementation that’s almost twice as fast.

As with all CUDA handbook source code, this kernel is open source and available on github.

Multithreaded SSE Timings

The N-body code has been posted to github; it has not yet been ported to Linux, but that will not be difficult to do. It does not use graphics, it is just a Win32 console application that responds to keyboard input (ironically, other than porting the threading code, the keyboard support is the hard part about the Linux port of this app).

So far there are 8 formulations:

1) CPU_AOS (array-of-structures implementation – the gold standard to which we compare other implementations),
2) CPU_SOA (structure-of-arrays implementation) – a stepping stone to the SSE implementation (and surprisingly faster than the AOS version),
3) CPU_SSE (SSE implementation) – I blogged about this a few weeks ago,
4) CPU_SSE_threaded( multithreaded SSE implementation) – multithreaded SSE implementation that uses N threads on an N-core processor,
5) GPU_AOS – a straight port of the CPU_AOS code,
6) GPU_Shared –  GPU_AOS, optimized to use shared memory per the original CUDA N-body paper,
7) GPU_Atomic – a Kepler-specific version that uses atomics to do half as many body-body interaction computations, (but isn’t faster – I plan to blog about that later),
8) GPU_Shuffle – a Kepler-specific version that uses the new warp shuffle instruction instead of shared memory.

Here are some initial timing results for 16K bodies:

CPU_AOS – 4494 ms (60.0 Minteractions/s)
CPU_SOA – 3210 ms (83.6 Minteractions/s)
CPU_SSE – 612.6 ms (438 Minteractions/s)
CPU_SSE_threaded – 192 ms (1400 Minteractions/s)
GPU_AOS – 62 ms (4300 Minteractions/s)
GPU_Atomic – 93 ms (2886 Minteractions/s)
GPU_Shared – 55 ms (4881 Minteractions/s)
GPU_Shuffle – 58.7 ms (4573 Minteractions/s)

The fastest GPU performance is about 3.5 faster than the multithreaded SSE implementation. Another way to look at it: the SSE implementation will run on any CPU built since 2001 or so, and it’s only a factor of 3.5 slower. Then again, the code is super-gnarly compared to the CUDA code, especially the threading code. (Don’t take my word for it – take a look for yourself.) So here again, CUDA “wins” as much for programmability as for the performance win.

The interesting thing is that this data is gathered on a rather old CPU (a 2.8 GHz Intel i7 “Nehalem”), and the GPU is a relatively recent GK104. Once I get the code running on a contemporary CPU (say, a Sandy Bridge) I’ll report timings on that platform.

I did take a look at an AVX port, but that seemed even more off-topic than the SSE port, which was a stretch. That project will have to wait until the book’s done.

Next up: a Linux port and multi-GPU support.

Rough Cuts

I have updated the Web site’s Sample Chapters section to point to The CUDA Handbook‘s page in Safari Books Online’s Rough Cuts. Rough Cuts gives early access to books while they are still in progress – 3-6 months from publication, and includes tools for readers to give feedback to the author while there’s still time to make changes.

Several chapters are still missing, but they will be uploaded soon.

I’m still interested in reviewers who would be willing to take a look at sample chapters and submit feedback. Write me at

N-Body: CUDA Versus Old-School SSE

This post will discuss SIMD instruction sets, peak floating point performance for CPUs, programming models, and ease-of-use, as applied to the flagship N-body application for CUDA.

My involvement in SIMD instruction sets dates back to the mid-1990s, when x86 vendors were adding MMX (all vendors), SSE (Intel) and 3DNow! (AMD, Cyrix et al.) to the x86 architecture. These SIMD instruction sets reflect a trend that had been developing in CPUs for some time: that because most of the die area in a CPU is dedicated to cache, building more-capable execution units incurs design and validation costs, but does not greatly increase manufacturing costs.

I was the development lead for Direct3D 6.0, and one of our goals was not only to optimize the geometry pipeline, but to incorporate SSE and 3DNow! optimizations from Intel and AMD, respectively. By closely collaborating with the CPU vendors, we gave developers a strong incentive to use Direct3D’s built-in geometry pipeline, and the CPU vendors got an easy way to deliver the benefits of their new instructions to end users.

SSE (Streaming SIMD Extensions) and 3DNow! are SIMD (single instruction, multiple data) instruction sets that can perform multiple floating point operations in a single instruction. SSE added a new set of 128-bit registers, plus instructions that could perform 4×32-bit floating point operations per instruction. (I am not going to further describe 3DNow! because all x86 vendors implemented SSE many years ago.)

Programming languages do not contain native language support for this type of packed data, so when they designed SSE, the x86 vendors were relying on compiler developers to build vectorizing compilers that could automatically identify parallelism opportunities in scalar code and compile it into high-quality, SIMD-optimized code. As a stopgap, they worked with compiler vendors to meet developers halfway and identify special data types and “intrinsics” (special functions that correspond to machine instructions) to operate on them. Developers using intrinsics must be intimately familiar with the instruction set, but the tasks of register allocation or instruction scheduling are offloaded onto the compiler. I don’t think intrinsics were ever intended to be anything other than a stopgap, but unfortunately, almost 15 years later, it looks like they are still the easiest way to write high-performance floating point code.

When I decided to focus on gravitational simulation for my N-body chapter, I also decided to see how fast a modern CPU can perform this computation. Many CUDA-versus-CPU performance comparisons are heavily biased toward CUDA, by dint of the CPU code being poorly optimized.

NVIDIA’s CPU implementation of N-body, used as a reference for correctness, is no exception. Unless a smart vectorizing compiler can translate the C code, the CPU reference implementation makes no use of the SSE or AVX instruction sets. That’s about a 4x (for SSE) or 8x (for AVX) performance disadvantage.

The body-body interaction code (which NVIDIA estimates to be 20 FLOPS) is as follows:

template <typename T>
__host__ __device__ void bodyBodyInteraction(
    T accel[3],
    T x0, T y0, T z0, T mass0,
    T x1, T y1, T z1, T mass1,
    T softeningSquared)
    T dx = x1 - x0;
    T dy = y1 - y0;
    T dz = z1 - z0;
    T distSqr = dx*dx + dy*dy + dz*dz;
    distSqr += softeningSquared;
    T invDist = (T)1.0 / (T)sqrt((double)distSqr);
    T invDistCube =  invDist * invDist * invDist;
    T s = mass1 * invDistCube;
    accel[0] += dx * s;
    accel[1] += dy * s;
    accel[2] += dz * s;

This function (templated to provide for both float and double precision) is invoked between every pair of bodies. For each body, the gravitational influence of every other body is added up, then these forces are integrated to change the bodies’ positions and velocities for the next timestep.

Now, if you think a bit about how to accelerate this function using a SIMD instruction set, you realize that what you really want to do is execute it four times in parallel: instead of

T dx = x1 - x0;

you want to write:

__m128 dx = _mm_sub_ps( x1, x0 );

where four bodies’ X positions are subtracted from your body’s X position, and so on. But the memory layout used by NVIDIA’s sample isn’t amenable to this formulation: it must be modified to use an SOA (structure of arrays) representation instead of the default AOS (array of structures) representation. In the case of N-body, where each body has a 3D position (x, y, z) and a mass, the N bodies can be represented as:

typedef struct _body {
    float x, y, z;
    float mass;
} body;

An array of structures (AOS):

struct _body bodies[N];

then contains the input data. The structure of arrays (SOA) representation splits each member of the structure into its own array:

float *bodiesX;
float *bodiesY;
float *bodiesZ;
float *bodiesMass;

It’s easy enough to transform between the two representations such that these identities hold true:

bodies[i].x == bodiesX[i]
bodies[i].y == bodiesY[i]
bodies[i].z == bodiesZ[i]
bodies[i].mass == bodiesMass[i]

Since programmers think of the X/Y/Z/Mass tuples together, the AOS representation seems more intuitive. But for SSE and AVX, the advantage of the SOA representation quickly becomes obvious; since instructions such as SUBPS (subtract packed single-precision float) operate on corresponding floats in the 128-bit registers, it works much better if the input data is streaming into the CPU in the form of packed X’s, Y’s and Z’s as opposed to structures that have to be rearranged in registers. Intel, in fact, was actively evangelizing SOA representations in the early days of SSE, since it enabled 4×4 matrix multiplication of vertices (needed for the geometry pipeline) without rearranging them. As an additional bonus, if the vertices were 3D, there was no need to deal with the inconvenient 12-byte vertex size. Truly general-purpose applications require extra work due to SSE’s alignment constraints (it is fastest to load and store on 16-byte boundaries), and its predilection for doing pretty much 4 of anything at a time necessitates careful handling of boundary conditions.

Once the data is rearranged as SOA, you can reformulate bodybodyInteraction() using SSE intrinsics to operate on four partial sums (gravitational force contributions to a given body) as follows:

inline void
    __m128& a0, __m128& a1, __m128& a2,
    const __m128& x0, const __m128& y0, const __m128& z0,
    const __m128& x1, const __m128& y1, const __m128& z1, const __m128& mass1,
    const __m128& softeningSquared)
    __m128 dx = _mm_sub_ps( x1, x0 );
    __m128 dy = _mm_sub_ps( y1, y0 );
    __m128 dz = _mm_sub_ps( z1, z0 );
    __m128 distSq = _mm_add_ps( _mm_add_ps( _mm_mul_ps( dx, dx ), _mm_mul_ps( dy, dy ) ), _mm_mul_ps( dz, dz ) );
    distSq = _mm_add_ps( distSq, softeningSquared );
    __m128 invDist = rcp_sqrt_nr_ps( distSq );
    __m128 invDistCube = _mm_mul_ps( invDist, _mm_mul_ps( invDist, invDist ) );
    __m128 s = _mm_mul_ps( mass1, invDistCube );
    a0 = _mm_add_ps( a0, _mm_mul_ps( dx, s ) );
    a1 = _mm_add_ps( a1, _mm_mul_ps( dy, s ) );
    a2 = _mm_add_ps( a2, _mm_mul_ps( dz, s ) );

This routine uses the <xmmintrin.h> header file (available both on Linux for gcc and Windows via MSVC), which declares the __m128 data type and the intrinsic functions that correspond to the SSE instruction set.

This routine uses two subroutines, both of which I was disappointed not to find in <xmmintrin.h> (in fact, <xmmintrin.h> seems not to have changed since 1998 or so when it was introduced): a subroutine to compute the sum of the four 32-bit floats in an SSE register, and a subroutine to compute the reciprocal square root. Because the RSQRTPS instruction only returns a 12-bit approximation, the rcp_sqrt_nr_ps() function, adapted from here, performs one Newton-Raphson iteration to achieve almost-full single precision on the four 32-bit floats in the SSE register. NOTE TO CPU ARCHITECTS: the SFU (“special function unit) in CUDA hardware implements a much higher-precision (22-23 bits) approximation to reciprocal square root that often can be used directly. The instruction sequence generated by rcp_sqrt_nr_ps() is 6 FLOPS but as many as 10-12 instructions (depending on how the constants are handled).

static inline __m128
rcp_sqrt_nr_ps(const __m128 x)
    const __m128
        nr      = _mm_rsqrt_ps(x),
        muls    = _mm_mul_ps(_mm_mul_ps(nr, nr), x),
        beta    = _mm_mul_ps(_mm_set_ps1(0.5f), nr),
        gamma   = _mm_sub_ps(_mm_set_ps1(3.0f), muls);
    return _mm_mul_ps(beta, gamma);

The other subroutine, adapted from a stackoverflow answer, computes the “horizontal” sum of the four 32-bit floats in an SSE register:

static inline __m128
horizontal_sum_ps( const __m128 x )
    const __m128 t = _mm_add_ps(x, _mm_movehl_ps(x, x));
    return _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));

I don’t know about you guys, but I don’t think this intrinsics-based SSE code comes close to approaching the readability of the CUDA code, which “looks” scalar and is templatized to support double precision.

It does run fast, however; the resulting N-Body implementation, which is still single-threaded, takes 54 milliseconds to process 4096 bodies (40962=16777216 interactions) as opposed to 583 milliseconds for the naïve (presumably x87) implementation. I haven’t yet analyzed why it is 10x faster instead of 4x faster; I suspect the SSE reciprocal square root sequence is quite a bit faster than the scalar equivalent. Note, however, that recompiling the application with SSE2 optimizations enabled does not improve performance noticeably (not with Visual Studio, anyway).

Further performance opportunities exist in multithreading (it should scale almost-linearly in the number of cores) and porting to AVX, which should almost-double performance.

The not-yet-optimized CUDA implementation takes about 4.5 ms to do the same computation on a GK104: not 100x faster, but 10x faster and the code is much more readable. A side-by-side comparison of the two bodyBodyInteraction() routines illustrates the point made by Dr. Vincent Natoli, in his Kudos for CUDA article in hpcwire:

In a recent project we reduced 3,500 lines of highly-optimized C code to a CUDA kernel of about 800 lines. The optimized C was peppered with inline assembly, SSE macros, unrolled loops and special cases, making it difficult to read, extract algorithmic meaning and extend in the future. By comparison the CUDA code was cleaner and more readable. Ultimately it will be easier to maintain.

Which would you rather write?

T distSqr = dx*dx + dy*dy + dz*dz;


__m128 distSq = _mm_add_ps( _mm_add_ps( _mm_mul_ps( dx, dx ), _mm_mul_ps( dy, dy ) ), _mm_mul_ps( dz, dz ) );


The story’s still out on exactly how much of a performance advantage CUDA will have over a truly optimized CPU implementation of this application, but until vectorized compilers go mainstream, for this type of code, CUDA will have a big advantage in programmability.

Stream Callbacks

CUDA 5.0 adds a new feature called “stream callbacks,” a new mechanism for CPU/GPU synchronization. Previously, CPU/GPU synchronization was accomplished by calling functions like cuStreamSynchronize(), which returns when all preceding commands in the stream have been completed, or cuEventSynchronize(), which waits until the specified event has been recorded by the GPU.

I spent some time investigating how stream callbacks are implemented on Windows 7. cudaStreamAddCallback() specifies a callback that will be called once, when all preceding operations in the stream have completed. CUDA makes no representation about when this callback will be performed, or the order in which the callbacks will be called, except that callbacks in a given stream will be called in the same order they were specified to that stream.

It is hard to imagine how NVIDIA implemented stream callbacks without creating at least one extra CPU thread, something that we never did while I was working on CUDA. We always knew that we could do useful things by creating CPU threads without the developer’s knowledge or involvement, like asynchronous memcpy of pageable memory. But, we were reluctant to create threads without an opt-in or some other sign that the developer wanted extra threads running around in their process. Also, making the driver fully thread-safe was a very difficult problem, not solved until CUDA 4.0.

So I wrote an application to look at the number of active threads in the system (and note, “the number of active threads” is always subject to change in a multithreaded operating system – in Windows, you take a “snapshot” of the process, presumably implemented using copy-on-write semantics like fork() in old-time UNIX, and count the active threads in the snapshot) using code from this MSDN page.

I learned some interesting things. First of all, the CUDA runtime creates 2 threads at initialization time – my app reports 3 active threads after cudaFree(0) has been called. (I had to add a one-second sleep to my app to detect this, since snapshotting and counting the active threads immediately after calling cudaFree() still yielded a thread count of 1 – the application thread). Using the CUDA driver API, I confirmed that the CUDA driver does not create any extra threads at initialization time (cuInit()) but does create them along with the context (cuCtxCreate()).

Anyway, it seems the first time cudaStreamAddCallback() is called, yet another CPU thread is created for purposes of calling into stream callbacks. I instrumented my app to detect and report a change in the ID of the calling thread, and confirmed that only one thread is ever used for this purpose. (But it would seem unwise to rely on that assumption!)

It’s important not to make any assumptions about the threading model for stream callbacks. Make sure your code is thread-safe. The verbiage in the CUDA C Programming Guide doesn’t exactly lend confidence, either:

[C]allbacks must not use any CUDA APIs. They can, however, make blocking calls but should not create a transitive dependency on CUDA APIs that are not guaranteed to be ready before the callback. For example, waiting for a thread that is waiting on cudaStreamSynchronize() is as bad as calling it directly. Callbacks carry the same scheduling restrictions as other commands issued on streams.

They, in a given stream, execute in the order in which they are added. However, false dependencies between commands issued to different streams (refer to Streams) can introduce false dependencies between callbacks from different streams and between kernels and callbacks in different streams. Further, as the execution order of callbacks is not guaranteed, it is not safe to have synchronization in user code between callbacks.

“Blocking” callbacks are ones that suspend execution of the CUDA stream until they’ve returned.

So what are stream callbacks good for? Frankly, I am not yet sure. I don’t think they enable anything that wasn’t possible before – they seem to be implemented in terms of previously-available CUDA abstractions (a combination of blocking events and having the CUDA driver’s CPU thread alternate between waiting for a CUDA event and calling into the application’s callback). Maybe they’re intended to enable something on applications that use nested parallelism.

In the meantime, if stream callbacks provide a more convenient mechanism to implement your application, no one will fault you for using them.