Warp Synchrony and The First Law of CUDA Development

One of the most overlooked developments of GTC 2017 was that NVIDIA’s Architecture Team has finally Had It Up To Here with developers who write warp synchronous code. As you may know, warp synchronous code relies on the way CUDA hardware executes 32-thread warps in lockstep. The CUDA Handbook contains some examples of warp synchronous code. In the reduction chapter, for example, warp synchronous code is used to optimize performance of the last 5 iterations of this loop, which accumulates partial sums in shared memory:

for ( int activeThreads = blockDim.x>>1;
      activeThreads;
      activeThreads >>= 1 ) {
    if ( tid < activeThreads ) {
        sPartials[tid] += sPartials[tid+activeThreads];
    }
    __syncthreads();
}

Notice that every iteration of the loop is accompanied by a call to __syncthreads(), the intrinsic that serves as a block-level synchronization primitive and memory barrier. The unrolled, warp synchronous implementation of the last 5 iterations looks like this:

if ( threadIdx.x < 32 ) {
    volatile int *wsSum = sPartials;
    if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32];
    wsSum[tid] += wsSum[tid + 16];
    wsSum[tid] += wsSum[tid + 8];
    wsSum[tid] += wsSum[tid + 4];
    wsSum[tid] += wsSum[tid + 2];
    wsSum[tid] += wsSum[tid + 1];
    if ( tid == 0 ) {
        out[blockIdx.x] = wsSum[0];
    }
}

The volatile keyword represents NVIDIA’s grudging acceptance of warp synchronous code. Historically, volatile is a keyword that hints to the compiler not to optimize out memory traffic through the associated pointer. The classic application is for device drivers for hardware with memory-mapped hardware registers, where reads and writes to “memory” are used to program the hardware. But volatile doesn’t give the compiler enough information; although it inhibits optimizations such as reusing registers or conserving memory writes, it’s not expressive enough to capture the synchronization semantics required when threads within a warp can diverge.

As a result, with Volta’s improved support for divergent code execution, NVIDIA is giving up on the volatile keyword workaround and deprecating the existing warp-level primitives. Instead, developers are encouraged to use new intrinsics with “_sync” appended. So instead of calling __any(), the intrinsic that returns true if the input predicate is true for any of the 32 threads in the warp, we are to call __any_sync(), which also takes a mask specifying the participating threads. The new functions may be invoked on older hardware, and I suspect there they are synonyms for the older functions; but on Volta, they likely will enforce semantics that reconverge execution across the warp.
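To make the migration concrete, here is a minimal sketch (my own, not code from the book) of the warp-synchronous tail of the reduction above rewritten with __shfl_down_sync(); the 0xffffffff mask asserts that all 32 lanes participate:

if ( threadIdx.x < 32 ) {
    int sum = sPartials[tid];
    if ( blockDim.x > 32 ) sum += sPartials[tid + 32];
    // Tree reduction within the warp: no volatile, no shared memory traffic.
    for ( int offset = 16; offset > 0; offset >>= 1 ) {
        sum += __shfl_down_sync( 0xffffffff, sum, offset );
    }
    if ( tid == 0 ) {
        out[blockIdx.x] = sum;
    }
}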

After listening to the presentation at GTC, I sought out an NVIDIAn and told them that CUDA developers have always known that warp synchronous coding wasn’t strictly correct. NVIDIA has been finger-wagging at CUDA developers who write warp synchronous code for years! To gain some insight into why developers do it anyway, we turn our attention to a completely unscientific survey of developers in which they were asked why they write CUDA code:

Figure 1. Motivations for CUDA Development

I call this the First Law of CUDA Development: Performance is CUDA’s raison d’être. No one writes CUDA code for fun. Every CUDA user is trying to get a return on investment in the form of higher application performance. The reason developers write warp synchronous code even though it’s the “wrong” thing to do is because it is faster. Put another way, sprinkling __syncthreads() calls that turn out to be superfluous is… well… slower. (A subtler implication is that if the behavior does not change, it is harder for developers to tell which __syncthreads() calls are superfluous). Developers always want to do the right thing, I told the NVIDIAn; but ultimately, if you want developers doing the right thing, you have to make the right thing also be the fastest thing.

During the course of the conversation, the NVIDIAn defended the idea that they should break warp synchronous code in the future: “If I warn you to look both ways before you cross the road, don’t blame me if you get hit by a car.” I told him: “If that is your position, it’s your responsibility to make sure that developers who don’t look both ways ALWAYS get hit by a car.”

Why does CUDA CUdeviceptr use unsigned int instead of void *?

This question on StackExchange was put on hold as primarily opinion-based: “…answers to this question will tend to be almost entirely based on opinions, rather than facts, references, or specific expertise.”

The content of StackExchange is usually high quality, but in this case, while the design decision was based on opinion, the answer to the question needn’t be… you just need to ask the people who know! And the inimitable talonmies, who is poised to crack 30k on StackExchange’s points-based reputation system, compounded the problem by saying that CUdeviceptr is a handle to a device memory allocation, not a pointer.

I don’t think I have ever seen talonmies give an incorrect answer before; but in this case, he’s off the mark. CUdeviceptr has always represented a pointer in the CUDA address space. In fact, though it was frowned upon to mix driver API and CUDA runtime code, even in CUDA 1.0 you could convert between CUDART’s void * and the driver API’s CUdeviceptr by writing something like:

void *p;
CUdeviceptr dptr;
p = (void *) (uintptr_t) dptr;
dptr = (CUdeviceptr) (uintptr_t) p;

We could have made device pointers void *, but there was a desire to make it easy for compilers to distinguish between host and device pointers at compile time instead of runtime. Furthermore, SM 1.x hardware only supported 32-bit pointers, so using void * would have created a difference in pointer size on 64-bit host platforms. It’s a long-distant memory now, since so much great compiler work has gone into CUDA since then, but at the time “pointer-squashing” (having CUDA’s compiler transform 64-bit pointers into 32-bit pointers on 64-bit host systems) was a big issue in early versions of CUDA.

For the record, not making the driver API’s device pointer type void * is one of my bigger regrets about early CUDA development. It took months to refactor the driver to support 64-bit device pointers when hardware support for that feature became available in SM 2.x class hardware.

In fact, some weeks before we released CUDA 1.0, we had a meeting and a serious discussion about replacing CUdeviceptr with void *, and decided not to take the schedule hit. We weren’t going to let perfect be the enemy of done, and we paid the price later.

While we’re on the topic of regrettable design decisions in early CUDA, I wish I had done a search-and-replace to convert cuFunction to cuKernel, and put cuLaunchKernel in the first release (in place of the stateful, chatty and not-thread-safe cuParamSet* family of functions). But we had scant engineering resources to spend on fit and finish, a constraint that is no less true for CUDA than for many other successful software projects in history.

Floating point: CPU and GPU differences

Some time ago, I wrote this in response to a StackOverflow question, but wanted to share here on the blog.

The question basically asked how to get floating point computations to produce the same results on the CPU and the GPU, and here is an updated version of the answer:

There are many reasons why it is not realistic to expect the same results from floating point computations run on the CPU and GPU. It’s much stronger than that: you can’t assume that FP results will be the same when the same source code is compiled against a different target architecture (e.g. x86 or x64) or with different optimization levels, either.

In fact, if your code is multithreaded and the FP operations are performed in different orders from one run to the next, then the EXACT SAME EXECUTABLE running on the EXACT SAME SYSTEM may produce slightly different results from one run to the next.

Some of the reasons include, but are not limited to:

  • floating point operations are not associative, so seemingly-benign reorderings (such as the race conditions from multithreading mentioned above) can change results;
  • different architectures support different levels of precision and rounding under different conditions (e.g. compiler flags, control word versus per-instruction rounding);
  • different compilers interpret the language standards differently, and
  • some architectures support FMAD (fused multiply-add) and some do not.

Note that for purposes of this discussion, the JIT compilers for CUDA (the magic that enables PTX code to be future-proof to GPU architectures that are not yet available) certainly should be expected to perturb FP results.

You have to write FP code that is robust despite the foregoing.
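As a quick illustration of the non-associativity point above (a minimal sketch of mine, not part of the original answer), the order of operations alone changes the result in single precision:

#include <stdio.h>

int main()
{
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    float left  = (a + b) + c;   // 0.0f + 1.0f == 1.0f
    float right = a + (b + c);   // b + c rounds back to -1.0e8f, so the sum is 0.0f
    printf( "(a+b)+c = %f, a+(b+c) = %f\n", left, right );
    return 0;
}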

As I write this today, I believe that CUDA GPUs have a much better-designed architecture for floating point arithmetic than any contemporary CPU. GPUs include native IEEE standard (c. 2008) support for 16-bit floats and FMAD, have full-speed support for denormals, and enable rounding control on a per-instruction basis rather than control words whose settings have side effects on all FP instructions and are expensive to change.
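For example (a sketch, not from the original answer), CUDA encodes the rounding mode into the instruction via intrinsics such as __fadd_rn() and its relatives, so changing the rounding of one operation has no effect on its neighbors:

// Per-instruction rounding in CUDA device code: _rn = to nearest even,
// _rz = toward zero, _ru = toward +infinity, _rd = toward -infinity.
__global__ void roundingDemo( float *out, float a, float b )
{
    out[0] = __fadd_rn( a, b );
    out[1] = __fadd_rz( a, b );
    out[2] = __fadd_ru( a, b );
    out[3] = __fadd_rd( a, b );
    out[4] = __fmaf_rn( a, b, 1.0f );  // fused multiply-add with a single rounding
}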

In contrast, CPUs have an excess of per-thread state and poor performance except when using SIMD instructions, which mainstream compilers are terrible at exploiting for performance (since vectorizing scalar C code to take advantage of such instruction sets is much more difficult than building a compiler for a pseudo-scalar architecture such as CUDA). And if the Wikipedia History page is to be believed, Intel and AMD appear to have completely botched the addition of FMAD support in a way that defies description.

You can find an excellent discussion of floating point precision and IEEE support in NVIDIA GPUs here.

Final Manuscript Submitted

I submitted the final manuscript last Friday, and thought I would reflect briefly on how it aligns with my original goals.

I’ve wanted to write a book on CUDA for years. Until I left NVIDIA, I was just too busy building CUDA to work on it. So when it came time to write up a proposal, I’d been thinking about the subject matter and the organization for some time.

One of the exercises that authors undertake (and that editors demand as part of a proposal) is a competitive analysis: What books already exist that address the topic? How will yours be different? Why would someone buy your book instead of one of those other books? I knew I wanted my book to be comprehensive, covering topics like the driver API, all CUDA hardware, and all CUDA-capable platforms. I wanted it to cover both software abstractions (like streams and events) and how to write CUDA kernels. And as I weighed those aspirations and put together outlines and looked at the competitive landscape, it became clear to me: Writing a book on CUDA is hard.

When I said that to Ian Buck, he got an expression like a whipped puppy. “But why?” he asked. Hearing that anything about CUDA is hard upsets Ian because he believes the key to CUDA’s success has been simplicity. I told him, “Because in order to cover a topic correctly, I wind up explaining the same thing more than once, from different perspectives.” That is true, and is reflected in my book; but even books that cover only the CUDA runtime and the latest CUDA hardware can’t get away from the fundamental difficulty of CUDA programming: performance optimization is hard, and no one uses CUDA because it is cute and cuddly. People only ever use CUDA because it can run their applications faster. The reason CUDA programmers worry about performance bottlenecks and implementation details is because those considerations are weighed by developers engaged in low-level optimization, whether on GPUs or CPUs.

In looking at the landscape, I saw a lot of books that presented a lot of material on CUDA, but hadn’t imposed much of an organizational structure. So my book is organized into three parts. Part I gives overviews of the hardware, the software, and the operating environment; Part II (“Details”) gives in-depth descriptions of various CUDA abstractions, like memory, streams and events, and texturing; Part III (“Select Applications”) covers the full gamut of classes of CUDA application: streaming workloads (where PCI Express transfer overhead figures prominently), key parallel algorithms Reduction and Scan, an illustrative compute-intensive workload (N-body, the anti-streaming workload), and an image processing workload that combines texturing and shared memory for performance.

The source code accompanying Part II consists almost entirely of microdemos and microbenchmarks, while the source code accompanying Part III consists of several implementations of the same exact operation, with different tradeoffs in performance and complexity. Some of the microbenchmarks in Part II are intended to be reused (plug your own kernel into a makework kernel, for example), while reuse of code from Part III may be trickier. I did what I could, but reuse of application-specific code is intrinsically more difficult.

The code in the Streaming Workloads chapter definitely is intended to be reused – just replace the SAXPY kernel with a different kernel, the more compute-intensive the better. Even calculations as complex as Black-Scholes options computation are transfer-bound on CUDA hardware*.

With Scan, one thing that hinders reuse is that NVIDIA has added numerous primitives that make Scan more efficient (like __syncthreads_count() and warp shuffle); for a book that aspires to cover all CUDA hardware (all the way back to the seminal SM 1.x, “Tesla” hardware), that presents a challenge. Another is that Scan has so many variants (inclusive/exclusive, segmented scan, predicate scan) that covering all permutations just of the basic primitive requires a lot of space, without covering any applications like Radix Sort or stream compaction. And if you try to cover all those permutations in the source code, you wind up with code that bears a disturbingly close resemblance to Thrust, except that Thrust was written by NVIDIA employees with a much more intimate understanding of modern C++ programming idioms than yours truly.
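To give a flavor of why warp shuffle changes the Scan picture (a minimal sketch of mine, not code from the book, written with the newer _sync form of the intrinsic), an inclusive scan across a single warp needs neither shared memory nor __syncthreads():

__device__ int warpInclusiveScan( int x )
{
    const int lane = threadIdx.x & 31;
    for ( int offset = 1; offset < 32; offset <<= 1 ) {
        int y = __shfl_up_sync( 0xffffffff, x, offset );
        if ( lane >= offset ) x += y;   // lanes below 'offset' keep their value
    }
    return x;   // lane i now holds the sum of lanes 0..i
}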

I will say this: The Scan chapter does a better job of covering the topic than any other book I have seen. I could have covered more, but then again, you could devote 100+ pages to the topic without covering everything.

For N-body, the fastest implementation just stages tiles into shared memory to make the data available to the SMs with lower latency; but after reviewing the literature on computationally dense workloads, I saw an opportunity to broaden coverage beyond that. For one thing, some applications need shared memory for purposes other than a read-only, software-managed cache. The Direct Coulomb Summation code, described by Stone et al. and covered in detail in Kirk & Hwu’s Programming Massively Parallel Processors, stages read-only atom descriptions through constant memory, working around the 64K constant memory limit by doing the computation on 4,000 16-byte atoms at a time. So my book has an implementation that mirrors that strategy, even though it is slightly slower than the shared memory formulation.
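The staging pattern itself is simple; here is a minimal sketch (hypothetical kernel and names, not the book’s code or Stone et al.’s) of pushing data through constant memory one chunk at a time to stay under the 64K limit (4,000 atoms x 16 bytes = 64,000 bytes):

#include <cuda_runtime.h>

__constant__ float4 g_atoms[4000];   // 64,000 bytes, just under the 64K limit

// Trivialized placeholder kernel: each thread folds the chunk's atoms into one output.
__global__ void accumulateChunk( float *grid, int numAtoms )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = grid[i];
    for ( int j = 0; j < numAtoms; j++ ) {
        sum += g_atoms[j].w;         // stand-in for the real distance-weighted term
    }
    grid[i] = sum;
}

void computeAllAtoms( float *d_grid, const float4 *h_atoms, int totalAtoms, int gridPoints )
{
    // Assumes gridPoints is a multiple of 128.
    for ( int base = 0; base < totalAtoms; base += 4000 ) {
        int chunk = (totalAtoms - base < 4000) ? (totalAtoms - base) : 4000;
        cudaMemcpyToSymbol( g_atoms, h_atoms + base, chunk * sizeof(float4) );
        accumulateChunk<<<gridPoints / 128, 128>>>( d_grid, chunk );
    }
}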

A colleague pointed out that some N-body computations – in particular, the ones in the AMBER molecular modeling code – exploit symmetry of forces. So I spent some time exploring ways to take advantage of the symmetry of gravitational forces, without much success. One problem is that the gravitational computation is so lightweight that doing twice as many is faster than saving the work! AMBER does its calculations on 32×32 tiles (to correspond to CUDA’s warp size), and the source code on github includes an implementation that mirrors that strategy. It is sufficiently slower that I don’t even cover the source code in the book; as a compromise, I describe the strategy in the beginning of the chapter that gives an overview of the computation.

I’m cautiously optimistic that some method of exploiting symmetric forces will bear fruit, even for fairly lightweight calculations, but I wasn’t going to let that perfect be the enemy of the book’s done.

There are a couple of self-indulgent topics in the book: float->half conversion, normalized correlation, an SSE-optimized N-body implementation. But those were all included for good reasons. I still think the float->half conversion (which may seem out of place in the Streaming Multiprocessors chapter) is one of the best ways for developers to learn about floating point precision and rounding. Normalized correlation is the only application in the book that combines texturing and shared memory. And the SSE-optimized N-body implementation gives a stark illustration of how much easier CUDA is to program, while still yielding better performance. Even when biasing the results in favor of the CPU, N-body is still 8x faster on GK104 than on a dual-socket Xeon E5-2670 machine. (Porting to AVX might double performance on the CPU side, but by the same token, running on a GK110 might double performance on the GPU side, and using a faster GPU is a whole lot less software work than porting to AVX.) The reported result may not make NVIDIA happy (since I am reporting an 8x improvement, not 400x) and may not make Intel happy (since it underscores the difficulty of SIMD coding), but I think it puts CUDA in a positive light.

The book can be preordered on amazon.com.

* Black-Scholes was the workload that I used to prove out mapped pinned memory when we added it for the chipset team in CUDA 2.2, and we were pleasantly surprised to discover that GT200 also got a big benefit. (The architect who’d invested a lot of effort into making sure GT200 would be good at system memory rendering was gratified, but not surprised.)

On the Home Stretch…25 kLOC?

Wow. Has it really been two months since my last post? The calendar says yes.

I’ve been putting finishing touches on the manuscript, and as part of that exercise, I did a line count to see how much code we can say accompanies the book.

I was surprised to see it’s slightly more than 25,000 lines!

As a reminder, the source code resides in this Git repository.

The code frequently offers multiple versions of the same algorithm, but with this book, that’s the whole point. In some cases (e.g. the Reduction chapter), readers are walked through different versions of functionally-identical code, as part of the pedagogical exercise; in others (e.g. the N-body chapter), readers are encouraged to pick through the different versions and select the one that’s the closest fit with their application.

Line counts aren’t the best way to measure code complexity, but let no one say this book doesn’t offer much sample code for its readers.

N-Body Update: Multi-GPU

The first multi-GPU implementation of N-body is done, and it’s old school: using the same thread delegation code as the multithreaded CPU implementation, it uses a separate CPU thread for each GPU.

For workloads like N-body, that’s probably not necessary – N-body is so GPU-bound that the CPU is more of a traffic cop than an active contributor to the computation – but in a world where 4-core CPUs are common and 16-core CPUs are available, it seems terribly wasteful to drive multiple GPUs from a single core. The single-threaded multi-GPU implementation is forthcoming, so we’ll know soon whether it’s a win on this particular app.

The key element of multi-GPU programming is portable pinned memory. We added this feature in CUDA 2.2, along with a host of other pinned memory-related features (mainly mapped pinned memory, but also the ability to allocate write-combining pinned memory). Unlike previous versions of CUDA, where only the context that called cudaMallocHost() could remember that the allocation was pinned, in CUDA 2.2, portable host memory allocations are available to all contexts. As a result, the application realizes the benefits of pinned memory (say, that it can be specified to cudaMemcpyAsync()) for every CUDA context – and every GPU – in the system.

Adding portable pinned memory and mapped pinned memory in the same release forced us to resolve some API design problems that might have been difficult, if we had shipped one feature and not the other. cudaHostAlloc() can perform a portable allocation, and optionally also map that memory into the CUDA address space(s) of the GPU(s) in the system. There was a strong temptation to pass back both the host and device pointer in that one API call, but at the time we could not guarantee that every GPU in a multi-GPU system would get the same GPU address. So we separated the request that host memory be mapped (cudaHostAlloc()) from the query to obtain the device pointer corresponding to a given host allocation (cudaHostGetDevicePointer()).
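A minimal sketch of that two-call idiom (mine, not the book’s code):

#include <cuda_runtime.h>

// Allocate pinned memory that is portable (usable by every context/GPU) and
// mapped (addressable by the GPU), then query the corresponding device pointer.
// (On pre-UVA CUDA, cudaSetDeviceFlags(cudaDeviceMapHost) must be called first.)
void *allocPortableMapped( size_t bytes, void **devPtr )
{
    void *hostPtr = 0;
    cudaHostAlloc( &hostPtr, bytes, cudaHostAllocPortable | cudaHostAllocMapped );
    cudaHostGetDevicePointer( devPtr, hostPtr, 0 );
    return hostPtr;
}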

With CUDA 4.0, unified virtual addressing does guarantee that the device and host pointers are all the same – but when CUDA 2.2 was being designed, the GPUs did not support 64-bit addressing and, in any case, even today not all operating systems support UVA!

CUDA 4.0 also simplified multi-GPU programming in another way: it enabled CUDA runtime applications to drive multiple GPUs from a single CPU thread*. Before CUDA 4.0, the only way to drive multiple GPUs was to spawn a thread per GPU. cudaSetDevice() had to be called before any CUDA initialization had occurred and, once it was called, that CPU thread henceforth could operate that device and only that device. This awkward API semantic – reminiscent of OpenGL and its “current context” – was definitely not the end product we wanted, but it was extremely challenging to make the CUDA driver thread-safe without incurring performance penalties due to excessive contention on the needed mutexes.

But after the driver was made thread-safe, it became possible for the cudaSetDevice() call to implement the semantic that developers tended to expect: after a cudaSetDevice() call, the corresponding CUDA context is made current to that thread and subsequent CUDA API calls and kernel launches are performed on the GPU corresponding to that context.
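In other words, driving every GPU from one thread becomes a simple loop. Here is a minimal sketch (hypothetical kernel and argument names, not the book’s code):

#include <cuda_runtime.h>

__global__ void myKernel( float *p ) { /* hypothetical work */ }

void launchOnAllDevices( float **d_args, int blocks, int threads )
{
    int deviceCount = 0;
    cudaGetDeviceCount( &deviceCount );
    for ( int dev = 0; dev < deviceCount; dev++ ) {
        cudaSetDevice( dev );                          // make this GPU current
        myKernel<<<blocks, threads>>>( d_args[dev] );  // asynchronous launch
    }
    for ( int dev = 0; dev < deviceCount; dev++ ) {
        cudaSetDevice( dev );
        cudaDeviceSynchronize();                       // wait for each GPU to finish
    }
}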

It will not take long to get the single-threaded multi-GPU implementation of N-body going, and that will conclude the multi-GPU (Chapter 9) coverage in the book. (UPDATE: the single-threaded implementation that uses cudaSetDevice() is done, and performance testing confirms that the performance difference on this workload is negligible.)

* CUDA 2.2 added the “current context” stack that could be manipulated with the driver API’s cuCtxPopCurrent()/cuCtxPushCurrent(), but few developers use the driver API.

N-Body: Multithreaded SSE

The multithreaded variant of SSE N-body is complete, and I’ve had the opportunity to gather some timing information.

Three variants of N-body were timed: single-threaded SSE, multithreaded SSE, and the shared memory (fastest so far) GPU formulation.

Three platforms were tested, two of them on Amazon EC2:

  • cg1.4xlarge (2x Xeon X5570 “Nehalem” with 2x Tesla M2050 GPUs)
  • cc2.8xlarge (2x Xeon E5-2670 “Sandy Bridge”)
  • GeForce GTX 680.

If I thought operating system mattered, I wouldn’t put these timings next to one another – the EC2 instances are running Amazon Linux but my GeForce GTX 680 is running Windows 7. With that caveat, the timings for N=4096, N=8192, N=16384 are as follows:

    N     cg1 (SSE)   cc2 (SSE)   cg1 (MT)   cc2 (MT)   cg1 (GPU)   GK104
 4096        36.24       40.31       4.78       2.79       1.35      0.66
 8192       144.50      161.32      20.13       9.45       3.83      1.60
16384       576.75      649.83      72.78      39.46      12.66      6.05

Times are in milliseconds. The only difference between the SSE and MT timings is that the MT timings were done with multiple threads.

The first thing that struck me about the timings is that on the single-threaded SSE workload, cc2.8xlarge is slower than cg1.4xlarge! This result is surprising, since Intel generally takes performance compatibility very seriously – they have a business incentive to deliver higher performance on old workloads, since that removes the need for customers to update their software to benefit from newer hardware – but it can be explained if Intel delivered the same single-threaded, clock-for-clock performance: the clock rate of the CPUs in cc2.8xlarge is 10% slower.

So the bulk of the benefit of cc2.8xlarge over cg1.4xlarge (or its GPU-less equivalent, cc1.4xlarge) is from an increased core count*. And on that front, Sandy Bridge delivers: scaling is excellent in the multithreaded case, with speedups of 14.4-17x on cc2.8xlarge versus 7.2-7.9x on cg1.4xlarge. We could double the core count again and probably still see excellent scaling.

No surprise that Intel expects us to port to AVX to get the full benefit of Sandy Bridge. But even if that delivers the expected doubling of performance, the multithreaded CPU version would still be about 3.26x slower than the GPU for the 16K-body case. And NVIDIA has a doubling of performance of its own in the offing, in the form of GK110.

And, I haven’t pushed the multi-GPU version yet. That will scale nicely in the number of GPUs.

* The number of threads selected is based on the number of cores available – in Linux, it is implemented in chThread.h as follows:

// sysconf() and _SC_NPROCESSORS_ONLN come from <unistd.h>
#include <unistd.h>

inline unsigned int
processorCount()
{
    return sysconf( _SC_NPROCESSORS_ONLN );
}

The values are 16 and 8 for cc2.8xlarge and cg1.4xlarge, respectively.
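For what it’s worth, a Windows counterpart might look like the following sketch (my illustration using GetSystemInfo(), not necessarily how chThread.h implements it):

#include <windows.h>

inline unsigned int
processorCount()
{
    SYSTEM_INFO info;
    GetSystemInfo( &info );
    return info.dwNumberOfProcessors;
}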

N-Body: FP Atomics v. Recomputation

CUDA developers who are recent refugees from the land of CPU programming often have to learn the hard way that sensible CPU optimizations don’t always work well on GPUs. Lookup tables, for example, are to be avoided. (Okay, not the best example since it is also true on modern CPUs.) Here’s a better one: registers are so precious that it’s often better to recompute results than to store them. GPUs have brought brute force back into style!

But even experienced GPU coders need the occasional reminder that the GPU’s sheer computational power often can overwhelm the benefits of small-scale optimizations. This week, as I continued developing the N-body chapter, I wanted to try out the idea of taking advantage of the fact that gravitational forces are reciprocal in character: for any two bodies i and j, f_ji = -f_ij. By applying this optimization, only half as many of the expensive inner-loop body-body computations (estimated at 20 FLOPS) need to be performed.

The original N-body paper dismisses the idea in a footnote: “…this optimization has an adverse effect on parallel evaluation strategies (especially with small N), so it is not employed in our implementation.”

At first blush, subtracting the 3D force from the correct force vector (3 FLOPS) seems a lot cheaper than the computation needed for a body-body interaction: taking a 3D vector difference, computing a dot product and reciprocal square root, and doing a number of multiply-adds. But there is a problem: another thread is concurrently computing the acceleration sum from which the body-body gravitational force values must be subtracted. Some form of synchronization is needed.

To address that problem, the Kepler architecture gives us a tool that wasn’t available when the original N-body paper was written: fast atomics. (Fermi-class hardware added the ability to do atomic floating point add, but it was prohibitively slow until Kepler.) Multiple threads can use atomic adds on the same memory locations, and the hardware will enforce mutual exclusion to prevent race conditions from corrupting the result. Using atomic add, a few small modifications enable the inner loop to perform half as many body-body interactions and add the negated force to the other body’s sum:

 <        for ( int j = 0; j < N; j++ ) {
 >        for ( int j = 0; j < i; j++ ) {
             float4 body = ((float4 *) posMass)[j];
 
             T ax, ay, az;
             bodyBodyInteraction( ax, ay, az, myX, myY, myZ, body.x, body.y, body.z, body.w, softeningSquared);
 
             acc[0] += ax; acc[1] += ay; acc[2] += az;
 
 >            float *f = &force[3*j+0];
 >            atomicAdd( f+0, -ax );
 >            atomicAdd( f+1, -ay );
 >            atomicAdd( f+2, -az );
 
         }

By the way, the address arithmetic is written that way because the compiler generates much better code than if you write the equivalent (and more readable):

             atomicAdd( &force[3*j+0], -ax );
             atomicAdd( &force[3*j+1], -ay );
             atomicAdd( &force[3*j+2], -az );

A subtle benefit of this approach is that the atomics hardware complements the floating point hardware in the Streaming Multiprocessors. Unfortunately though, I can report that the atomics-based implementation isn’t any faster on GK104: to process 8192 bodies (about 67M interactions), the three implementations (atomics, naive, shared) perform as follows:

Atomics: 90 ms
Naive: 62 ms
Shared: 55 ms

This negative result can be explained if you cast it in the light of GPUs’ predilection for brute force: given a choice between 3 FLOPS and 3 atomic adds to disparate memory locations, or 20 FLOPS done on uniformly-referenced, read-only memory, the brute force 20-FLOP solution still wins.

Lesson learned.

It is possible that the global memory references by this kernel are not ideal (especially if it’s camping on a subset of the hardware that implements atomics), and it might be faster to conserve external bandwidth, perhaps taking inspiration from a tiled matrix transpose implementation – but the optimized implementation would have to outperform a shared memory implementation that’s almost twice as fast.

As with all CUDA handbook source code, this kernel is open source and available on github.

Multithreaded SSE Timings

The N-body code has been posted to github; it has not yet been ported to Linux, but that will not be difficult to do. It does not use graphics, it is just a Win32 console application that responds to keyboard input (ironically, other than porting the threading code, the keyboard support is the hard part about the Linux port of this app).

So far there are 8 formulations:

1) CPU_AOS (array-of-structures implementation) – the gold standard to which we compare other implementations,
2) CPU_SOA (structure-of-arrays implementation) – a stepping stone to the SSE implementation (and surprisingly faster than the AOS version),
3) CPU_SSE (SSE implementation) – I blogged about this a few weeks ago,
4) CPU_SSE_threaded (multithreaded SSE implementation) – uses N threads on an N-core processor,
5) GPU_AOS – a straight port of the CPU_AOS code,
6) GPU_Shared – GPU_AOS, optimized to use shared memory per the original CUDA N-body paper,
7) GPU_Atomic – a Kepler-specific version that uses atomics to do half as many body-body interaction computations (but isn’t faster – I plan to blog about that later),
8) GPU_Shuffle – a Kepler-specific version that uses the new warp shuffle instruction instead of shared memory.

Here are some initial timing results for 16K bodies:

CPU_AOS – 4494 ms (60.0 Minteractions/s)
CPU_SOA – 3210 ms (83.6 Minteractions/s)
CPU_SSE – 612.6 ms (438 Minteractions/s)
CPU_SSE_threaded – 192 ms (1400 Minteractions/s)
GPU_AOS – 62 ms (4300 Minteractions/s)
GPU_Atomic – 93 ms (2886 Minteractions/s)
GPU_Shared – 55 ms (4881 Minteractions/s)
GPU_Shuffle – 58.7 ms (4573 Minteractions/s)

The fastest GPU performance is about 3.5x faster than the multithreaded SSE implementation. Another way to look at it: the SSE implementation will run on any CPU built since 2001 or so, and it’s only a factor of 3.5 slower. Then again, the code is super-gnarly compared to the CUDA code, especially the threading code. (Don’t take my word for it – take a look for yourself.) So here again, CUDA “wins” as much on programmability as on performance.

The interesting thing is that this data is gathered on a rather old CPU (a 2.8 GHz Intel i7 “Nehalem”), and the GPU is a relatively recent GK104. Once I get the code running on a contemporary CPU (say, a Sandy Bridge) I’ll report timings on that platform.

I did take a look at an AVX port, but that seemed even more off-topic than the SSE port, which was a stretch. That project will have to wait until the book’s done.

Next up: a Linux port and multi-GPU support.

Rough Cuts

I have updated the Web site’s Sample Chapters section to point to The CUDA Handbook‘s page in Safari Books Online’s Rough Cuts. Rough Cuts gives early access to books while they are still in progress (3-6 months from publication) and includes tools for readers to give feedback to the author while there’s still time to make changes.

Several chapters are still missing, but they will be uploaded soon.

I’m still interested in reviewers who would be willing to take a look at sample chapters and submit feedback. Write me at nicholas@archaeasoftware.com.