Category Archives: Uncategorized

CUDA Graphs, ROI, and API Adoption

CUDA 10 adds a new API called “CUDA Graphs” that are immediately familiar to graphics API designers: they are a scene graph API for compute. Scene graph APIs enable developers to describe geometry at a “higher”™ level, in ways that express the relationships between, say, rooms and doorways within a castle or the arms and legs of a 3D character. The idea is that with this additional information, the API implementor (in this case, NVIDIA) can write code that will traverse the scene graph (say, rendering the characters with their limbs animated) more efficiently than code written by the developer. Either that, or the scene graph API is sufficiently easier to learn than learning how to write the scene graph code that developers can achieve faster time-to-market by learning and using the scene graph API.

I am skeptical that CUDA Graphs will achieve adoption outside NVIDIA’s SDK samples.

API designers drive adoption by maximizing the return on investment, where the return is efficient, working code and the investment is developer time. APIs that are not easy to learn are disadvantaged because every developer who writes or maintains the code must invest in learning the API. APIs that don’t deliver a compelling performance advantage must be *very* easy to learn, hence conferring an expressive advantage. (i.e. faster development times.)

CUDA adoption has been driven by delivering huge performance gains (the return) despite a steep learning curve (the investment). (It makes for an interesting thought-experiment to wonder why CUDA has succeeded and other manycore platforms have not. Although this blog post does not touch on the issue, customer investments must be considered in addition to developer investments.)

An early API (in fact, it was created in the 1970s, long before the term “API” had been invented) that delivers high ROI is BLAS, the Basic Linear Algebra Subprograms. Originally written in FORTRAN, the motivations for this library were twofold: to “provide names and argument lists that might become widely used and recognized for some of the basic operations of computational linear algebra,” and “to improve efficiency of math software.” BLAS code is reasonably performance- and platform-portable. As the underlying platforms evolved, the same BLAS code benefited transparently from assembly language hand-coding to cache blocking to SIMD instruction sets. There was no need to update the API client code as the implementation changed underneath. BLAS has achieved widespread adoption in numerical code, amplifying developers’ expressive power and enabling them to leverage the development effort invested by others in its implementation. At this point, BLAS gets an inordinate amount of attention from hardware vendors, making it unlikely that developers can match its performance without exploiting a priori knowledge of their application requirements. It takes time to learn, but delivers a considerable return on that investment.

On the other end of the spectrum, an API that has high ROI by minimizing developer investment is malloc()/free(). Learning first-hand the difficulty of writing a fast, robust memory allocator has been an inflection point in many junior developers’ careers – it’s harder than it looks. Other APIs that deliver a high return with minimal investment: the thread synchronization APIs built into operating systems. They are not hard to learn and, for most developers, impossible to implement.

In the early days (DirectX 2.0-3.0), Direct3D had a scene graph API called the “retained mode,” but the last version shipped in 1996. No one was using it, despite heroic evangelism efforts by its developers. Developers could use “immediate mode” APIs to implement their own scene graphs more efficiently – both in terms of developer time and in terms of high-performance implementations of the operations they needed. As an added bonus, by writing the scene graph traversal themselves, developers kept all the IP in-house (e.g., their visibility algorithm) and, if there was a bug, they could fix it in their code on their own schedule.

Since game developers co-design their content development tools with the runtime, a great deal of intellectual property is encapsulated in the scene graph traversal. In a sense, 3D scene graph API designers were aspiring to co-opt developers’ core IP – never a winning proposition for a platform.

I suspect that CUDA developers will come to similar conclusions with the CUDA Graphs. No one will use them unless they deliver a return on investment in the form of higher performance, or greater expressiveness commensurate with the effort to learn the APIs. Higher performance will be difficult to achieve since CUDA gives developers ready access to the underlying tools used by the CUDA Graphs.

One possible opportunity for NVIDIA: perhaps CUDA Graphs will be an efficient way to enable concurrent execution of kernels that weren’t designed to run in streams? CUDA streams are like const correctness – it is difficult to retrofit code to use them because they must be plumbed into interfaces from top to bottom. An alternative to revisiting interfaces top-to-bottom is to add a “current stream” API (as CUBLAS did), but current-anything APIs interoperate poorly and tend to be inefficient at changing the current-thing. More importantly, the current-thing state must be saved and restored across interfaces.

So one path to adoption for CUDA Graphs may be an efficient way to enable concurrent execution of kernels that weren’t designed to use streams. But in general, like immediate-mode graphics APIs, most developers will be able to more quickly write their own code expressing the dependencies in their application than it would take to learn and use the CUDA Graphs APIs. And developer-authored code will run at least as fast, paying tribute to the First Law Of CUDA Development.

Unless CUDA Graphs deliver a high ROI, they will go the same way as other features that Seemed Like A Neat Idea At The Time, like dynamic parallelism and managed memory.

Don’t Move The Data!

NVIDIA just delivered their first Volta-enabled DGX-1 systems – great news for those who need the additional compute power of GV100 versus GP100:

GP100 GV100
FP32 Compute 10.6 TFLOPS 15.0 TFLOPS
FP64 Compute 5.30 TFLOPS 7.50 TFLOPS
Memory Bandwidth 720 GB/s 900 GB/s


Wait, you say, that’s an interesting qualifier. Who doesn’t “need the additional compute power…?” Did someone hack into Nick’s blog account and post on his behalf? Or has he become a Luddite in his dotage?

Nope, no, I still think more compute is generally better; but it is past time to question the architecture of these systems with huge, discrete GPUs connected to the world by buses. The problem with DGX-1 is that those GPUs are hungry! They need to be fed! And they can only sip data through the tiny soda straw known as the PCI Express bus.

For perspective, let’s compare these chips to G80, the first CUDA-capable GPU. Let’s set the stage by observing that G80 was the largest ASIC NVIDIA could feasibly design and fabricate in 2006, straining the limits of contemporary fabrication technology – a classic “win” chip. It had 684M transistors, a theoretical maximum performance of 384GFLOPS for single precision, and no support at all for double precision. GP100 and GV100 respectively have 22x and 31x more transistors, and 27x and 39x more single precision performance than G80. But the bandwidth to deliver data to and from these GPUs has not been increasing commensurately with that performance.

Here’s a table for all 3 GPUs – G80, GP100 and GV100 – that highlights the FLOPS/byte of bandwidth for device memory (attached to the GPU), NVLINK (NVIDIA’s property GPU-GPU interconnect), and PCI Express:

G80 GP100 GV100
GFLOPS (SP) 384 10600 15000
GPU↔GPU memory 84 GB/s 720 GB/s 900 GB/s
FLOP/Byte 4.5 14.7 16.67
GPU↔GPU n/a 20 GB/s 20 GB/s
FLOP/Byte 530 750
CPU↔GPU 3.1 GB/s 3.1 GB/s 3.1 GB/s
FLOP/Byte 124 3419 4839

The 3.1GB/s figure comes from dividing the available PCIe bandwidth by the number of GPUs in the system. Two 16-lane PCIe 3.0 connections are about 25 GB/s observed, and there are 8 GPUs.

As the number of FLOPS per byte of I/O diverges, the number of workloads that benefit from more FLOPS diminishes. Googling around for literature on FLOPS/byte, I ran across this 2011 presentation by Peter Kogge entitled “Hardware Evolution Trends of Extreme Scale Computing.” For anyone in the GPU business, the first sign that something’s amiss crops up in Slide 3, which cites “1 byte/FLOP” as the “classical goal.” Even G80’s device memory fell well short of that goal with 4.5B/FLOP (here, higher numbers are worse). I prefer this framing because it adopts the viewpoint of scarcity (bytes/FLOP – getting data in and out for processing) rather than abundance (FLOPS/byte – having lots of processing power to bring to bear on data once it is in hand).

The presentation is from 2011, but still very relevant: after reviewing Moore’s Law and the rise and fall of Dennard scaling, and the preeminent importance of power dissipation in modern computing, the concluding slide reads in part:

  • World has gone to multi-core to continue Moore’s Law
  • Pushing performance another 1000X will be tough
  • The major problem is in energy
  • And that energy is in memory & interconnect
  • We need to begin rearchitecting to reflect this …

“DON’T MOVE THE DATA” has been good advice to everyone who’s had the data for decades (in 1992 I wrote a Dr. Dobb’s Journal article that focused on hand-coding x87 assembly to keep intermediate results in registers)… but the advice has more currency now.

Moving The Data on CPUs

The data/compute conundrum finds expression on modern multi-core CPUs, too. Each core on a modern x86 CPU has ILP (instruction level parallelism) of 5, meaning it can detect parallelism opportunities between non-dependent instructions and execute up to 5 instructions in a single clock cycle. Latency to the L3 cache is about 50 clock cycles. So a CPU core can perform dozens of FLOPS on data in registers during the time it takes for the L3 to service a load (conservatively – 2 of the 5 pipelines can do 8 FLOPS per instruction via AVX). And that’s assuming the data was in cache!

As an aside, this observation helps explain why “optimized” numerical Python code is still dead slow. Python is interpreted, so has a library called Numpy that wraps vectorized implementations of operations that do things like element-wise addition or multiplication between arrays. But for arrays that don’t fit in cache (and to some extent, even arrays that do fit in cache), it is very inefficient to do multiple passes over the data if the computation could have been fused into a single pass. The code spends all of its time moving data, and very little time processing it.


A Gift From Heaven: Deep Learning

Which workloads, pray tell, require endless FLOPS per byte of I/O? Or turn it around and ask, which workloads still thrive when there is barely any I/O per FLOP? NVIDIA hasn’t been shy about trumpeting its solution to this problem: deep learning! Training a deep learning network entails refining floating point weights that roughly represent neurons that “learn” as they are trained on the data. As long as the weights can reside in device memory, only a modest amount of I/O is needed to keep the GPU busy. In retrospect, NVIDIA is extremely fortunate that deep learning cropped up. Without it, it’s not clear what workload could soak up all those FLOPS without the GPUs starving. The importance of machine learning as a workload helps explain why GV100 contains purpose-built hardware for machine learning, in the form of the TensorCore. But that hardware actually exacerbates the GPU starvation problem, by increasing FLOPS without increasing bandwidth.

NVIDIA probably isn’t comfortable betting the farm on a single workload – especially one where their main customers are enterprises that can invest in their own machine learning hardware and that is attracting VC money for application-specific hardware. How do you hedge? How can NVIDIA relieve the bottleneck? Unless some workload materializes that is as compute-intensive (per byte of I/O) as machine learning, NVIDIA must seek out ways to address their GPUs’ I/O bottleneck.

I/O: NVIDIA’s Strategic Landscape

The problem confronted by NVIDIA is that they are hindered by some business and legal challenges. According to the terms of their 2011 settlement with Intel, 1) They do not have a license to Intel’s industry-leading cache coherency protocol technology, and 2) they do not have a license to build x86 CPUs, or even x86 emulators.

NVIDIA has done what they can with the hand they were dealt – they built GPUDirect to enable fellow citizens of the bus (typically Infiniband controllers) to access GPU memory without CPU intervention; they built NVLINK, a proprietary cache coherency protocol. They have licensed NVLINK to IBM for the POWER architecture and signaled a willingness to license it to ARM licensees. The problem is that POWER and ARM64 are inferior to Intel’s x86, whose high-end CPU performance is unmatched and whose “uncore” enables fast, cache coherent access across sockets. NVIDIA itself, though an ARM licensee, has announced that they will not be building a server-class ARM chip.

I’m not sure why NVIDIA announced they would not be building their own ARM to drive their GPUs, because that seems like an obvious way for them to own their destiny. It may be that NVIDIA concluded that ARM64 cores simply will never deliver enough performance to drive their GPUs. That’s too bad, because there is a lot of low-hanging fruit in NVIDIA’s driver stack. If they made the software more efficient, it could either run faster on the same hardware or run at the same speed on lesser hardware – like ARM64 cores.

Not being able to coordinate with Intel on the cache coherency protocol has cost NVIDIA big-time in at least one area: peer-to-peer GPU traffic. Intel could, but chooses not to, service peer-to-peer traffic between NVIDIA GPUs at high performance (Intel and NVIDIA give different stories as to the reason, and these conversations happen indirectly because the two companies do not seem to have diplomatic relations). As things stand, if you have a dual-CPU server (such as NVIDIA’s own DGX-1) with cache coherency links between the CPUs, any peer-to-peer GPU traffic must be carefully routed past the CPUs, taking care not to cross the cache coherency link. If Intel could license QPI to Altera, they could license it to NVIDIA. Failing to do so is a matter of choice and a by-product of the two companies’ respective positions in the business and legal landscapes.

As things stand, NVIDIA is dependent on Intel to ship great CPUs with good bus integration, and peer-to-peer-capable GPU servers have to be designed to steer traffic around the QPI link. The announcement that NVIDIA would not build ARM64 SOCs was done in 2014, so now that the competitive landscape has evolved (and though I can remember when Intel’s market capitalization was 12x NVIDIA’s, it is now only about 1.7x), it would not surprise me if NVIDIA revisited that decision.

One Path Forward: SoCs

One partial solution to the interconnect problem is to build a System on a Chip (SoC): put the CPU and GPU on the same die. Intel and AMD have been building x86 SOCs for many years; it is Intel’s solution to the value PC market, and AMD has behaved like their life depended on it since 2006, when they acquired GPU vendor ATI. NVIDIA’s Tegra GPUs are all ARM SoCs. The biggest downside of SoCs is that the ratio of CPU/GPU performance is fixed years before the hardware becomes available, causing workloads to suffer if they are more CPU- or GPU-intensive than the SoC was designed to address. And if the device doesn’t have enough performance, scaling performance across multiple chips may be more difficult because GPUs require such high bandwidth. A conspicuous success story for big SoCs has been in the gaming console market, where the target workload is better-understood and, in any case, game developers will code against whatever hardware is in the console.

So I suspect that as workloads continue to tap out the FLOPS and balance out the bandwidth/FLOPS, big SoCs will start to make more sense. In sizing the CPU/GPU ratio, hardware designers can create a device with the biggest possible GPU that doesn’t starve with the available bandwidth.

SoCs are just a stopgap, though. As the laws of physics continue to lower the boom, the importance of system design will continue to increase, as Kogge pointed out in his 2011 presentation. The fundamental problem of the speed of light isn’t going away… ever.


Managed Memory and Segmentation

At the GPU Technology Conference this year, I ran into an old colleague from NVIDIA and the topic of managed memory came up. He related that earlier in the conference, the Q&A session after a Dell-sponsored presentation on managed memory had suffered a serious decline in the level of discourse. Neither of us had been in attendance, but apparently one questioner after another stood up and asked the presenters, in effect, “Where are you going with this?”

He made it sound like the presenters thought they were going to be pelted with rotten fruit!

Why the controversy? Managed memory is supposed to make CUDA programming simpler. It is intended to eliminate the need to copy data back and forth between CPU and GPU memory. If it were performance-neutral like, say, C++ lambdas, then managed memory would be a welcome addition to CUDA. The problem is that any feature that presents a risk of degrading performance will be viewed with skepticism by the CUDA programming community – because no one programs CUDA for fun.

What is it about CUDA that makes it so difficult to automatically manage residency of memory?

Reflecting on the answer to this question brought me back to a conversation I had at a different GTC, so long ago that I’m not sure which year it occurred. I had just met Daniel Moth, the Program Manager at Microsoft for C++ AMP. Once it was firmly established that we were fellow travelers, charting the technical roadmaps for competing data parallel programming environments, he had a question.

“Tell me one thing,” he asked. “Why do you need streams and events?”

I had to think for a minute. Why indeed? I’d added the feature in CUDA 1.1, to cover new hardware that could DMA host memory concurrently with kernel execution; but it was already clear that the new abstractions were future-proof to multiple kernels executing concurrently, and even coordinating execution between multiple GPUs.

“Streams are like CPU threads,” I told him stupidly, quoting from the original design document I’d written in 2007. “Operations that are done in different streams can happen concurrently. And you need events to coordinate execution between streams.”

“But we don’t need that stuff in C++ AMP. The stuff that can be done in parallel, we just do it in parallel.”

After a few minutes’ conversation, the key difference emerged and I finally had it.

“Oh,” I cried. “CUDA has a flat address space!”

C++ AMP does not.

CUDA’s address space causes more trouble than is widely appreciated. Because pointers can be stored in device memory, any CUDA kernel can attempt to access any address. In early versions of CUDA, where paging is not supported (every byte of virtual memory is backed by physical memory) and systems with multiple GPUs were rare (and certainly not for sale in the public cloud), having an address space seemed to make sense. That impression was bolstered by the hardware design community’s ideological commitment to linear address spaces, which had taken root after a divisive debate contrasting linear address spaces with segmentation.

Linear Addressing versus Segmentation

Segmentation is the idea that memory should be modeled as a set of discrete buffers with base pointers and lengths, rather than assigning an address (like a PO Box) to each byte of memory. Segmented memory is accessed via a segment/offset tuple instead of by a single address, a paradigm that is implemented at the hardware level. Intel’s x86 architecture was segmented from the beginning (c. 1976). It provided for 4 segments to be accessible at any given time: the segment registers CS, DS, SS, and ES were for code, data, stack, and “extra” data, respectively. Each segment register had a base address and a length, and most machine instructions implicitly referenced a segment that represented a sensible default. The PUSH and POP instructions that operate on the stack implicitly referenced the stack segment (SS). Loads and stores from memory implicitly used DS, the data segment, unless that default was overridden by a “segment prefix” instruction. For example, the SS: prefix could be used to operate on stack memory.

The problem with segments was that they made code difficult to compose: even simple operations like function calls were complicated by potential differences between the segment register settings needed by the caller and callee. The callee could save and restore its segment registers at the subroutine boundary, but that hurt performance. More typically, developers would select a “memory model” with fixed segmentation usage that was appropriate for their application. So-called “large” memory models would just specify a segment:offset tuple for every address; under MS-DOS, this amounted to a cheesy way to enable 20-bit addressing with 32-bit addresses, or 1M of memory with 4G worth of address width. It also hurt performance since every load and store needed a segment override.

Segmentation introduced difficult, but solvable, problems for developers of individual applications; but even 25 years ago, it was clear that plugin architectures like OLE automation would play a central role in future software development. Being able to load code and data dynamically into an application and have it “just work,” without having to worry about segments, was of paramount importance. The ability for libraries to efficiently access their callers’ data, and process it on their behalf, overrode the concerns that buggy code could corrupt data that happened to be accessible.

Segmentation and flat addressing can be reconciled by enabling large segment offsets and having the operating system map all the segments to cover the same address range. This usage was anticipated when the Intel i386 was released in 1986, and implemented in 32-bit multitasking operating systems like UNIX (or Microsoft’s long-lost Xenix), and later, OS/2 and Windows NT. This paradigm was so popular, and the need for segmentation support in hardware so unclear, that AMD mostly did away with segment registers when they revised x86 to enable 64-bit addressing in the early aughts.

When I wrote the specification for CUDA textures, with a clear separation between memory and views on the memory (CUDA arrays and texture/surface references, respectively), it quickly became clear that CUDA arrays were effectively segmentation. A CUDA kernel can’t access just any CUDA array; the CUDA driver must predeclare the CUDA arrays to be accessed by a kernel. Coupled with other per-launch parameters, such as the amount of shared memory and the number of registers needed, a CUDA kernel launch more closely resembles a container launch than a subroutine call.

A key reason segmentation was an abject failure for general-purpose computer architectures was the high cost of “switching segments” on a per-instruction basis. On x86, instructions such as LDS (load data segment) were costly; instruction prefixes to change the segments being operated on by a given instruction added complexity; and naïve systems that kept segment:offset tuples for all pointers essentially wasted addressing bits. Now that we have 64-bit addressing, it is possible to envision having page tables play the role of segments (by introducing a byte-granular limit to page table size), as argued in this blog post. For now, however, there is a decisive consensus in favor of flat address spaces.

What does all this have to do with managed memory?

By implementing segmentation on a per-kernel basis instead of a per-machine-instruction basis, GPU computing technologies get many of the benefits of segmentation, without the costs that hindered adoption on the CPU side. Kernels may take slightly longer to launch than they would otherwise, but the cost of a kernel launch is high enough that the additional cost of segmentation is negligible. And if each kernel launch predeclares the needed segments, the system can infer residency requirements, ensure coherency, and identify parallelism opportunities, much in the same manner that superscalar CPUs use real-time dependency analysis to identify which instructions can execute in parallel.

What do you mean by “infer residency requirements,” you ask? You guessed it: managed memory!

What do you mean by “identify parallelism opportunities,” you ask? You guessed it: automatic CUDA streams!

What about coherency? Not much would change here. The CUDA driver already uses software mechanisms to enforce coherency, for example, by inserting cache-invalidate instructions into the command stream before launching kernels that read from texture. In a segmented memory architecture, read-only segments can be copied where they are needed, then discarded without having to worry about propagating changes to the data. Writeable segments could be copied back wholesale, or using dirty bit optimizations.

So, it is not hard to imagine a GPU computing technology that uses segmentation to manage memory rather than a linear address space. In fact, we do not have to imagine OpenACC – it’s already here – and for CUDA, programs that used only CUDA arrays would have the properties needed to automate residency and parallelization. As a side note, the WDDM display driver model introduced in Windows Vista embraced a segmented memory architecture for paging.

Let’s review some of the deficiencies in managed memory, as discussed in my previous blog. It attempts to infer residency requirements based on memory accesses – which hurts performance and breaks the First Law of CUDA Development. It breaks the useful ability to infer the “owner” (CPU or which GPU) of a given address in the Unified Virtual Address Space. The semantics of multi-engine and multi-GPU memory management are complicated, and require hinting. Even if we set aside warranted skepticism about whether the hinting will be future-proof (I have my doubts), it introduces enough complexity that managed memory does not compare favorably to the static, affinitized allocations like CUDA 1.0 or segment-based architectures like OpenACC.

Managed Memory: Belated Comments on Implementation

Managed memory is a new-ish CUDA feature that aspires to do away with the need to explicitly copy CPU memory to and from GPU memory. Introduced in CUDA 6.0, its initial implementation was unusably slow. (For example, copying managed memory from GPU to CPU memory ran at 512MB/s, 25x slower than an asynchronous memcpy.)

When they undertook to build the managed memory feature, NVIDIA had many different implementation strategies they could have pursued. As best I can tell, here is a summary of their implementation strategy:

  1. Upon allocation of managed memory, the CUDA driver allocates device memory, plus a pageable range of CPU memory at the same virtual address range.
  2. The CUDA driver use dirty page bits to track which 4K pages were “touched” by the CPU.
  3. Upon kernel launch, the CUDA driver would unmap the managed memory from the CPU and copy the dirty pages from the CPU to the GPU. Unmapping the CPU memory removes the risk of write-after-read hazards from the CPU corrupting managed memory before the GPU was able to copy it.
  4. While CUDA kernels run, the device memory copy of the managed memory is the only valid one.
  5. Upon CPU/GPU synchronization, the CPU buffer is made accessible again, but is not copied wholesale from GPU memory. It is possible the GPU’s hardware does not have the same dirty bit tracking facilities as the CPU, or perhaps NVIDIA just thought it would be preferable to copy device memory back to the CPU “on demand.”
  6. Copying managed device memory back to host memory is prompted by page faults: when the CPU attempts to access a page of managed memory, the CUDA driver handles the page fault by copying the 4K of GPU memory to CPU memory.

The application I used to investigate NVIDIA’s managed memory implementation is only about 60 lines of code. The key component is a function usPerLaunch that allocates a specified amount of managed memory, launches a NULL kernel, synchronizes with the GPU, then optionally “touches” the managed memory to force the CUDA driver to copy it back to host memory. (In an earlier version of this test, I confirmed that CUDA lazily copies only “dirty” pages in the other direction, as NVIDIA claims in its documentation.)

Here is the function in question – it may be found in cudahandbook/concurrency/

const size_t pageSize = 4096;
usPerLaunch( int cIterations, size_t cPages=0 )
    cudaError_t status;
    double microseconds, ret;
    chTimerTimestamp start, stop;
    void *p = 0;

    cuda(Free(0) );
    if ( cPages ) {
        cuda(MallocManaged( &p, cPages*pageSize ) );

    chTimerGetTime( &start );
    for ( int i = 0; i < cIterations; i++ ) {
        cuda(ThreadSynchronize() );
        if ( bTouch && 0 != p ) {
            for ( int iPage = 0; iPage < cPages; iPage++ ) {
                ((volatile unsigned char *) p)[iPage*pageSize] |= 1;
    chTimerGetTime( &stop );

    microseconds = 1e6*chTimerElapsedTime( &start, &stop );
    ret = microseconds / (float) cIterations;
    cudaFree( p );
    return (status) ? 0.0 : ret;

I ran this program on a Haswell-based Windows 7 machine on two NVIDIA GPU boards: the NVIDIA GeForce GTX 970 and Titan X (GM200 and GP100, respectively). Although both are large “win” chips, I would expect similar test results to hold true across all Maxwell and Pascal GPUs, since they seem to have implemented a hardware interface that improved managed memory performance.


Launch time (ms) Memory (KB) Bandwidth (MB/s)
47 0
105 4 39
104 8 78
115 16 143
134 32 244
213 64 307
381 128 344
649 256 404
1247 512 420
2221 1024 472
4712 2048 445
8458 4096 496
17041 8192 492
33992 16384 494

Table 1. GM200 results.

Launch time (ms) Memory (KB) Bandwidth (MB/s)
39 0 0
47.15 4 7
49.86 8 164
57.84 16 283
59.04 32 555
64.73 64 1012
79.08 128 1657
98.41 256 2664
137.15 512 3823
205.56 1024 5101
391.91 2048 5351
745.81 4096 5624
1543.91 8192 5433
3114.83 16384 5386

Table 2. GP100 launch results.

“Better,” however, does not mean “good.” The most important thing to note is that these kernel launch times are VERY SLOW. You can measure synchronous and asynchronous kernel launch times with the and programs in the same directory. On this machine, those times are 46.35 and 3.25 microseconds, respectively. (In fairness, results likely would be better under Linux, especially the synchronous kernel launch. On Windows 7, launching a CUDA kernel always requires the driver to have the operating system do a user-kernel transition or “kernel thunk.” Sadly, no amount of editing can get around the sad fact that CUDA kernels and OS kernels are completely different things and some sentences must refer to both!)

On the Maxwell machine, whatever mechanism NVIDIA is using to copy managed memory back from the GPU has a maximum performance of less than 500MB/s. That’s a nonstarter. It is more than 25x slower than the bus bandwidth. Pascal has improved things, but is still less than half the performance of a PCI Express 3.0 link. A CUDA kernel reporting results via mapped pinned memory would achieve much higher performance.

Superficially, NVIDIA’s implementation makes sense, assuming there is one CPU and one GPU and that the application isn’t doing any fancy tricks with CPU/GPU concurrency. The main mistake in their implementation was failing to speculatively copy extra pages back from the GPU to the CPU in Step 6, an oversight that seems to have been remedied in subsequent releases. The overhead of servicing the page fault is so high that it’s dominated by interrupt handling, not copying of a 4K page, so it makes sense to copy more pages on the page fault until the overhead of the additional copying becomes non-negligible.

Less clear, however, is the optimal behavior of managed memory in a system with multiple GPUs. Does a managed memory buffer get allocated for each GPU? When a kernel is launched on GPU 0, do the other GPUs get copies of the managed memory? Which memory ranges are valid for which GPUs as kernels are executing? And it seems clear that managed memory can’t possibly retain the property that the “owning” device can be inferred from a UVA address, by e.g. calling cudaPointerGetAttributes().

The paradigm also breaks for applications that perform memory copies and kernel processing concurrently.

I submit that the APIs needed to “enlighten” the managed memory subsystem to do the right thing, are at least as complicated as simply writing the CUDA code to explicitly allocate and copy memory.

More on managed memory soon.

Ten Years Later: CUDA Succeeded Despite…

After posting a list of reasons why CUDA succeeded, it seems worthwhile to reflect on some of its apparent vulnerabilities, and why CUDA has been successful despite those issues.

CUDA Succeeded Despite…

1. Being Proprietary.

NVIDIA builds the hardware and software to run CUDA applications and has never licensed the technology to anyone else. Conventional wisdom in the industry holds that proprietary software technologies are doomed to failure – they don’t get shepherded well by a single owner, and they don’t gain adoption by developers. But by making CUDA software portable to everything from Linux to Windows to MacOS, and making CUDA hardware available in a broad range of products from SOCs (Tegra) to high end servers (DGX-1), NVIDIA has staved off the risks they incurred by going it alone.

2. Explicit Memory Management.

It’s every new CUDA programmer’s rite of passage: As if allocating and copying input and output data to and from device memory weren’t enough trouble, developers also explicitly manage shared memory to facilitate data interchange between threads.

Fortunately for NVIDIA, due to the First Law of CUDA Development, developers haven’t been fazed by the need to learn these idiosyncrasies.

3. Limited Cache Coherency.

Some rules of thumb have been internalized by hardware designers to such a degree that they are not so much sound engineering practices, but religious edicts. One such rule is that caches have to be coherent. All the time. In hardware.

But CUDA is pervaded by violations of this tenet. Device memory is not coherent with host memory. Shared memory effectively resides in a separate address space, so isn’t coherent in the same sense as an L1 cache. Constant and texture memory are not coherent with device memory, and when changes are made to the memory, the illusion of coherence is maintained via software invalidation. As with explicit memory management, developers are willing to treat the lack of cache coherency as a cost of doing business – as long as they get the performance they crave.

4. Limited PC market share.

Discrete GPUs only occupy about 25% of PC market share by unit volume, and NVIDIA competes with AMD in that space. NVIDIA’s limited market share helps explain why CUDA has had limited success achieving developer adoption in packaged PC software, even when there’s a good fit with the software requirements.

Put yourself in the shoes of an engineering director at (say) Adobe. “Port this code to CUDA,” says NVIDIA, “and it will run much faster… on 18% of your potential customers’ machines.” Even that proposition is sketchy when accounting for the costs and benefits of supporting the full range of CUDA GPUs extant.

But for vertical applications (think HPC), CUDA developers build data centers with thousands of identical servers. And for embedded applications (think automotive), every GPU in a given design win has identical properties. In both cases, developers have a fixed hardware target to develop against, and they get a compelling return on the engineering investment of the CUDA port.

In the longer term, companies like Adobe and Autodesk should be able to gain the same benefits by transitioning to cloud-provisioned GPU platforms.

Ten Years Later: Why CUDA Succeeded

CUDA first became available about 10 years ago, so it seems like a good time to take note of its success and reflect on why it has been successful.

1. GPUs are not CPUs.

What I mean by this is not just that you don’t have to recompile your app (this point gets its own bullet later in this article), but that core operating system changes are not needed for GPU support. GPUs are complicated peripherals, but when the rubber meets the road, they are still just peripherals. They hang off the bus, get enumerated by the OS, get a driver loaded, and go. Proponents of competing technologies such as the Cell processor or Larrabee (now Xeon Phi) would have you believe otherwise, but GPUs have been served well by the flexibility and platform portability that comes with being a “dumb peripheral.”

2. GPUs are everywhere.

Jensen Huang has said the GPU had a “day job.” NVIDIA had an established, high-volume market for their ASICs. The overlap in requirements between a big, fast graphics chip and a general-purpose manycore processor was significant, but it wasn’t obvious to all that the incremental cost would be worth it. I personally had lunchtime arguments with senior graphics architects at NVIDIA who didn’t want to spend 10% die area on compute (the estimated hardware cost of adding support for scatter/gather and shared memory) because it would put them at a disadvantage running graphics benchmarks against AMD (at the time, it was known as ATI). Fortunately for NVIDIA, those skeptics were overruled and the business risk turned out to be justified.

Another way to look at it: though NVIDIA was weighing a 10% die area risk, technologies like Cell and Larrabee/Xeon Phi, or companies like Ageia and other coprocessor vendors, were incurring a 100% die area risk. They did not have an established market to fall back on if things didn’t work out.

3. GPUs are compellingly faster than the CPU.

Shortly after one of our first, best customers for CUDA received his first CUDA-capable GPU, he contacted NVIDIA with a question. He had gotten a sample workload ported, and, he said, it looked like it was working. The problem? He wanted to know how it could be so fast!

The senior people at NVIDIA had long known GPU performance was going to be amazing. Shortly after I joined NVIDIA in 2002, I had lunch with a senior NVIDIA architect and asked him what he was working on. “NV50,” he said. (Mind you, this conversation occurred before NV30 had taped out.) “It will unify vertex and pixel shader processing. We’ll have room to build a chip with about a teraFLOPS of processing power, but we’ll spend half the area on graphics so it will have peak performance of about 500 GFLOPS.” Later, in an internal company email, the same architect said NV50 was going to “make the CPU look like a toy.”

His prediction turned out to be amazingly accurate, considering it was made four years and two major architectural revisions in advance. NV50 turned into G80, the first CUDA-capable chip, and had 384 GFLOPS of peak performance – within spitting distance of his casual lunchtime conjecture.

Remember that when CUDA first shipped, Intel’s floating point capabilities were much more limited than they are today. The SIMD width was only 128 bits (Sky Lake currently supports 512), and Intel had only recently widened the actual execution unit (singular – modern Intel CPUs have multiple SIMD execution units) to a full 128 bits. Before the Core 2 Duo, one generation after another of Intel CPUs had supported SSE as two micro-ops (“high” and “low”) for the 64-bit-wide execution unit, limiting instruction throughput. In fact, CUDA may have prompted Intel to dramatically improve their floating point capabilities.

Today, it is still true that for suitable workloads, GPUs are compellingly faster than CPUs. Intel has doubled the SIMD width in their processors twice, and also doubled the number of SIMD execution units, but in that time, NVIDIA has increased the number of transistors in their “win” GPU by 30x (from 684M to 21B), with a commensurate increase in performance. NVIDIA GPUs, by the way, still benefit from Dennard scaling because they target much lower clock rates than CPUs. In 2006, G80 ran at <600 MHz, while the latest GPU (V100) runs at 1455 MHz. NVIDIA also has led CPU vendors in advancing their instruction set support, being the first to add FP16 and fused multiply-add support. For these reasons, NVIDIA has held off Intel’s attempts to close the performance gap over the last 10 years.

4. CUDA has a low barrier to entry.

On the hardware side, this point goes hand in hand with how the GPUs already had an established, high-volume market. A CUDA GPU could be had for well under $1000, and as an added bonus you got to play World of Warcraft on a badass gaming card. Later, CUDA GPUs found their way into laptops. Still later, CUDA GPUs can be rented on an hourly basis in the cloud with a credit card.

So the barrier to entry to acquire hardware always has been low. But the same is true of Intel CPUs – they are inexpensive and everywhere. But unlike Intel, who charges for their vectorizing compilers, NVIDIA wisely chose not to charge for the toolchain. CUDA has always been free to download, and NVIDIA has never charged royalties to use it.

It’s hard to beat free, and when it came to hardware, it was hard to beat a GPU. With such a low barrier to entry, it is no wonder developers flocked to it.

5. CUDA is as easy to program as SSE/AVX.

I devote a whole chapter to this point in The CUDA Handbook, but it bears repeating. The portions of an application that are most amenable to CUDA acceleration are, for the most part, the same as for SIMD instruction set optimization. In either case, only a small portion of the application – certainly less than 10%, and in some applications, as little as 2% – needs to be ported to yield a benefit. So the question becomes, which technology gives the biggest return on the engineering investment?

Let’s pause for a moment to reflect on two things. First, Intel had a 10-year head start on NVIDIA in building compilers for their respective target technology (SSE versus CUDA). For Intel, that investment was in vectorizing compilers – compilers that examine scalar code and emit executable code that uses SIMD instructions. Second, despite that head start, that investment has delivered a limited return – partly because, as already mentioned, only small parts of an application actually benefit from SIMD optimizations, but also because vectorizing compilers have never fulfilled their promise. See for example this GDC 2015 presentation by Andreas Fredericksson. The game development company where he works avoids vectorizing compilers because an innocent-seeming change can cause the vectorization to break – a potentially catastrophic setback when most games have to be done in time for the holiday season (“This is what will happen two days before gold.”) Instead, they use compiler intrinsics, which use functions with names like _mm_add_ps()  to operate on special types with names like __m128. With few exceptions, these functions have direct analogs to machine instructions (in the case of _mm_add_ps(), the SSE instruction is ADDPS). From an engineering standpoint, intrinsics enable developers to take advantage of the new instructions without worrying about register allocation, instruction scheduling, or the intricacies of the ABI. (An especial challenge on x86-64.)

In stark contrast, CUDA lets you write scalar-looking code that alludes to the parallelism by referencing built-in variables such as threadIdx and blockIdx. I’d call the memory management issues a wash – in CUDA, you have to allocate and copy to and from device memory, but SIMD instructions have alignment restrictions and do everything 4 or 8 or 16 things at a time in a way that makes it difficult to deal with edge cases. I admit to being biased, but I have written a great deal of both types of code and I consider CUDA at least as easy to target.

6. CUDA has superior performance portability.

Performance portability is the idea that code will not just run correctly, but deliver high performance against a variety of platforms. For CUDA, performance portability within a given GPU generation is a given, as long as applications launch enough thread blocks to saturate the largest GPU. Performance portability across GPU generations is a bit sketchier, but has held up over time. Even features like FMAD (fused multiply-add) were added seamlessly, and always had native compiler support. NVIDIA has changed architectures and instruction sets with high frequency, but masks those architectural differences with a sophisticated mix of driver and compiler software.

On multicore CPUs, developers pursue performance along two axes: multithreading and SIMD. For multithreading, major operating systems have very different operations to manage threads and synchronization. Mutexes, semaphores, and events were all built into Windows; condition variables were in Linux, and added to Windows in Windows Vista. Windows also added reader-writer locks, mutexes that can accommodate multiple threads when the resource is being accessed in a read-only manner. When you add in the instruction-level support for thread synchronization (“interlocked exchange” or “compare and swap” primitives can be used to implement any number of thread synchronization primitives – especially the so-called “lockless” data structures), the number and variety of options for developers is overwhelming. No wonder process-level parallelism (i.e. eschewing threads entirely) has become a popular method of leveraging multicore CPUs!

On the SIMD side, Intel has added instructions about every 2 years, and increased the SIMD width twice since 1999. But software developers can’t immediately use new instructions without qualification. For one thing, since only new CPUs include the new instructions, applications must test which instruction set level is available, and run the corresponding code path. Applications must support “downlevel” hardware that corresponds to the installed base owned by their target users (notably, this calculation is different for a supercomputing data center as opposed to a consumer application such as Photoshop). One interesting data point: CCP, the company that makes the popular online game EVE Online, did not start requiring SSE2 on EVE clients until 2011. SSE2 first became available in 2001!

So for every instruction set innovation – notably AVX, AVX2, and now AVX-512 – new code must be written, along with detection code to ensure the “best” code paths are executed on the various flavors of CPU. If intrinsics are the developer tool of choice, the development burden grows linearly in the number of supported instruction set permutations. If you want both SSE and AVX implementations, you write twice as much code, and so on. But even that understates the burden of supporting a plethora of instruction sets, because we haven’t yet accounted for the QA burden. The QA department can’t get away with just running the code on CPUs that support all of the available instruction sets; they have to make sure the code is tested on CPUs that don’t support all of the target instruction sets. Otherwise, the QA process will overlook bugs in the detection code – the code that decides which code path to run, depending on CPU capabilities. Unless you are testing on hardware that doesn’t support the latest instructions, an SSE2 instruction (say) may find its way into your SSE code paths. And because newer CPUs  also support the older instructions, they will run that buggy code just fine. But on older CPUs, when they encounter the instruction they don’t support, they throw an exception and the application crashes.

Efforts to address the performance portability of multithreading and SIMD have been desultory at best. If you take the intersection of threading primitives across operating systems, you get something that resembles C++’s std::thread – useful only to the simplest of parallel applications. For SIMD, rather than vectorizing compilers, the technologies that offer the best prospect at performance portability are domain-specific languages like Halide – which also has a CUDA implementation.

7. You don’t have to recompile your app.

The siren song of parallel technologies has echoed through the years: “Just recompile your app!” The marketing folks would have you believe that all the latent benefits of parallelism will be laid bare by their magical compilers. The problem is that 95+% of the application won’t benefit at all, so much of that porting effort is for naught. Think about the millions of lines of code in a flagship application from a company like Adobe or Autodesk. Do you really think the engineering manager of such an application is excited at the prospect of having to port and re-test millions of lines of code that implement the user interface, file parsing, and other portions that won’t run any faster? What about interoperability with the installed base of third party plug-ins? The last time mainstream developers undertook full ports of their applications, it was for 64-bit addressing.

With CUDA, developers port the small percentage of an application that can benefit. The rest of the application stays the same. If it runs on systems without CUDA hardware, QA managers have to test both code paths, and make sure to test the variety of CUDA hardware that may run the application. It is nontrivial, but it’s a much smaller pill to swallow than having to recompile the entire application.


There you have it. As a final note, notice that whether the list is prioritized from top to bottom or the other way around, CUDA GPUs’ status as a peripheral (not a CPU) is a central reason they have been so successful.

Warp Synchrony and The First Law of CUDA Development

One of the most overlooked developments of GTC2017 was that NVIDIA’s Architecture Team has finally Had It Up To Here with developers who write warp synchronous code. As you may know, warp synchronous code relies on the way CUDA hardware executes 32-thread warps in lockstep. The CUDA Handbook contains some examples of warp synchronous code. In the reduction chapter, for example, warp synchronous code is used to optimize performance of the last 5 iterations of this loop that accumulates partial sums in shared memory:

for ( int activeThreads = blockDim.x>>1;
          activeThreads >>= 1 ) {
        if ( tid < activeThreads ) {
            sPartials[tid] += sPartials[tid+activeThreads];

Notice that every iteration of the loop is accompanied by a call to __syncthreads(), the intrinsic that serves as block synchronization primitive and memory barrier. The unrolled, warp synchronous implementation of the last 5 iterations looks like this:

if ( threadIdx.x < 32 ) {
    volatile int *wsSum = sPartials;
    if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32];
        wsSum[tid] += wsSum[tid + 16];
        wsSum[tid] += wsSum[tid + 8];
        wsSum[tid] += wsSum[tid + 4];
        wsSum[tid] += wsSum[tid + 2];
        wsSum[tid] += wsSum[tid + 1];
    if ( tid == 0 ) {
        volatile int *wsSum = sPartials;
        out[blockIdx.x] = wsSum[0];

The volatile keyword represents NVIDIA’s grudging acceptance of warp synchronous code. Historically, volatile is a keyword that hints to the compiler not to optimize out memory traffic through the associated pointer. The classic application is for device drivers for hardware with memory-mapped hardware registers, where reads and writes to “memory” are used to program the hardware. But volatile doesn’t give the compiler enough information; although it inhibits optimizations such as reusing registers or conserving memory writes, it’s not expressive enough to capture the synchronization semantics required when threads within a warp can diverge.

As a result, with Volta’s improved support for divergent code execution, NVIDIA is giving up on the volatile keyword workaround and deprecating all warp-level primitives. Instead, developers are encouraged to use new intrinsics with “_sync” appended. So instead of calling any(), the function that returns True if the input predicate expression is true for any of the 32 threads in the warp, we are to call any_sync().  The new function may be invoked on older hardware, and I suspect they are synonyms for the older functions; but on Volta, it likely will enforce semantics that converge execution across the warp.

After listening to the presentation at GTC, I sought out an NVIDIAn and told them that CUDA developers have always known that warp synchronous coding wasn’t strictly correct. NVIDIA has been finger-waggling at CUDA developers who write warp synchronous code for years! To gain some insight into why developers do it anyway, we turn our attention to a completely unscientific survey of developers where they were asked why they write CUDA code: FirstLaw_1Figure 1. Motivations for CUDA Development

I call this the First Law of CUDA Development: Performance is CUDA’s raison d’être. No one writes CUDA code for fun. Every CUDA user is trying to get a return on investment in the form of higher application performance. The reason developers write warp synchronous code even though it’s the “wrong” thing to do is because it is faster. Put another way, sprinkling __syncthreads() calls that turn out to be superfluous is… well… slower. (A subtler implication is that if the behavior does not change, it is harder for developers to tell which __syncthreads() calls are superfluous). Developers always want to do the right thing, I told the NVIDIAn; but ultimately, if you want developers doing the right thing, you have to make the right thing also be the fastest thing.

During the course of the conversation, the NVIDIAn defended the idea that they should break warp synchronous code in the future: “If I warn you to look both ways before you cross the road, don’t blame me if you get hit by a car.” I told him: “If that is your position, it’s your responsibility to make sure that developers who don’t look both ways ALWAYS get hit by a car.”

Why does CUDA CUdeviceptr use unsigned int instead of void?

This question on StackExchange was put on hold as primarily opinion-based: “…answers to this question will tend to be almost entirely based on opinions, rather than facts, references, or specific expertise.”

The content of StackExchange is usually high quality, but in this case, while the design decision was based on opinion, the answer to the question needn’t be… you just need to ask the people who know! And the inimitable talonmies, who is poised to crack 30k on StackExchange’s points-based reputation system, compounded the problem by saying that CUdeviceptr is a handle to a device memory allocation, not a pointer.

I don’t think I have ever seen talonmies give an incorrect answer before; but in this case, he’s off the mark. CUdeviceptr always has represented a pointer in the CUDA address space. In fact, though it was frowned upon to mix driver API and CUDA runtime code, even in CUDA 1.0 you could transform between CUDART’s void * and the driver API’s CUdeviceptr by writing something like:

void *p;
CUdeviceptr dptr;
p = (void *) (uintptr_t) dptr;
dptr = (CUdeviceptr) (uintptr_t) p;

We could have made device pointers void *, but there was a desire to make it easy for compilers to distinguish between host and device pointers at compile time instead of runtime. Furthermore, SM 1.x hardware only supported 32-bit pointers, so using void * would have created a difference in pointer size on 64-bit host platforms. It’s a long-distant memory now, since so much great compiler work has gone into CUDA since then, but at the time “pointer-squashing” (having CUDA’s compiler transform 64-bit pointers into 32-bit pointers on 64-bit host systems) was a big issue in early versions of CUDA.

For the record, not making the driver API’s device pointer type void * is one of my bigger regrets about early CUDA development. It took months to refactor the driver to support 64-bit device pointers when hardware support for that feature became available in SM 2.x class hardware.

In fact, some weeks before we released CUDA 1.0, we had a meeting and a serious discussion about replacing CUdeviceptr with void *, and decided not to take the schedule hit. We weren’t going to let perfect be the end of done, and we paid the price later.

While we’re on the topic of regrettable design decisions in early CUDA, I wish I had done a search-and-replace to convert cuFunction to cuKernel, and put cuLaunchKernel in the first release (in place of the stateful, chatty and not-thread-safe cuParamSet* family of functions). But we had scant engineering resources to spend on fit and finish, a constraint that is no less true for CUDA than for many other successful software projects in history.

Floating point: CPU and GPU differences

Some time ago, I wrote this in response to a StackOverflow question, but wanted to share here on the blog.

The question basically asked how you could make a floating point operation the same between the CPU and the GPU, and here is an updated version of the answer:

There are many reasons why it is not realistic to expect the same results from floating point computations run on the CPU and GPU. It’s much stronger than that: you can’t assume that FP results will be the same when the same source code is compiled against a different target architecture (e.g. x86 or x64) or with different optimization levels, either.

In fact, if your code is multithreaded and the FP operations are performed in different orders from one run to the next, then the EXACT SAME EXECUTABLE running on the EXACT SAME SYSTEM may produce slightly different results from one run to the next.

Some of the reasons include, but are not limited to:

  • floating point operations are not associative, so seemingly-benign reorderings (such as the race conditions from multithreading mentioned above) can change results;
  • different architectures support different levels of precision and rounding under different conditions (i.e. compiler flags, control word versus per instruction);
  • different compilers interpret the language standards differently, and
  • some architectures support FMAD (fused multiply-add) and some do not.

Note that for purposes of this discussion, the JIT compilers for CUDA (the magic that enables PTX code to be future-proof to GPU architectures that are not yet available) certainly should be expected to perturb FP results.

You have to write FP code that is robust despite the foregoing.

As I write this today, I believe that CUDA GPUs have a much better-designed architecture for floating point arithmetic than any contemporary CPU. GPUs include native IEEE standard (c. 2008) support for 16-bit floats and FMAD, have full-speed support for denormals, and enable rounding control on a per-instruction basis rather than control words whose settings have side effects on all FP instructions and are expensive to change.

In contrast, CPUs have an excess of per-thread state and poor performance except when using SIMD instructions, which mainstream compilers are terrible at exploiting for performance (since vectorizing scalar C code to take advantage of such instruction sets is much more difficult than building a compiler for a pseudo-scalar architecture such as CUDA). And if the wikipedia History page is to be believed, Intel and AMD appear to have completely botched the addition of FMAD support in a way that defies description.

You can find an excellent discussion of floating point precision and IEEE support in NVIDIA GPUs here.

Final Manuscript Submitted

I submitted the final manuscript last Friday, and thought I would reflect briefly on how it aligns with my original goals.

I’ve wanted write a book on CUDA for years. Until I left NVIDIA, I was just too busy building CUDA to work on it. So when it came time to write up a proposal, I’d been thinking about the subject matter and the organization for some time.

One of the exercises that authors undertake (and that editors demand as part of a proposal) is a competitive analysis: What books already exist that address the topic? How will yours be different? Why would someone buy your book instead of one of those other books? I knew I wanted my book to be comprehensive, covering topics like the driver API, all CUDA hardware, and all CUDA-capable platforms. I wanted it to cover both software abstractions (like streams and events) and how to write CUDA kernels. And as I weighed those aspirations and put together outlines and looked at the competitive landscape, it became clear to me: Writing a book on CUDA is hard.

When I said that to Ian Buck, he got an expression like a whipped puppy. “But why?” he asked. Hearing that anything about CUDA is hard upsets Ian because he believes the key to CUDA’s success has been simplicity. I told him, “Because in order to cover a topic correctly, I wind up explaining the same thing more than once, from different perspectives.” That is true, and is reflected in my book; but even books that cover only the CUDA runtime and the latest CUDA hardware can’t get away from the fundamental difficulty of CUDA programming: performance optimization is hard, and no one uses CUDA because it is cute and cuddly. People only ever use CUDA because it can run their applications faster. The reason CUDA programmers worry about performance bottlenecks and implementation details is because those considerations are weighed by developers engaged in low-level optimization, whether on GPUs or CPUs.

In looking at the landscape, I saw a lot of books that presented a lot of material on CUDA, but hadn’t imposed much of an organizational structure. So my book is organized into three parts. Part I gives overviews of the hardware, the software, and the operating environment; Part II (“Details”) gives in-depth descriptions of various CUDA abstractions, like memory, streams and events, and texturing; Part III (“Select Applications”) covers the full gamut of classes of CUDA application: streaming workloads (where PCI Express transfer overhead figures prominently), key parallel algorithms Reduction and Scan, an illustrative compute-intensive workload (N-body, the anti-streaming workload), and an image processing workload that combines texturing and shared memory for performance.

The source code accompanying Part II consists almost entirely of microdemos and microbenchmarks, while the source code accompanying Part III consists of several implementations of the same exact operation, with different tradeoffs in performance and complexity. Some of the microbenchmarks in Part II are intended to be reused (plug your own kernel into a makework kernel, for example), while reuse of code from Part III may be trickier. I did what I could, but reuse of application-specific code is intrinsically more difficult.

The code in the Streaming Workloads chapter can definitely intended be reused – just replace the SAXPY kernel with a different kernel, the more compute-intensive the better. Even calculations as complex as Black-Scholes options computation are transfer-bound on CUDA hardware*.

With Scan, one problem that hinders reuse is that NVIDIA has added numerous primitives that make Scan more efficient (like syncthreads_count() and warp shuffle). For a book that aspires to cover all CUDA hardware (all the way back to the seminal SM 1.x, “Tesla” hardware), that presents a challenge. For another, Scan has so many variants (inclusive/exclusive, segmented scan, predicate scan) that covering all permutations just of the basic primitive requires a lot of space, without covering any applications like Radix Sort or stream compaction. And if you try to cover all those permutations in the source code, you wind up with code that bears a disturbingly close resemblance to Thrust, except that Thrust was written by NVIDIA employees with a much more intimate understanding of modern C++ programming idioms than yours truly.

I will say this: The Scan chapter does a better job of covering the topic than any other book I have seen. I could have covered more, but then again, you could devote 100+ pages to the topic without covering everything.

For N-body, the fastest implementation just stages tiles into shared memory to make the data available to the SMs with lower latency; but after reviewing the literature on computationally-dense workloads, I saw an opportunity to broaden coverage beyond that. For one thing, some applications need shared memory for other purposes than a read-only, software-managed cache. The Direct Coulomb Summation code, described by Stone et al. and that is covered in detail in Kirk & Hwu’s Programming Massively Parallel Processors, stages read-only atom descriptions through constant memory, working around the 64K constant memory limit by doing the computation on 4,000 16-byte atoms at a time. So my book has an implementation that mirrors that strategy, even though it is slightly slower than the shared memory formulation.

A colleague pointed out that some N-body computations – in particular, the ones in the AMBER molecular modeling code – exploit symmetry of forces. So I spent some time exploring ways to take advantage of the symmetry of gravitational forces, without much success. One problem is that the gravitational computation is so lightweight that doing twice as many is faster than saving the work! AMBER does its calculations on 32×32 tiles (to correspond to CUDA’s warp size), and the source code on github includes an implementation that mirrors that strategy. It is sufficiently slower that I don’t even cover the source code in the book; as a compromise, I describe the strategy in the beginning of the chapter that gives an overview of the computation.

I’m cautiously optimistic that some method of exploiting symmetric forces will bear fruit, even for fairly lightweight calculations, but I wasn’t going to let that perfect be the enemy of the book’s done.

There are a couple of self-indulgent topics in the book: float->half conversion, normalized correlation, an SSE-optimized N-body implementation. But those were all included for good reasons. I still think the float->half conversion (which may seem out of place in the Streaming Multiprocessors chapter) is one of the best ways for developers to learn about floating point precision and rounding. Normalized correlation is the only application in the book that combines texturing and shared memory. And the SSE-optimized N-body implementation gives a stark illustration of how much easier CUDA is to program, while still yielding better performance. Even when biasing the results in favor of the CPU, N-body is still 8x faster on GK104 then on a dual-socket Xeon E2670 machine. (Porting to AVX might double performance on the CPU side, but by the same token, running on a GK110 might double performance on the GPU side, and using a faster GPU is a whole lot less software work than porting to AVX.) The reported result may not make NVIDIA happy (since I am reporting an 8x improvement, not 400x) and may not make Intel happy (since it underscores the difficulty of SIMD coding), but I think it puts CUDA in a positive light.

The book can be preordered on

* Black-Scholes was the workload that I used to prove out mapped pinned memory when we added it for the chipset team in CUDA 2.2, and we were pleasantly surprised to discover that GT200 also got a big benefit. (The architect who’d invested a lot of effort into making sure GT200 would be good at system memory rendering was gratified, but not surprised.)