Ten Years Later: CUDA Succeeded Despite…

After posting a list of reasons why CUDA succeeded, it seems worthwhile to reflect on some of its apparent vulnerabilities, and why CUDA has been successful despite those issues.

CUDA Succeeded Despite…

1. Being Proprietary.

NVIDIA builds the hardware and software to run CUDA applications and has never licensed the technology to anyone else. Conventional wisdom in the industry holds that proprietary software technologies are doomed to failure – they don’t get shepherded well by a single owner, and they don’t gain adoption by developers. But by making CUDA software portable to everything from Linux to Windows to MacOS, and making CUDA hardware available in a broad range of products from SOCs (Tegra) to high end servers (DGX-1), NVIDIA has staved off the risks they incurred by going it alone.

2. Explicit Memory Management.

It’s every new CUDA programmer’s rite of passage: As if allocating and copying input and output data to and from device memory weren’t enough trouble, developers also explicitly manage shared memory to facilitate data interchange between threads.

Fortunately for NVIDIA, due to the First Law of CUDA Development, developers haven’t been fazed by the need to learn these idiosyncrasies.

3. Limited Cache Coherency.

Some rules of thumb have been internalized by hardware designers to such a degree that they are not so much sound engineering practices, but religious edicts. One such rule is that caches have to be coherent. All the time. In hardware.

But CUDA is pervaded by violations of this tenet. Device memory is not coherent with host memory. Shared memory effectively resides in a separate address space, so isn’t coherent in the same sense as an L1 cache. Constant and texture memory are not coherent with device memory, and when changes are made to the memory, the illusion of coherence is maintained via software invalidation. As with explicit memory management, developers are willing to treat the lack of cache coherency as a cost of doing business – as long as they get the performance they crave.

4. Limited PC market share.

Discrete GPUs only occupy about 25% of PC market share by unit volume, and NVIDIA competes with AMD in that space. NVIDIA’s limited market share helps explain why CUDA has had limited success achieving developer adoption in packaged PC software, even when there’s a good fit with the software requirements.

Put yourself in the shoes of an engineering director at (say) Adobe. “Port this code to CUDA,” says NVIDIA, “and it will run much faster… on 18% of your potential customers’ machines.” Even that proposition is sketchy when accounting for the costs and benefits of supporting the full range of CUDA GPUs extant.

But for vertical applications (think HPC), CUDA developers build data centers with thousands of identical servers. And for embedded applications (think automotive), every GPU in a given design win has identical properties. In both cases, developers have a fixed hardware target to develop against, and they get a compelling return on the engineering investment of the CUDA port.

In the longer term, companies like Adobe and Autodesk should be able to gain the same benefits by transitioning to cloud-provisioned GPU platforms.

Ten Years Later: Why CUDA Succeeded

CUDA first became available about 10 years ago, so it seems like a good time to take note of its success and reflect on why it has been successful.

1. GPUs are not CPUs.

What I mean by this is not just that you don’t have to recompile your app (this point gets its own bullet later in this article), but that core operating system changes are not needed for GPU support. GPUs are complicated peripherals, but when the rubber meets the road, they are still just peripherals. They hang off the bus, get enumerated by the OS, get a driver loaded, and go. Proponents of competing technologies such as the Cell processor or Larrabee (now Xeon Phi) would have you believe otherwise, but GPUs have been served well by the flexibility and platform portability that comes with being a “dumb peripheral.”

2. GPUs are everywhere.

Jensen Huang has said the GPU had a “day job.” NVIDIA had an established, high-volume market for their ASICs. The overlap in requirements between a big, fast graphics chip and a general-purpose manycore processor was significant, but it wasn’t obvious to all that the incremental cost would be worth it. I personally had lunchtime arguments with senior graphics architects at NVIDIA who didn’t want to spend 10% die area on compute (the estimated hardware cost of adding support for scatter/gather and shared memory) because it would put them at a disadvantage running graphics benchmarks against AMD (at the time, it was known as ATI). Fortunately for NVIDIA, those skeptics were overruled and the business risk turned out to be justified.

Another way to look at it: though NVIDIA was weighing a 10% die area risk, technologies like Cell and Larrabee/Xeon Phi, or companies like Ageia and other coprocessor vendors, were incurring a 100% die area risk. They did not have an established market to fall back on if things didn’t work out.

3. GPUs are compellingly faster than the CPU.

Shortly after one of our first, best customers for CUDA received his first CUDA-capable GPU, he contacted NVIDIA with a question. He had gotten a sample workload ported, and, he said, it looked like it was working. The problem? He wanted to know how it could be so fast!

The senior people at NVIDIA had long known GPU performance was going to be amazing. Shortly after I joined NVIDIA in 2002, I had lunch with a senior NVIDIA architect and asked him what he was working on. “NV50,” he said. (Mind you, this conversation occurred before NV30 had taped out.) “It will unify vertex and pixel shader processing. We’ll have room to build a chip with about a teraFLOPS of processing power, but we’ll spend half the area on graphics so it will have peak performance of about 500 GFLOPS.” Later, in an internal company email, the same architect said NV50 was going to “make the CPU look like a toy.”

His prediction turned out to be amazingly accurate, considering it was made four years and two major architectural revisions in advance. NV50 turned into G80, the first CUDA-capable chip, and had 384 GFLOPS of peak performance – within spitting distance of his casual lunchtime conjecture.

Remember that when CUDA first shipped, Intel’s floating point capabilities were much more limited than they are today. The SIMD width was only 128 bits (Sky Lake currently supports 512), and Intel had only recently widened the actual execution unit (singular – modern Intel CPUs have multiple SIMD execution units) to a full 128 bits. Before the Core 2 Duo, one generation after another of Intel CPUs had supported SSE as two micro-ops (“high” and “low”) for the 64-bit-wide execution unit, limiting instruction throughput. In fact, CUDA may have prompted Intel to dramatically improve their floating point capabilities.

Today, it is still true that for suitable workloads, GPUs are compellingly faster than CPUs. Intel has doubled the SIMD width in their processors twice, and also doubled the number of SIMD execution units, but in that time, NVIDIA has increased the number of transistors in their “win” GPU by 30x (from 684M to 21B), with a commensurate increase in performance. NVIDIA GPUs, by the way, still benefit from Dennard scaling because they target much lower clock rates than CPUs. In 2006, G80 ran at <600 MHz, while the latest GPU (V100) runs at 1455 MHz. NVIDIA also has led CPU vendors in advancing their instruction set support, being the first to add FP16 and fused multiply-add support. For these reasons, NVIDIA has held off Intel’s attempts to close the performance gap over the last 10 years.

4. CUDA has a low barrier to entry.

On the hardware side, this point goes hand in hand with how the GPUs already had an established, high-volume market. A CUDA GPU could be had for well under $1000, and as an added bonus you got to play World of Warcraft on a badass gaming card. Later, CUDA GPUs found their way into laptops. Still later, CUDA GPUs can be rented on an hourly basis in the cloud with a credit card.

So the barrier to entry to acquire hardware always has been low. But the same is true of Intel CPUs – they are inexpensive and everywhere. But unlike Intel, who charges for their vectorizing compilers, NVIDIA wisely chose not to charge for the toolchain. CUDA has always been free to download, and NVIDIA has never charged royalties to use it.

It’s hard to beat free, and when it came to hardware, it was hard to beat a GPU. With such a low barrier to entry, it is no wonder developers flocked to it.

5. CUDA is as easy to program as SSE/AVX.

I devote a whole chapter to this point in The CUDA Handbook, but it bears repeating. The portions of an application that are most amenable to CUDA acceleration are, for the most part, the same as for SIMD instruction set optimization. In either case, only a small portion of the application – certainly less than 10%, and in some applications, as little as 2% – needs to be ported to yield a benefit. So the question becomes, which technology gives the biggest return on the engineering investment?

Let’s pause for a moment to reflect on two things. First, Intel had a 10-year head start on NVIDIA in building compilers for their respective target technology (SSE versus CUDA). For Intel, that investment was in vectorizing compilers – compilers that examine scalar code and emit executable code that uses SIMD instructions. Second, despite that head start, that investment has delivered a limited return – partly because, as already mentioned, only small parts of an application actually benefit from SIMD optimizations, but also because vectorizing compilers have never fulfilled their promise. See for example this GDC 2015 presentation by Andreas Fredericksson. The game development company where he works avoids vectorizing compilers because an innocent-seeming change can cause the vectorization to break – a potentially catastrophic setback when most games have to be done in time for the holiday season (“This is what will happen two days before gold.”) Instead, they use compiler intrinsics, which use functions with names like _mm_add_ps()  to operate on special types with names like __m128. With few exceptions, these functions have direct analogs to machine instructions (in the case of _mm_add_ps(), the SSE instruction is ADDPS). From an engineering standpoint, intrinsics enable developers to take advantage of the new instructions without worrying about register allocation, instruction scheduling, or the intricacies of the ABI. (An especial challenge on x86-64.)

In stark contrast, CUDA lets you write scalar-looking code that alludes to the parallelism by referencing built-in variables such as threadIdx and blockIdx. I’d call the memory management issues a wash – in CUDA, you have to allocate and copy to and from device memory, but SIMD instructions have alignment restrictions and do everything 4 or 8 or 16 things at a time in a way that makes it difficult to deal with edge cases. I admit to being biased, but I have written a great deal of both types of code and I consider CUDA at least as easy to target.

6. CUDA has superior performance portability.

Performance portability is the idea that code will not just run correctly, but deliver high performance against a variety of platforms. For CUDA, performance portability within a given GPU generation is a given, as long as applications launch enough thread blocks to saturate the largest GPU. Performance portability across GPU generations is a bit sketchier, but has held up over time. Even features like FMAD (fused multiply-add) were added seamlessly, and always had native compiler support. NVIDIA has changed architectures and instruction sets with high frequency, but masks those architectural differences with a sophisticated mix of driver and compiler software.

On multicore CPUs, developers pursue performance along two axes: multithreading and SIMD. For multithreading, major operating systems have very different operations to manage threads and synchronization. Mutexes, semaphores, and events were all built into Windows; condition variables were in Linux, and added to Windows in Windows Vista. Windows also added reader-writer locks, mutexes that can accommodate multiple threads when the resource is being accessed in a read-only manner. When you add in the instruction-level support for thread synchronization (“interlocked exchange” or “compare and swap” primitives can be used to implement any number of thread synchronization primitives – especially the so-called “lockless” data structures), the number and variety of options for developers is overwhelming. No wonder process-level parallelism (i.e. eschewing threads entirely) has become a popular method of leveraging multicore CPUs!

On the SIMD side, Intel has added instructions about every 2 years, and increased the SIMD width twice since 1999. But software developers can’t immediately use new instructions without qualification. For one thing, since only new CPUs include the new instructions, applications must test which instruction set level is available, and run the corresponding code path. Applications must support “downlevel” hardware that corresponds to the installed base owned by their target users (notably, this calculation is different for a supercomputing data center as opposed to a consumer application such as Photoshop). One interesting data point: CCP, the company that makes the popular online game EVE Online, did not start requiring SSE2 on EVE clients until 2011. SSE2 first became available in 2001!

So for every instruction set innovation – notably AVX, AVX2, and now AVX-512 – new code must be written, along with detection code to ensure the “best” code paths are executed on the various flavors of CPU. If intrinsics are the developer tool of choice, the development burden grows linearly in the number of supported instruction set permutations. If you want both SSE and AVX implementations, you write twice as much code, and so on. But even that understates the burden of supporting a plethora of instruction sets, because we haven’t yet accounted for the QA burden. The QA department can’t get away with just running the code on CPUs that support all of the available instruction sets; they have to make sure the code is tested on CPUs that don’t support all of the target instruction sets. Otherwise, the QA process will overlook bugs in the detection code – the code that decides which code path to run, depending on CPU capabilities. Unless you are testing on hardware that doesn’t support the latest instructions, an SSE2 instruction (say) may find its way into your SSE code paths. And because newer CPUs  also support the older instructions, they will run that buggy code just fine. But on older CPUs, when they encounter the instruction they don’t support, they throw an exception and the application crashes.

Efforts to address the performance portability of multithreading and SIMD have been desultory at best. If you take the intersection of threading primitives across operating systems, you get something that resembles C++’s std::thread – useful only to the simplest of parallel applications. For SIMD, rather than vectorizing compilers, the technologies that offer the best prospect at performance portability are domain-specific languages like Halide – which also has a CUDA implementation.

7. You don’t have to recompile your app.

The siren song of parallel technologies has echoed through the years: “Just recompile your app!” The marketing folks would have you believe that all the latent benefits of parallelism will be laid bare by their magical compilers. The problem is that 95+% of the application won’t benefit at all, so much of that porting effort is for naught. Think about the millions of lines of code in a flagship application from a company like Adobe or Autodesk. Do you really think the engineering manager of such an application is excited at the prospect of having to port and re-test millions of lines of code that implement the user interface, file parsing, and other portions that won’t run any faster? What about interoperability with the installed base of third party plug-ins? The last time mainstream developers undertook full ports of their applications, it was for 64-bit addressing.

With CUDA, developers port the small percentage of an application that can benefit. The rest of the application stays the same. If it runs on systems without CUDA hardware, QA managers have to test both code paths, and make sure to test the variety of CUDA hardware that may run the application. It is nontrivial, but it’s a much smaller pill to swallow than having to recompile the entire application.

 

There you have it. As a final note, notice that whether the list is prioritized from top to bottom or the other way around, CUDA GPUs’ status as a peripheral (not a CPU) is a central reason they have been so successful.

Warp Synchrony and The First Law of CUDA Development

One of the most overlooked developments of GTC2017 was that NVIDIA’s Architecture Team has finally Had It Up To Here with developers who write warp synchronous code. As you may know, warp synchronous code relies on the way CUDA hardware executes 32-thread warps in lockstep. The CUDA Handbook contains some examples of warp synchronous code. In the reduction chapter, for example, warp synchronous code is used to optimize performance of the last 5 iterations of this loop that accumulates partial sums in shared memory:

for ( int activeThreads = blockDim.x>>1;
          activeThreads;
          activeThreads >>= 1 ) {
        if ( tid < activeThreads ) {
            sPartials[tid] += sPartials[tid+activeThreads];
        }
    __syncthreads();
}

Notice that every iteration of the loop is accompanied by a call to __syncthreads(), the intrinsic that serves as block synchronization primitive and memory barrier. The unrolled, warp synchronous implementation of the last 5 iterations looks like this:

if ( threadIdx.x < 32 ) {
    volatile int *wsSum = sPartials;
    if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32];
        wsSum[tid] += wsSum[tid + 16];
        wsSum[tid] += wsSum[tid + 8];
        wsSum[tid] += wsSum[tid + 4];
        wsSum[tid] += wsSum[tid + 2];
        wsSum[tid] += wsSum[tid + 1];
    if ( tid == 0 ) {
        volatile int *wsSum = sPartials;
        out[blockIdx.x] = wsSum[0];
    }
}

The volatile keyword represents NVIDIA’s grudging acceptance of warp synchronous code. Historically, volatile is a keyword that hints to the compiler not to optimize out memory traffic through the associated pointer. The classic application is for device drivers for hardware with memory-mapped hardware registers, where reads and writes to “memory” are used to program the hardware. But volatile doesn’t give the compiler enough information; although it inhibits optimizations such as reusing registers or conserving memory writes, it’s not expressive enough to capture the synchronization semantics required when threads within a warp can diverge.

As a result, with Volta’s improved support for divergent code execution, NVIDIA is giving up on the volatile keyword workaround and deprecating all warp-level primitives. Instead, developers are encouraged to use new intrinsics with “_sync” appended. So instead of calling any(), the function that returns True if the input predicate expression is true for any of the 32 threads in the warp, we are to call any_sync().  The new function may be invoked on older hardware, and I suspect they are synonyms for the older functions; but on Volta, it likely will enforce semantics that converge execution across the warp.

After listening to the presentation at GTC, I sought out an NVIDIAn and told them that CUDA developers have always known that warp synchronous coding wasn’t strictly correct. NVIDIA has been finger-waggling at CUDA developers who write warp synchronous code for years! To gain some insight into why developers do it anyway, we turn our attention to a completely unscientific survey of developers where they were asked why they write CUDA code: FirstLaw_1Figure 1. Motivations for CUDA Development

I call this the First Law of CUDA Development: Performance is CUDA’s raison d’être. No one writes CUDA code for fun. Every CUDA user is trying to get a return on investment in the form of higher application performance. The reason developers write warp synchronous code even though it’s the “wrong” thing to do is because it is faster. Put another way, sprinkling __syncthreads() calls that turn out to be superfluous is… well… slower. (A subtler implication is that if the behavior does not change, it is harder for developers to tell which __syncthreads() calls are superfluous). Developers always want to do the right thing, I told the NVIDIAn; but ultimately, if you want developers doing the right thing, you have to make the right thing also be the fastest thing.

During the course of the conversation, the NVIDIAn defended the idea that they should break warp synchronous code in the future: “If I warn you to look both ways before you cross the road, don’t blame me if you get hit by a car.” I told him: “If that is your position, it’s your responsibility to make sure that developers who don’t look both ways ALWAYS get hit by a car.”

Why does CUDA CUdeviceptr use unsigned int instead of void?

This question on StackExchange was put on hold as primarily opinion-based: “…answers to this question will tend to be almost entirely based on opinions, rather than facts, references, or specific expertise.”

The content of StackExchange is usually high quality, but in this case, while the design decision was based on opinion, the answer to the question needn’t be… you just need to ask the people who know! And the inimitable talonmies, who is poised to crack 30k on StackExchange’s points-based reputation system, compounded the problem by saying that CUdeviceptr is a handle to a device memory allocation, not a pointer.

I don’t think I have ever seen talonmies give an incorrect answer before; but in this case, he’s off the mark. CUdeviceptr always has represented a pointer in the CUDA address space. In fact, though it was frowned upon to mix driver API and CUDA runtime code, even in CUDA 1.0 you could transform between CUDART’s void * and the driver API’s CUdeviceptr by writing something like:

void *p;
CUdeviceptr dptr;
p = (void *) (uintptr_t) dptr;
dptr = (CUdeviceptr) (uintptr_t) p;

We could have made device pointers void *, but there was a desire to make it easy for compilers to distinguish between host and device pointers at compile time instead of runtime. Furthermore, SM 1.x hardware only supported 32-bit pointers, so using void * would have created a difference in pointer size on 64-bit host platforms. It’s a long-distant memory now, since so much great compiler work has gone into CUDA since then, but at the time “pointer-squashing” (having CUDA’s compiler transform 64-bit pointers into 32-bit pointers on 64-bit host systems) was a big issue in early versions of CUDA.

For the record, not making the driver API’s device pointer type void * is one of my bigger regrets about early CUDA development. It took months to refactor the driver to support 64-bit device pointers when hardware support for that feature became available in SM 2.x class hardware.

In fact, some weeks before we released CUDA 1.0, we had a meeting and a serious discussion about replacing CUdeviceptr with void *, and decided not to take the schedule hit. We weren’t going to let perfect be the end of done, and we paid the price later.

While we’re on the topic of regrettable design decisions in early CUDA, I wish I had done a search-and-replace to convert cuFunction to cuKernel, and put cuLaunchKernel in the first release (in place of the stateful, chatty and not-thread-safe cuParamSet* family of functions). But we had scant engineering resources to spend on fit and finish, a constraint that is no less true for CUDA than for many other successful software projects in history.

Floating point: CPU and GPU differences

Some time ago, I wrote this in response to a StackOverflow question, but wanted to share here on the blog.

The question basically asked how you could make a floating point operation the same between the CPU and the GPU, and here is an updated version of the answer:

There are many reasons why it is not realistic to expect the same results from floating point computations run on the CPU and GPU. It’s much stronger than that: you can’t assume that FP results will be the same when the same source code is compiled against a different target architecture (e.g. x86 or x64) or with different optimization levels, either.

In fact, if your code is multithreaded and the FP operations are performed in different orders from one run to the next, then the EXACT SAME EXECUTABLE running on the EXACT SAME SYSTEM may produce slightly different results from one run to the next.

Some of the reasons include, but are not limited to:

  • floating point operations are not associative, so seemingly-benign reorderings (such as the race conditions from multithreading mentioned above) can change results;
  • different architectures support different levels of precision and rounding under different conditions (i.e. compiler flags, control word versus per instruction);
  • different compilers interpret the language standards differently, and
  • some architectures support FMAD (fused multiply-add) and some do not.

Note that for purposes of this discussion, the JIT compilers for CUDA (the magic that enables PTX code to be future-proof to GPU architectures that are not yet available) certainly should be expected to perturb FP results.

You have to write FP code that is robust despite the foregoing.

As I write this today, I believe that CUDA GPUs have a much better-designed architecture for floating point arithmetic than any contemporary CPU. GPUs include native IEEE standard (c. 2008) support for 16-bit floats and FMAD, have full-speed support for denormals, and enable rounding control on a per-instruction basis rather than control words whose settings have side effects on all FP instructions and are expensive to change.

In contrast, CPUs have an excess of per-thread state and poor performance except when using SIMD instructions, which mainstream compilers are terrible at exploiting for performance (since vectorizing scalar C code to take advantage of such instruction sets is much more difficult than building a compiler for a pseudo-scalar architecture such as CUDA). And if the wikipedia History page is to be believed, Intel and AMD appear to have completely botched the addition of FMAD support in a way that defies description.

You can find an excellent discussion of floating point precision and IEEE support in NVIDIA GPUs here.