When the time came for us to gather requirements and start designing CUDA 2.0 (c. 2008), I was surprised to hear from our product manager that the key Next Biggest Thing in CUDA was… drum roll… 3D textures[1]!
This priority was surprising because adding texturing in the first place had been somewhat controversial. Why have graphics features in a general purpose computing ecosystem[2]?
Early versions of CUDA hardware did not have caches to mitigate the overhead of uncoalesced memory traffic, though, so applications that really needed misaligned reads had no choice but to use the texturing hardware. As an added bonus, the cache resources were separate, so by reading inputs via the texture pipeline, you were enlisting hardware resources that otherwise would have gone unused.
But… 3D texturing? Okay. That would require us to implement extensive, backward-compatible revisions to all of the CUDA array APIs, the texturing APIs, and the memcpy APIs. CUDA 1.0 already had a plethora of memcpy functions, especially at the driver API level, where we’d opted to try to maintain type safety between CPU and GPU pointers.
Adding new functionality in a backward compatible way is a long-held tradition in API design. For example, flags words are a good way for interface designers to future-proof their designs; or, in a pinch, replacement functions may be provided that enable new functionality. Since those functions did not exist when legacy applications were being developed, the interface designer knows they can only be called by newer applications expecting the new behavior. Sometimes we get both, even in close quarters: the driver API function cuEventCreate() takes a flags word that later was used for innovations such as blocking synchronization, but whoever wrapped the driver API in the CUDA Runtime omitted it from cudaEventCreate(), so NVIDIA later added the replacement interface with the awkward name cudaEventCreateWithFlags()!
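By way of illustration, here is a minimal sketch exercising the entry points named above (the wrapper function is hypothetical; error handling is omitted, and the driver API call assumes cuInit() and a current context):

#include <cuda.h>
#include <cuda_runtime.h>

/* A minimal sketch of the two idioms described above. */
void createEvents( void )
{
    CUevent drvEvent;
    cudaEvent_t rtEvent, rtBlockingEvent;

    /* Driver API: blocking synchronization is requested through the
       flags word that cuEventCreate() takes */
    cuEventCreate( &drvEvent, CU_EVENT_BLOCKING_SYNC );

    /* CUDA Runtime: the original wrapper took no flags word... */
    cudaEventCreate( &rtEvent );

    /* ...so the replacement interface had to be added to expose the
       new behavior */
    cudaEventCreateWithFlags( &rtBlockingEvent, cudaEventBlockingSync );
}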
API designers have to be careful with this sort of thing because applications come to rely on existing behaviors, a lesson that most interface designers learn through hard experience. Mine was acquired at Microsoft in the multimedia division…
Aside: How To Compatible
When I was Direct3D development lead in 1998 or so, we worked with Intel to enable a new software rasterizer that benefited from the new MMX instruction set. We had to expose it as a new rasterizer, rather than simply swapping in a faster implementation, because applications subtly broke when using it; no matter how hard Intel tried, the new rasterizer could not be made to behave identically to the old, slower, serial RGB rasterizer. So, we added the MMX rasterizer as an option; that way, we had both the new MMX rasterizer and backward compatibility (old applications would continue to work). The only downside was that newer applications had to specifically ask for it, but Intel was okay with that, as a measure of accountability for not being able to develop a drop-in replacement for the RGB rasterizer. Such compatibility measures are called “opt-ins” because the application is opting into the new behavior: if a legacy application were ported to the newest interfaces without any changes, it still would get the old RGB rasterizer. A change to the source code, not just a recompile, was needed to get the new MMX rasterizer.
But when we simply started enumerating the MMX rasterizer to applications, we found a lot of compatibility breaks. Many applications ran much more slowly, and others crashed or otherwise didn’t work at all. When we debugged the problem, we found that we had to add a few more levels of opt-in to keep legacy applications working. You see, Direct3D had always reported 2 or 3 rasterizers: the “Ramp” rasterizer that ran quickly; the “RGB” rasterizer that was more functional, but slower; and, if hardware was available, the “HAL” rasterizer that made hardware acceleration available[3]. But by adding the MMX rasterizer, we’d increased the number of rasterizers in the API from 3 to 4 if hardware acceleration was available, and several different compatibility breaks became possible:
- Some applications had statically declared a 3-element array of rasterizer handles that was now being overflowed by our 4-element response, causing the application to crash;
- Some applications were assuming that if 3 rasterizers were enumerated, then hardware must be available;
- Some applications were further assuming that if 3 rasterizers were enumerated, then the third must be the HAL[4].
The fix was to make enumeration of the MMX rasterizer an implicit opt-in if the application was accessing Direct3D using a new interface. We also wrote better sample code – that was where the statically declared arrays had come from[5].
After being burned a few times by promising that existing applications will continue to work on new operating systems (or new versions of Direct3D) and then delivering disappointing results, interface designers start to learn tactics that make everyone’s job easier[6]: not just their own, but also the jobs of the developers who use their interfaces. With that in mind, let’s take a close look at the 3D memcpy APIs that were added in CUDA 2.0.
An API Design Headache
Set aside the complexities of 3D texturing for a moment and focus on the difficulties presented by just copying the memory from here to there. For starters, CUDA has three (3) different memory types: host memory[7], device memory, and CUDA arrays. Since CUDA arrays come in 2D and 3D variants, we also needed to support 1D, 2D and 3D memory copies. And because CUDA arrays’ layout is opaque to developers, the memory copy interface must support offset values (X, Y and Z).
Designing a set of distinct memory copy functions that covered the full Cartesian product of possibilities would have been an API design and development nightmare, accompanied by a developer education headache[8]. Between 3 source memory types, 3 destination memory types, and 3 dimensionalities for the memcpy’s, twenty-seven (27) memory copy functions would have had to be designed, implemented, and documented. Just coming up with a heuristic for naming would be a chore, reminiscent of the naming problem confronted by the designers of SIMD intrinsics.
For example, a function to copy a 2D slice of device memory into a 3D CUDA array might look something like this:
CUresult cuMemcpy2D_2DDto3DA( CUarray dstArray, size_t offsetX, size_t offsetY, size_t offsetZ, CUdeviceptr src, size_t srcPitch, size_t WidthInBytes, size_t Height );
The offset parameters are needed for CUDA arrays because the layout is opaque to developers. In this interface design, callers who are copying from a subset of a 2D array in device memory are expected to do the address arithmetic to specify the correct base pointer (for type T, this could be computed with an expression such as ((int8_t *) base)+srcY*srcPitch+srcX*sizeof(T) – perfectly intuitive)[9].
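To make that delegated arithmetic concrete, a hypothetical helper (not part of any CUDA header) that computes the base pointer a caller of cuMemcpy2D_2DDto3DA() would have to pass might look like this:

#include <cuda.h>

/* Hypothetical helper: compute the device pointer to element (srcX, srcY) of
   a pitched 2D allocation whose elements are elementSize bytes each, as a
   caller of cuMemcpy2D_2DDto3DA() would be expected to do. */
static CUdeviceptr
devicePtrAt2D( CUdeviceptr base, size_t srcX, size_t srcY, size_t pitch, size_t elementSize )
{
    /* CUdeviceptr is an integer type, so plain integer arithmetic suffices */
    return base + srcY*pitch + srcX*elementSize;
}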
An alternative design, a single entry point whose parameters are simply ignored when not relevant (say, the height parameter when a 1D memory copy is being requested), would be more tractable. Such a function might look like this:
CUresult cuMemcpy3D( void *dstHost, CUdeviceptr dstDevice, size_t dstPitch, CUarray dstArray, size_t dstXInBytes, size_t dstY, size_t dstZ,
                     const void *srcHost, CUdeviceptr srcDevice, size_t srcPitch, CUarray srcArray, size_t srcXInBytes, size_t srcY, size_t srcZ,
                     size_t WidthInBytes, size_t Height, size_t Depth );
This function has a dizzying number of parameters (17, in fact), but it does cover all of the use cases. The memory types of the source and destination would be inferred from whichever of srcHost/srcDevice/srcArray and dstHost/dstDevice/dstArray was non-NULL. (If more than one is specified, the function should fail.) The offset parameters are needed for CUDA arrays, since the layout is opaque to developers; the API designer then would have to decide whether to respect them if they are non-zero for host and device memory, or specify that they are ignored unless the participating memory type is a CUDA array. Finally, specifying a height of 0 and/or a depth of 0 naturally describes a 1D or 2D memory copy.
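To sketch what that inference might look like on the source side (the destination side would be symmetric; the helper name and the choice of error code are illustrative, not part of any shipped API):

#include <stddef.h>
#include <cuda.h>

/* Hypothetical validation helper for the 17-parameter mega-function: exactly
   one of the source parameters may be specified, and the source memory type
   is inferred from which one it is. */
static CUresult
inferSrcMemoryType( const void *srcHost, CUdeviceptr srcDevice, CUarray srcArray, CUmemorytype *pType )
{
    int numSpecified = (srcHost != NULL) + (srcDevice != 0) + (srcArray != NULL);
    if ( numSpecified != 1 )
        return CUDA_ERROR_INVALID_VALUE;  /* none, or more than one, specified */
    if ( srcHost )
        *pType = CU_MEMORYTYPE_HOST;
    else if ( srcDevice )
        *pType = CU_MEMORYTYPE_DEVICE;
    else
        *pType = CU_MEMORYTYPE_ARRAY;
    return CUDA_SUCCESS;
}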
The problem with this function is that a single call to it occupies at least five lines of code to specify the 17 parameters. This interface is cumbersome at best! But a better alternative to this mega-function is available, and that’s what we wound up shipping in CUDA: all of the parameters are collected into a single structure that enables the API client to describe the source and destination memory ranges separately, along with the location and size of the copy itself. The resulting structure has about two dozen members:
typedef struct CUDA_MEMCPY3D_st {
    size_t srcXInBytes;         /**< Source X in bytes */
    size_t srcY;                /**< Source Y */
    size_t srcZ;                /**< Source Z */
    size_t srcLOD;              /**< Source LOD */
    CUmemorytype srcMemoryType; /**< Source memory type (host, device, array) */
    const void *srcHost;        /**< Source host pointer */
    CUdeviceptr srcDevice;      /**< Source device pointer */
    CUarray srcArray;           /**< Source array reference */
    void *reserved0;            /**< Must be NULL */
    size_t srcPitch;            /**< Source pitch (ignored when src is array) */
    size_t srcHeight;           /**< Source height (ignored when src is array; may be 0 if Depth==1) */
    size_t dstXInBytes;         /**< Destination X in bytes */
    size_t dstY;                /**< Destination Y */
    size_t dstZ;                /**< Destination Z */
    size_t dstLOD;              /**< Destination LOD */
    CUmemorytype dstMemoryType; /**< Destination memory type (host, device, array) */
    void *dstHost;              /**< Destination host pointer */
    CUdeviceptr dstDevice;      /**< Destination device pointer */
    CUarray dstArray;           /**< Destination array reference */
    void *reserved1;            /**< Must be NULL */
    size_t dstPitch;            /**< Destination pitch (ignored when dst is array) */
    size_t dstHeight;           /**< Destination height (ignored when dst is array; may be 0 if Depth==1) */
    size_t WidthInBytes;        /**< Width of 3D memory copy in bytes */
    size_t Height;              /**< Height of 3D memory copy */
    size_t Depth;               /**< Depth of 3D memory copy */
} CUDA_MEMCPY3D;
To call this API, typically the developer declares one of these structures on the stack, then passes the structure to the API itself[10]:
CUresult cuMemcpy3D( const CUDA_MEMCPY3D *cp );
Soon after I checked in the first draft of this API, I got an irate email from a fellow NVIDIAn who thought it was way too verbose. The exact wording is lost in the mists of time, but I remember a lot of heat! What he didn’t understand from a superficial reading of the API was that calling this function would only be about as complicated as the operation being requested.
For example, a simple 1D memcpy from host to device memory may be implemented as follows:
CUresult
cuMemcpyHtoD_via3D( CUdeviceptr dst, const void *src, size_t bytes )
{
    CUDA_MEMCPY3D cp = {0};
    cp.dstMemoryType = CU_MEMORYTYPE_DEVICE;
    cp.srcMemoryType = CU_MEMORYTYPE_HOST;
    cp.dstDevice = dst;
    cp.srcHost = src;
    cp.WidthInBytes = bytes;
    return cuMemcpy3D( &cp );
}
The declaration of cp zero-initializes the structure using a C programming construct that was valid in the 1970s, when the First Edition of the White Bible came out. As long as the interface defines reasonable zero defaults (such as the offsets and the unused pointers), callers can use this idiom to write compact, intuitive code.
A more complicated memory copy, like the 2D memcpy function we declared earlier, could be implemented as follows:
CUresult cuMemcpy2D_2DDto3DA( CUarray dstArray, size_t offsetX, size_t offsetY, size_t offsetZ, CUdeviceptr src, size_t srcPitch, size_t WidthInBytes, size_t Height )
{
    CUDA_MEMCPY3D cp = {0};
    cp.dstMemoryType = CU_MEMORYTYPE_ARRAY;
    cp.dstArray = dstArray;
    cp.dstXInBytes = offsetX;
    cp.dstY = offsetY;
    cp.dstZ = offsetZ;
    cp.srcMemoryType = CU_MEMORYTYPE_DEVICE;
    cp.srcDevice = src;
    cp.srcPitch = srcPitch;
    cp.WidthInBytes = WidthInBytes;
    cp.Height = Height;
    return cuMemcpy3D( &cp );
}
The structure-based interface delivers the best of both worlds: the caller can express any valid operation, but the function invocation is only as complicated as the operation requested! The main downside is that when implementing such an interface, parameter validation can be Byzantine in complexity. As a result, function call latency can become a concern, as small operations may spend more time validating the call than performing the requested operation. NVIDIA resolved this issue with CUDA graphs, which record operations for later playback and enable parameter validation to be done at the recording stage.
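As an illustration of that last point, here is a sketch using the (much newer) CUDA graph APIs of how a driver API client might record a copy for later playback; it assumes a current context and a populated CUDA_MEMCPY3D descriptor, and the wrapper function is hypothetical:

#include <stddef.h>
#include <cuda.h>

/* Sketch: record a 3D memcpy into a CUDA graph so that its parameters are
   validated when the node is added, rather than every time the copy runs. */
CUresult
recordCopyIntoGraph( CUgraph *phGraph, const CUDA_MEMCPY3D *cp )
{
    CUcontext ctx;
    CUgraphNode copyNode;
    CUresult status = cuCtxGetCurrent( &ctx );
    if ( CUDA_SUCCESS != status )
        return status;
    status = cuGraphCreate( phGraph, 0 );
    if ( CUDA_SUCCESS != status )
        return status;
    /* The copy parameters are validated here, once */
    status = cuGraphAddMemcpyNode( &copyNode, *phGraph, NULL, 0, cp, ctx );
    /* The caller would then instantiate the graph (cuGraphInstantiate) and
       launch it repeatedly (cuGraphLaunch); those calls are omitted here
       because their signatures have varied across CUDA versions. */
    return status;
}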
With benefit of hindsight, there is one change to the interface that I wish I had thought to make before we shipped. When I designed the interface, I knew that assigning zero to the structure was valid K&R C from the 1970s – handy when zero is a good default value for most parameters! But if I’d eaten my own dogfood just a bit more, I might have thought to put the memory types at the beginning of the structure. They always are needed – you can’t specify a memory copy without specifying the types of the memory participating – and the developer education would’ve been even more intuitive, with the destination and source memory types specified at the site of the structure declaration. This idiom takes advantage of another subtlety of K&R C, namely that if only the first few members of a structure are initialized, the remainder of the structure is zero-initialized. Here’s a rewrite of our 2D-device-to-3D-array memcpy function that does just that:
CUresult cuMemcpy2D_2DDto3DA( CUarray dstArray, size_t offsetX, size_t offsetY, size_t offsetZ, CUdeviceptr src, size_t srcPitch, size_t WidthInBytes, size_t Height )
{
    CUDA_MEMCPY3D cp = {CU_MEMORYTYPE_ARRAY, CU_MEMORYTYPE_DEVICE};
    cp.dstArray = dstArray;
    cp.dstXInBytes = offsetX;
    cp.dstY = offsetY;
    cp.dstZ = offsetZ;
    cp.srcDevice = src;
    cp.srcPitch = srcPitch;
    cp.WidthInBytes = WidthInBytes;
    cp.Height = Height;
    return cuMemcpy3D( &cp );
}
Named Parameters
C++20 gives us an even more intuitive way to access such interfaces, first by enabling the parameter structure to be declared inline with the function invocation (not always what the developer wants, of course), but also by enabling the structure members to be named as they are initialized. In C++20, echoing the C++ requirement that constructors of embedded members be invoked in the order in which those members are declared, the named structure members must be specified in declaration order; since CUDA_MEMCPY3D declares its source members before its destination members, the source fields come first below. The resulting wrapper function isn’t much more intuitive-looking than the 1970s-era K&R edition, but your mileage may vary.
CUresult cuMemcpy2D_2DDto3DA( CUarray dstArray, size_t offsetX, size_t offsetY, size_t offsetZ, CUdeviceptr src, size_t srcPitch, size_t WidthInBytes, size_t Height )
{
    CUDA_MEMCPY3D cp = {
        .srcMemoryType = CU_MEMORYTYPE_DEVICE,
        .srcDevice = src,
        .srcPitch = srcPitch,
        .dstXInBytes = offsetX,
        .dstY = offsetY,
        .dstZ = offsetZ,
        .dstMemoryType = CU_MEMORYTYPE_ARRAY,
        .dstArray = dstArray,
        .WidthInBytes = WidthInBytes,
        .Height = Height };
    return cuMemcpy3D( &cp );
}
Next: we’ll take a look at applying this API design methodology to BLAS.
[1] For reference, CUDA 1.1 added streams and events, which have proven to be useful and future-proof over the last 15 years.
[2] They must not have minded that much, because cubemaps and mipmapped textures were added sometime after I left NVIDIA. Maybe OptiX needed them.
[3] With benefit of hindsight, it seems ridiculous, but in the mid-1990s some game developers wanted to keep rasterization on the CPU, where they could retain complete control. They prided themselves on their ability to write optimized software renderers, and some were loath to get rid of their most valuable IP. Reportedly it took an in-person visit by 3Dfx to convince John Carmack that the novel visibility methods he’d developed for the Quake engine could be combined with hardware-accelerated rasterization to make an even better product.
[4] “Hardware abstraction layer,” a term that I believe originated with Dave Cutler’s original Windows NT kernel, whose HAL enabled the NT kernel to be portable across CPU architectures. Cutler’s HAL ran in kernel mode and abstracted operations like mapping page tables, and ours ran in user mode and abstracted operations like drawing triangles.
[5] We had shipped several versions of DirectX before we realized that sample code should be written by the best developers available, since that is the code that developers copy and paste to build their own applications.
[6] Interface designers must account for their clients’ needs, or no one will use their APIs. The metric is ROI: the interface must deliver a return on the investment that justifies the cost of learning the interface. The first version of Direct3D that used “execute buffers” failed in this mission – it was too difficult to use, and the promised benefits of execute buffers were never delivered – and almost caused the API to be cancelled.
[7] The CUDA driver treats page-locked and pageable memory differently, but those differences generally are transparent to the API client.
[8] As it happens, the CUDA runtime did expose multiple entry points to cover some of the different use cases, for example, cudaMemcpy2D() and cudaMemcpy2DFromArray().
[9] When interface designers create work for their clients, you may hear that referred to as “reverse delegation.” Sometimes it is the best choice for an interface, especially if different clients may want to do the operation differently. In the case of computing offsets into 2D linear arrays in device memory, we decided to take offset parameters and do the pointer math on behalf of our client.
[10] Another decision we had to make for this function was whether to embed the CUstream parameter for asynchronous variations. For orthogonality with the existing APIs for asynchronous memcpy, I kept the synchronous and asynchronous variants as separate functions, cuMemcpy3D() and cuMemcpy3DAsync(), with the stream parameter only on the latter.