
Managed Memory and Segmentation

At the GPU Technology Conference this year, I ran into an old colleague from NVIDIA and the topic of managed memory came up. He related that earlier in the conference, the Q&A session after a Dell-sponsored presentation on managed memory had suffered a serious decline in the level of discourse. Neither of us had been in attendance, but apparently one questioner after another stood up and asked the presenters, in effect, “Where are you going with this?”

He made it sound like the presenters thought they were going to be pelted with rotten fruit!

Why the controversy? Managed memory is supposed to make CUDA programming simpler. It is intended to eliminate the need to copy data back and forth between CPU and GPU memory. If it were performance-neutral like, say, C++ lambdas, then managed memory would be a welcome addition to CUDA. The problem is that any feature that presents a risk of degrading performance will be viewed with skepticism by the CUDA programming community – because no one programs CUDA for fun.

What is it about CUDA that makes it so difficult to automatically manage residency of memory?

Reflecting on the answer to this question brought me back to a conversation I had at a different GTC, so long ago that I’m not sure which year it occurred. I had just met Daniel Moth, the Program Manager at Microsoft for C++ AMP. Once it was firmly established that we were fellow travelers, charting the technical roadmaps for competing data parallel programming environments, he had a question.

“Tell me one thing,” he asked. “Why do you need streams and events?”

I had to think for a minute. Why indeed? I’d added the feature in CUDA 1.1, to cover new hardware that could DMA host memory concurrently with kernel execution; but it was already clear that the new abstractions were future-proof to multiple kernels executing concurrently, and even coordinating execution between multiple GPUs.

“Streams are like CPU threads,” I told him stupidly, quoting from the original design document I’d written in 2007. “Operations that are done in different streams can happen concurrently. And you need events to coordinate execution between streams.”

“But we don’t need that stuff in C++ AMP. The stuff that can be done in parallel, we just do it in parallel.”

After a few minutes’ conversation, the key difference emerged and I finally had it.

“Oh,” I cried. “CUDA has a flat address space!”

C++ AMP does not.

CUDA’s address space causes more trouble than is widely appreciated. Because pointers can be stored in device memory, any CUDA kernel can attempt to access any address. In early versions of CUDA, when paging was not supported (every byte of virtual memory was backed by physical memory) and systems with multiple GPUs were rare (and certainly not for sale in the public cloud), a flat address space seemed to make sense. That impression was bolstered by the hardware design community’s ideological commitment to linear address spaces, which had taken root after a divisive debate contrasting linear address spaces with segmentation.

Linear Addressing versus Segmentation

Segmentation is the idea that memory should be modeled as a set of discrete buffers with base pointers and lengths, rather than assigning an address (like a PO Box) to each byte of memory. Segmented memory is accessed via a segment/offset tuple instead of by a single address, a paradigm that is implemented at the hardware level. Intel’s x86 architecture was segmented from the beginning (c. 1976). It provided for 4 segments to be accessible at any given time: the segment registers CS, DS, SS, and ES were for code, data, stack, and “extra” data, respectively. Each segment register had a base address and a length, and most machine instructions implicitly referenced a segment that represented a sensible default. The PUSH and POP instructions that operate on the stack implicitly referenced the stack segment (SS). Loads and stores from memory implicitly used DS, the data segment, unless that default was overridden with a segment-override prefix. For example, the SS: prefix could be used to operate on stack memory.

The problem with segments was that they made code difficult to compose: even simple operations like function calls were complicated by potential differences between the segment register settings needed by the caller and callee. The callee could save and restore its segment registers at the subroutine boundary, but that hurt performance. More typically, developers would select a “memory model” with fixed segmentation usage that was appropriate for their application. So-called “large” memory models would just specify a segment:offset tuple for every address; under MS-DOS, this amounted to a cheesy way to enable 20-bit addressing with 32-bit addresses, or 1M of memory with 4G worth of address width. It also hurt performance since every load and store needed a segment override.
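To make the arithmetic concrete, here is a tiny sketch (ordinary host code; the example values are mine) of how a real-mode 16-bit segment and 16-bit offset combine into a 20-bit physical address, and why many different segment:offset pairs alias the same byte:

#include <cstdio>

// Real-mode x86 address translation: a 16-bit segment and a 16-bit offset
// combine into a 20-bit physical address, so 32 bits' worth of
// segment:offset pairs address only 1 MB of memory, and many different
// pairs alias the same byte.
static unsigned physicalAddress( unsigned segment, unsigned offset )
{
    return ((segment & 0xFFFF) << 4) + (offset & 0xFFFF);  // segment*16 + offset
}

int main()
{
    printf( "%05X\n", physicalAddress( 0x1234, 0x0010 ) );  // prints 12350
    printf( "%05X\n", physicalAddress( 0x1235, 0x0000 ) );  // prints 12350 (same byte)
    return 0;
}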

Segmentation introduced difficult, but solvable, problems for developers of individual applications; but even 25 years ago, it was clear that plugin architectures like OLE automation would play a central role in future software development. Being able to load code and data dynamically into an application and have it “just work,” without having to worry about segments, was of paramount importance. The ability for libraries to efficiently access their callers’ data, and process it on their behalf, overrode the concerns that buggy code could corrupt data that happened to be accessible.

Segmentation and flat addressing can be reconciled by enabling large segment offsets and having the operating system map all the segments to cover the same address range. This usage was anticipated when the Intel i386 was released in 1986, and implemented in 32-bit multitasking operating systems like UNIX (or Microsoft’s long-lost Xenix), and later, OS/2 and Windows NT. This paradigm was so popular, and the need for segmentation support in hardware so unclear, that AMD mostly did away with segment registers when they revised x86 to enable 64-bit addressing in the early aughts.

When I wrote the specification for CUDA textures, with a clear separation between memory and views on the memory (CUDA arrays and texture/surface references, respectively), it quickly became clear that CUDA arrays were effectively segmentation. A CUDA kernel can’t access just any CUDA array; the CUDA driver must predeclare the CUDA arrays to be accessed by a kernel. Coupled with other per-launch parameters, such as the amount of shared memory and the number of registers needed, a CUDA kernel launch more closely resembles a container launch than a subroutine call.
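The legacy texture reference API makes that predeclaration explicit: the CUDA array a kernel will read must be bound to a texture reference before the launch. A rough sketch (the kernel, array, and sizes are illustrative):

// The view: a texture reference, declared at module scope.
texture<float, 2, cudaReadModeElementType> texRef;

__global__ void ReadRow( float *out, int width )
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    if ( x < width )
        out[x] = tex2D( texRef, x + 0.5f, 0.5f );
}

int main()
{
    const int width = 256, height = 256;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *array;
    float *out;

    // The memory: a CUDA array, which a kernel cannot reach through an
    // ordinary pointer. The driver must be told, before launch, that the
    // kernel will read it.
    cudaMallocArray( &array, &desc, width, height );
    cudaMalloc( &out, width*sizeof(float) );

    cudaBindTextureToArray( texRef, array, desc );   // predeclare the "segment"
    ReadRow<<<(width+127)/128, 128>>>( out, width );
    cudaDeviceSynchronize();

    cudaUnbindTexture( texRef );
    cudaFreeArray( array );
    cudaFree( out );
    return 0;
}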

A key reason segmentation was an abject failure for general-purpose computer architectures was the high cost of “switching segments” on a per-instruction basis. On x86, instructions such as LDS (load data segment) were costly; instruction prefixes to change the segments being operated on by a given instruction added complexity; and naïve systems that kept segment:offset tuples for all pointers essentially wasted addressing bits. Now that we have 64-bit addressing, it is possible to envision having page tables play the role of segments (by introducing a byte-granular limit to page table size), as argued in this blog post. For now, however, there is a decisive consensus in favor of flat address spaces.

What does all this have to do with managed memory?

By implementing segmentation on a per-kernel basis instead of a per-machine-instruction basis, GPU computing technologies get many of the benefits of segmentation, without the costs that hindered adoption on the CPU side. Kernels may take slightly longer to launch than they would otherwise, but the cost of a kernel launch is high enough that the additional cost of segmentation is negligible. And if each kernel launch predeclares the needed segments, the system can infer residency requirements, ensure coherency, and identify parallelism opportunities, much in the same manner that superscalar CPUs use real-time dependency analysis to identify which instructions can execute in parallel.

What do you mean by “infer residency requirements,” you ask? You guessed it: managed memory!

What do you mean by “identify parallelism opportunities,” you ask? You guessed it: automatic CUDA streams!
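None of this exists in CUDA today; purely as a thought experiment, here is a sketch (every name below is invented) of the dependency analysis a runtime could perform if every launch predeclared its segments:

// Hypothetical sketch, not a real CUDA API: what kernel launches that
// predeclare their "segments" might let a runtime infer.
#include <cstdio>
#include <cstddef>

struct Segment { void *host; size_t bytes; };
enum   Access  { ReadOnly, ReadWrite };
struct Binding { const Segment *seg; Access access; };

// Two launches may run concurrently if no segment written by one is touched
// by the other: the same dependency analysis a superscalar CPU applies to
// instructions. (Simplification: conflicts are tracked per segment, not per
// byte range.)
static bool canRunConcurrently( const Binding *a, int na,
                                const Binding *b, int nb )
{
    for ( int i = 0; i < na; i++ )
        for ( int j = 0; j < nb; j++ )
            if ( a[i].seg == b[j].seg &&
                 (a[i].access == ReadWrite || b[j].access == ReadWrite) )
                return false;
    return true;
}

int main()
{
    Segment input  = { 0, 1<<20 };
    Segment output = { 0, 1<<20 };
    Binding launchA[] = { { &input, ReadOnly }, { &output, ReadWrite } };
    Binding launchB[] = { { &input, ReadOnly } };

    // Both launches only read 'input'; neither touches what the other
    // writes, so a runtime could issue them into separate streams without
    // the programmer ever creating a stream or an event. The same bindings
    // tell the runtime which segments must be resident before each launch.
    printf( "concurrent: %s\n",
            canRunConcurrently( launchA, 2, launchB, 1 ) ? "yes" : "no" );
    return 0;
}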

What about coherency? Not much would change here. The CUDA driver already uses software mechanisms to enforce coherency, for example, by inserting cache-invalidate instructions into the command stream before launching kernels that read from texture. In a segmented memory architecture, read-only segments can be copied where they are needed, then discarded without having to worry about propagating changes to the data. Writeable segments could be copied back wholesale, or using dirty bit optimizations.

So, it is not hard to imagine a GPU computing technology that uses segmentation to manage memory rather than a linear address space. In fact, we do not have to imagine: C++ AMP is already here, and CUDA programs that used only CUDA arrays would have the properties needed to automate residency and parallelization. As a side note, the WDDM display driver model introduced in Windows Vista embraced a segmented memory architecture for paging.

Let’s review some of the deficiencies in managed memory, as discussed in my previous blog. It attempts to infer residency requirements based on memory accesses – which hurts performance and breaks the First Law of CUDA Development. It breaks the useful ability to infer the “owner” (CPU or which GPU) of a given address in the Unified Virtual Address Space. The semantics of multi-engine and multi-GPU memory management are complicated, and require hinting. Even if we set aside warranted skepticism about whether the hinting will be future-proof (I have my doubts), it introduces enough complexity that managed memory does not compare favorably to the static, affinitized allocations of CUDA 1.0 or to segment-based architectures like C++ AMP.
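For concreteness, the hinting in question looks like this as of CUDA 8 (a sketch; the function name, buffer size, and device number are illustrative):

// Hints that "enlighten" the managed memory subsystem (CUDA 8 and later).
void *allocateHintedBuffer( size_t bytes, int device )
{
    void *p = 0;
    cudaMallocManaged( &p, bytes );

    // Advise the driver that the buffer will mostly be read, and that the
    // given device should be treated as its preferred home.
    cudaMemAdvise( p, bytes, cudaMemAdviseSetReadMostly, device );
    cudaMemAdvise( p, bytes, cudaMemAdviseSetPreferredLocation, device );

    // Migrate the buffer to the device up front, rather than waiting for
    // page faults to pull it over piecemeal.
    cudaMemPrefetchAsync( p, bytes, device, 0 /* default stream */ );
    return p;
}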

Managed Memory: Belated Comments on Implementation

Managed memory is a new-ish CUDA feature that aspires to do away with the need to explicitly copy CPU memory to and from GPU memory. Introduced in CUDA 6.0, its initial implementation was unusably slow. (For example, copying managed memory from GPU to CPU memory ran at 512MB/s, 25x slower than an asynchronous memcpy.)
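For readers who have not used the feature, here is a minimal sketch of what it looks like (the kernel and sizes are illustrative): the same pointer is valid on both CPU and GPU, and no explicit cudaMemcpy appears anywhere.

#include <cstdio>

__global__ void ScaleKernel( float *p, float k, int n )
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if ( i < n ) p[i] *= k;
}

int main()
{
    const int n = 1<<20;
    float *p;
    cudaMallocManaged( &p, n*sizeof(float) );        // one pointer, visible to CPU and GPU
    for ( int i = 0; i < n; i++ ) p[i] = (float) i;  // CPU writes (dirties pages)
    ScaleKernel<<<(n+255)/256, 256>>>( p, 2.0f, n ); // driver migrates pages to the GPU
    cudaDeviceSynchronize();                         // CPU must synchronize before touching p
    printf( "%f\n", p[100] );                        // read-back brings pages to the CPU
    cudaFree( p );
    return 0;
}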

When they undertook to build the managed memory feature, NVIDIA had many different implementation strategies they could have pursued. As best I can tell, here is a summary of their implementation strategy:

  1. Upon allocation of managed memory, the CUDA driver allocates device memory, plus a pageable range of CPU memory at the same virtual address range.
  2. The CUDA driver uses dirty page bits to track which 4K pages have been “touched” by the CPU.
  3. Upon kernel launch, the CUDA driver unmaps the managed memory from the CPU and copies the dirty pages from the CPU to the GPU. Unmapping the CPU memory removes the risk of write-after-read hazards, in which the CPU corrupts managed memory before the GPU has copied it.
  4. While CUDA kernels run, the device memory copy of the managed memory is the only valid one.
  5. Upon CPU/GPU synchronization, the CPU buffer is made accessible again, but is not copied wholesale from GPU memory. It is possible the GPU’s hardware does not have the same dirty bit tracking facilities as the CPU, or perhaps NVIDIA just thought it would be preferable to copy device memory back to the CPU “on demand.”
  6. Copying managed device memory back to host memory is prompted by page faults: when the CPU attempts to access a page of managed memory, the CUDA driver handles the page fault by copying the 4K of GPU memory to CPU memory.

The application I used to investigate NVIDIA’s managed memory implementation is only about 60 lines of code. The key component is a function usPerLaunch that allocates a specified amount of managed memory, launches a NULL kernel, synchronizes with the GPU, then optionally “touches” the managed memory to force the CUDA driver to copy it back to host memory. (In an earlier version of this test, I confirmed that CUDA lazily copies only “dirty” pages in the other direction, as NVIDIA claims in its documentation.)

Here is the function in question – it may be found in cudahandbook/concurrency/managedOverhead.cu:

const size_t pageSize = 4096;
template<bool bTouch>
double
usPerLaunch( int cIterations, size_t cPages=0 )
{
    cudaError_t status;
    double microseconds, ret;
    chTimerTimestamp start, stop;   // portable timing helpers from the book's sources
    void *p = 0;

    // cuda() is the book's error-checking macro: it records the return value
    // in status and jumps to Error: on failure.
    cuda(Free(0) );
    if ( cPages ) {
        cuda(MallocManaged( &p, cPages*pageSize ) );
    }

    chTimerGetTime( &start );
    for ( int i = 0; i < cIterations; i++ ) {
        NullKernel<<<1,1>>>();
        cuda(ThreadSynchronize() );
        if ( bTouch && 0 != p ) {
            // Touch one byte per 4K page so the driver must copy it back.
            for ( int iPage = 0; iPage < cPages; iPage++ ) {
                ((volatile unsigned char *) p)[iPage*pageSize] |= 1;
            }
        }
    }
    chTimerGetTime( &stop );

    microseconds = 1e6*chTimerElapsedTime( &start, &stop );
    ret = microseconds / (float) cIterations;
    cudaFree( p );
Error:
    return (status) ? 0.0 : ret;
}

I ran this program on a Haswell-based Windows 7 machine on two NVIDIA GPU boards, one built around the Maxwell-generation GM200 and the other around the Pascal-generation GP100. Although both are large “win” chips, I would expect similar test results to hold true across all Maxwell and Pascal GPUs, since each generation appears to implement a hardware interface that improves managed memory performance.

Launch time (μs) Memory (KB) Bandwidth (MB/s)
47 0 0
105 4 39
104 8 78
115 16 143
134 32 244
213 64 307
381 128 344
649 256 404
1247 512 420
2221 1024 472
4712 2048 445
8458 4096 496
17041 8192 492
33992 16384 494

Table 1. GM200 results.

Launch time (μs) Memory (KB) Bandwidth (MB/s)
39 0 0
47.15 4 7
49.86 8 164
57.84 16 283
59.04 32 555
64.73 64 1012
79.08 128 1657
98.41 256 2664
137.15 512 3823
205.56 1024 5101
391.91 2048 5351
745.81 4096 5624
1543.91 8192 5433
3114.83 16384 5386

Table 2. GP100 launch results.

“Better,” however, does not mean “good.” The most important thing to note is that these kernel launch times are VERY SLOW. You can measure synchronous and asynchronous kernel launch times with the nullKernelSync.cu and nullKernelAsync.cu programs in the same directory. On this machine, those times are 46.35 and 3.25 microseconds, respectively. (In fairness, results likely would be better under Linux, especially the synchronous kernel launch. On Windows 7, launching a CUDA kernel always requires the driver to have the operating system do a user-kernel transition or “kernel thunk.” Sadly, no amount of editing can get around the fact that CUDA kernels and OS kernels are completely different things, and some sentences must refer to both!)

On the Maxwell board, whatever mechanism NVIDIA is using to copy managed memory back from the GPU has a maximum performance of less than 500MB/s. That’s a nonstarter. It is more than 25x slower than the bus bandwidth. Pascal has improved things, but is still less than half the performance of a PCI Express 3.0 link. A CUDA kernel reporting results via mapped pinned memory would achieve much higher performance.
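For comparison, a rough sketch of the mapped pinned memory approach alluded to above (the kernel and values are illustrative): the GPU writes its result directly into host memory over PCI Express, with no page faults or copies afterward.

__global__ void ReportResult( int *result )
{
    if ( threadIdx.x == 0 && blockIdx.x == 0 )
        *result = 42;   // the write travels across the bus into host memory
}

int main()
{
    int *hostResult, *deviceResult;

    cudaSetDeviceFlags( cudaDeviceMapHost );   // enable mapped pinned memory
    cudaHostAlloc( (void **) &hostResult, sizeof(int), cudaHostAllocMapped );
    cudaHostGetDevicePointer( (void **) &deviceResult, hostResult, 0 );

    ReportResult<<<1,1>>>( deviceResult );
    cudaDeviceSynchronize();

    // *hostResult is already valid on the CPU; no memcpy, no page fault.
    return (*hostResult == 42) ? 0 : 1;
}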

Superficially, NVIDIA’s implementation makes sense, assuming there is one CPU and one GPU and that the application isn’t doing any fancy tricks with CPU/GPU concurrency. The main mistake in their implementation was failing to speculatively copy extra pages back from the GPU to the CPU in Step 6, an oversight that seems to have been remedied in subsequent releases. The cost of servicing a page fault is dominated by interrupt handling, not by copying a single 4K page, so it makes sense to copy more pages per fault until the cost of the additional copying becomes non-negligible.
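A back-of-the-envelope sketch makes the point; the overhead and bandwidth numbers below are assumptions chosen for illustration, not measurements:

#include <cstdio>

int main()
{
    // Assumed numbers, for illustration only.
    const double faultOverheadUs = 25.0;   // interrupt + driver handling per fault
    const double busBandwidthGBs = 6.0;    // attainable PCI Express 3.0 copy rate
    // GB/s * 1e3 = bytes per microsecond, so a 4K page costs well under 1 us.
    const double pageCopyUs = 4096.0 / (busBandwidthGBs*1e3);

    // Amortizing the fixed fault overhead over more pages per fault drives
    // the per-page cost down until the copy time itself starts to dominate.
    for ( int pagesPerFault = 1; pagesPerFault <= 64; pagesPerFault *= 2 ) {
        double usPerPage = (faultOverheadUs + pagesPerFault*pageCopyUs) / pagesPerFault;
        printf( "%2d pages/fault -> %6.2f us per page\n", pagesPerFault, usPerPage );
    }
    return 0;
}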

Less clear, however, is the optimal behavior of managed memory in a system with multiple GPUs. Does a managed memory buffer get allocated for each GPU? When a kernel is launched on GPU 0, do the other GPUs get copies of the managed memory? Which memory ranges are valid for which GPUs as kernels are executing? And it seems clear that managed memory can’t possibly retain the property that the “owning” device can be inferred from a UVA address, by e.g. calling cudaPointerGetAttributes().

The paradigm also breaks for applications that perform memory copies and kernel processing concurrently.

I submit that the APIs needed to “enlighten” the managed memory subsystem to do the right thing are at least as complicated as simply writing the CUDA code to explicitly allocate and copy memory.

More on managed memory soon.