Monthly Archives: August 2017

Managed Memory: Belated Comments on Implementation

Managed memory is a new-ish CUDA feature that aspires to do away with the need to explicitly copy CPU memory to and from GPU memory. Introduced in CUDA 6.0, its initial implementation was unusably slow. (For example, copying managed memory from GPU to CPU memory ran at 512MB/s, 25x slower than an asynchronous memcpy.)

When they undertook to build the managed memory feature, NVIDIA had many different implementation strategies they could have pursued. As best I can tell, here is a summary of their implementation strategy:

  1. Upon allocation of managed memory, the CUDA driver allocates device memory, plus a pageable range of CPU memory at the same virtual address range.
  2. The CUDA driver use dirty page bits to track which 4K pages were “touched” by the CPU.
  3. Upon kernel launch, the CUDA driver would unmap the managed memory from the CPU and copy the dirty pages from the CPU to the GPU. Unmapping the CPU memory removes the risk of write-after-read hazards from the CPU corrupting managed memory before the GPU was able to copy it.
  4. While CUDA kernels run, the device memory copy of the managed memory is the only valid one.
  5. Upon CPU/GPU synchronization, the CPU buffer is made accessible again, but is not copied wholesale from GPU memory. It is possible the GPU’s hardware does not have the same dirty bit tracking facilities as the CPU, or perhaps NVIDIA just thought it would be preferable to copy device memory back to the CPU “on demand.”
  6. Copying managed device memory back to host memory is prompted by page faults: when the CPU attempts to access a page of managed memory, the CUDA driver handles the page fault by copying the 4K of GPU memory to CPU memory.

The application I used to investigate NVIDIA’s managed memory implementation is only about 60 lines of code. The key component is a function usPerLaunch that allocates a specified amount of managed memory, launches a NULL kernel, synchronizes with the GPU, then optionally “touches” the managed memory to force the CUDA driver to copy it back to host memory. (In an earlier version of this test, I confirmed that CUDA lazily copies only “dirty” pages in the other direction, as NVIDIA claims in its documentation.)

Here is the function in question – it may be found in cudahandbook/concurrency/

const size_t pageSize = 4096;
usPerLaunch( int cIterations, size_t cPages=0 )
    cudaError_t status;
    double microseconds, ret;
    chTimerTimestamp start, stop;
    void *p = 0;

    cuda(Free(0) );
    if ( cPages ) {
        cuda(MallocManaged( &p, cPages*pageSize ) );

    chTimerGetTime( &start );
    for ( int i = 0; i < cIterations; i++ ) {
        cuda(ThreadSynchronize() );
        if ( bTouch && 0 != p ) {
            for ( int iPage = 0; iPage < cPages; iPage++ ) {
                ((volatile unsigned char *) p)[iPage*pageSize] |= 1;
    chTimerGetTime( &stop );

    microseconds = 1e6*chTimerElapsedTime( &start, &stop );
    ret = microseconds / (float) cIterations;
    cudaFree( p );
    return (status) ? 0.0 : ret;

I ran this program on a Haswell-based Windows 7 machine on two NVIDIA GPU boards: the NVIDIA GeForce GTX 970 and Titan X (GM200 and GP100, respectively). Although both are large “win” chips, I would expect similar test results to hold true across all Maxwell and Pascal GPUs, since they seem to have implemented a hardware interface that improved managed memory performance.


Launch time (ms) Memory (KB) Bandwidth (MB/s)
47 0
105 4 39
104 8 78
115 16 143
134 32 244
213 64 307
381 128 344
649 256 404
1247 512 420
2221 1024 472
4712 2048 445
8458 4096 496
17041 8192 492
33992 16384 494

Table 1. GM200 results.

Launch time (ms) Memory (KB) Bandwidth (MB/s)
39 0 0
47.15 4 7
49.86 8 164
57.84 16 283
59.04 32 555
64.73 64 1012
79.08 128 1657
98.41 256 2664
137.15 512 3823
205.56 1024 5101
391.91 2048 5351
745.81 4096 5624
1543.91 8192 5433
3114.83 16384 5386

Table 2. GP100 launch results.

“Better,” however, does not mean “good.” The most important thing to note is that these kernel launch times are VERY SLOW. You can measure synchronous and asynchronous kernel launch times with the and programs in the same directory. On this machine, those times are 46.35 and 3.25 microseconds, respectively. (In fairness, results likely would be better under Linux, especially the synchronous kernel launch. On Windows 7, launching a CUDA kernel always requires the driver to have the operating system do a user-kernel transition or “kernel thunk.” Sadly, no amount of editing can get around the sad fact that CUDA kernels and OS kernels are completely different things and some sentences must refer to both!)

On the Maxwell machine, whatever mechanism NVIDIA is using to copy managed memory back from the GPU has a maximum performance of less than 500MB/s. That’s a nonstarter. It is more than 25x slower than the bus bandwidth. Pascal has improved things, but is still less than half the performance of a PCI Express 3.0 link. A CUDA kernel reporting results via mapped pinned memory would achieve much higher performance.

Superficially, NVIDIA’s implementation makes sense, assuming there is one CPU and one GPU and that the application isn’t doing any fancy tricks with CPU/GPU concurrency. The main mistake in their implementation was failing to speculatively copy extra pages back from the GPU to the CPU in Step 6, an oversight that seems to have been remedied in subsequent releases. The overhead of servicing the page fault is so high that it’s dominated by interrupt handling, not copying of a 4K page, so it makes sense to copy more pages on the page fault until the overhead of the additional copying becomes non-negligible.

Less clear, however, is the optimal behavior of managed memory in a system with multiple GPUs. Does a managed memory buffer get allocated for each GPU? When a kernel is launched on GPU 0, do the other GPUs get copies of the managed memory? Which memory ranges are valid for which GPUs as kernels are executing? And it seems clear that managed memory can’t possibly retain the property that the “owning” device can be inferred from a UVA address, by e.g. calling cudaPointerGetAttributes().

The paradigm also breaks for applications that perform memory copies and kernel processing concurrently.

I submit that the APIs needed to “enlighten” the managed memory subsystem to do the right thing, are at least as complicated as simply writing the CUDA code to explicitly allocate and copy memory.

More on managed memory soon.