Why Does CUDA Have A Current Device?

December 14, 2023Uncategorizedadmin

A Tale of Technical Debt

Introduction

Early on in CUDA’s development, I made a change to the driver API that moved the context from being a parameter to every function, to being stored in a TLS (thread local storage) slot.

The interface change resulted in the context parameter being removed from almost every CUDA entry point. For example, the function to create a CUDA array changed from:

CUresult cuArrayCreate( ctx, array, descriptor );

to:

CUresult cuArrayCreate( array, descriptor );

and the context was taken from a TLS slot. TLS slots, described here for the GNU toolchain and here for Microsoft Windows, resemble globals, in that they can be accessed at any time without having been passed as a parameter; but every CPU thread gets its own copy of the contents. Whenever the operating system does a context switch and swaps in a new thread, it updates the registers to make it so the thread gets its own copies of its TLS slots.

The change engendered a conversation among the scant few developers working on CUDA at the time. All other things being equal, statefulness is a bad idea in API design, because it makes software less composable. To examine the implications of changing the scope of state, we’ll first examine the floating point control word (FPCW), an example of statefulness in CPU designs that causes problems.

FPCW (Floating Point Control Word)

The Floating Point Control Word (FPCW) is a special register that dates back to the original 8087 floating point unit (FPU), which first became available from Intel in 1981. As you might surmise from the name, the FPCW contains bit fields that control the precision, round mode, and exception handling behavior of every floating point operation. Although the designers of the IEEE floating point standard anticipated benefits of having a register that implicitly affected every floating point operation, that implicit behavior has had unintended side effects, as described around the Internet in fpcw trashing, revisited, or Third party code is modifying the FPU control word, or Someone’s Been Messing With My Subnormals.

If a function changes the FPCW, the behavior of your floating point code can change as a result of having called a seemingly-innocuous function. There also have been cases where loading DLLs (including the Microsoft C runtime) changed the FPCW, which is especially problematic given that Windows has no ordering guarantees as to the order in which DLLs get loaded!

The problem is that across a function call boundary, there are several equally legitimate considerations concerning the state in question:

The caller may wish to influence the callee’s behavior by controlling the state→the callee must not modify the state.
The callee may be providing the service of setting the state to its caller→the callee modifies the state on behalf of the caller.
The callee may have specific requirements for the state that conflict with the caller’s requirements→the callee must save and restore the state.

Case 1) is reflected in the IEEE floating point specification, which requires that compliant implementations include a round mode in the control word. The directed rounding modes (round toward negative and positive infinity) were added to support interval arithmetic, where numbers are represented by a range (lower and upper bounds) instead of a singular value that is rounded to the precision of the floating point format being used for the computation. To correctly perform arithmetic on intervals, the lower bound must be rounded toward negative infinity and the positive bound must be rounded toward positive infinity. If we denote an interval as a tuple [lwr, upr], interval addition is implemented as follows:

a+b[lwr,upr] = [RoundDown(a.lwr,b.lwr),RoundUp(a.upr,b.upr)]

The reason the standard specified a round mode that implicitly affects behavior, rather than simply defining the operations needed by the standard, is that the designers believed interval arithmetic could be implemented by calling a function twice: compute the lower bound by calling the function with the round mode set to RoundDown, then compute the upper bound by calling the function with the round mode set to RoundUp. The problem, as shown above, is that even primitive arithmetic operations like addition must be done with different round modes in close proximity. If the round mode must be changed with great frequency, the performance hit from the increase in static and dynamic instruction count is exacerbated by a quirk of implementation: updates to FPCW are expensive.

I am a little perplexed as to why the IEEE specification requires that these round modes be included in a persistent state, rather than in each instruction. When the first edition of the standard was ratified, FPUs (floating point units) were still in the early stages of development; it may be that CPU designers simply did not want to waste op code space on round modes. The original FPCW also included mode bits for precision, a design tradeoff that sort of made sense for the Intel 8087 (which has a stack of up to 8 registers in a canonical 80-bit precision) but does not make sense when 64-bit double precision values occupy twice as much register space as 32-bit floats. For this reason, the SSE and, later, AVX variants of the x86 instruction set adopted the model of including the precision of the floating point operation in the instruction.

Cases 2) and 3) came to light during development of Direct3D’s geometry pipeline. For Case 3), we required that the FPCW be set to 32-bit precision, to ensure that divide instructions would execute faster. For Case 2), we found that forcing the FPCW to 32-bit precision caused some software to misbehave because its code relied on the FPCW being set to 64-bit precision. One way to deal with the caller and callee disagreeing on how the FPCW should be set would be to update the ABI to save and restore the FPCW across function call boundaries. The problem is, some functions provide the service of setting the FPCW. In Direct3D9, applications can specify the D3DCREATE_FPU_PRESERVE flag, instructing Direct3D not to tamper with the value of the FPCW: “Set the precision for Direct3D floating-point calculations to the precision used by the calling thread.”

More recent instruction set designs, such as NVIDIA’s GPU instructions for directed rounding, tend to specify the round mode on a per-instruction basis. Hence, CUDA intrinsics like __fadd_rd correspond to instructions that round in a particular direction, as detailed here. Enabling floating point operations to be rounded in different directions on an instruction-by-instruction basis is a better fit with the requirement to implement fast interval arithmetic.

The DEC Alpha instruction set struck an elegant compromise between the two: it encoded the round mode as a 2-bit field in the opcode, and one of the encodings specified that the round mode be retrieved from the floating point control word. This feature enabled Alpha developers to have their cake and eat it too: they had per-instruction control over the rounding, but if they wanted their caller to specify the round mode, they could just compile their apps to use the control word all the time. The only slight oversight in the design was that because only 2 bits were reserved in the op code, and there are four (4) IEEE-compliant round modes, the designers had to pick one of the 4 round modes to be the one that could only be specified through the control word, and unfortunately they chose round-up (round toward positive infinity). Probably they should have given that status to the most common round mode, round-to-nearest-even. If ever there were a round mode that developers wanted to override with the control word, round-to-nearest-even is the one.

What does all that FPCW history have to do with CUDA contexts? Well, just as every floating point operation must have a precision and a round mode, almost every CUDA operation must have a CUDA context. Just as instruction set designers had to choose between specifying the round mode in every instruction, versus inferring it from the control word, we had to choose between having every CUDA function take the context as a parameter, or placing it into less-frequently referenced state where we could infer the value.

Given all the difficulties discussed above with having per-thread state instead of per-instruction (or per-API call) parameters, why would we willingly incur this pain?

Current CUDA Contexts

There is a simple explanation as to why it made sense for us to make CUDA context per-thread. If you go back to the earliest versions of CUDA, you will see that the set-context functions required that a context can only be current to one thread at a time.

By imposing this restriction, we were able to ensure that CUDA contexts were thread-safe, a table stakes requirement for API designers in the mid-2000s and most API designers today. (The CUDA team seems to have revisited this requirement for CUDA Graphs, reverse-delegating thread safety onto API clients.) We were deliberately taking on technical debt, because making CUDA contexts thread-safe would not have been a good use of our scant engineering resources. There were several levels of granularity for thread safety in the driver: not just global and context-wide, but also some more-granular data structures that all had their own mutexes. When multiple levels of scope are involved, deadlock scenarios in large, complex code bases become eminently plausible, and writing the test code to smoke out those bugs didn’t seem like a good use of resources when we could just legislate our way out of the problem.

And we didn’t need multiple CPU threads to “feed” our GPUs; with streams and events, a single CPU thread was adequate to keep a GPU busy. We had plausible use cases where we wanted to concurrently drive multiple GPUs, but the application could just create a thread per GPU and make a different CUDA context current to each one.

The reason I can frame this decision as technical debt is because on the one hand, by ensuring that only one thread could be in a context at a time, we were ensuring that the API was thread-safe; but on the other, once we decided to make CUDA thread-safe, we could seamlessly expose that new capability simply by relaxing that restriction and enabling multiple threads to have a given CUDA context current at a time. CUDA 4.0 (c. 2011) was the first version to enable this functionality.

Technically, there is a compatibility break when you relax parameter validation restrictions; but it is such an innocuous compatibility break that we all take for granted that it won’t impact any real-world applications. Consider that the gold standard of compatibility, the x86 instruction set, also technically breaks compatibility whenever new instructions are added, because formerly-invalid op codes suddenly start having architectural side effects instead of signaling invalid-opcode exceptions. Similarly, relaxing CUDA 4.0’s parameter validation caused any apps that previously had been attempting to attach CUDA contexts to more than one thread at a time, to start succeeding where they had been failing. As with new x86 instructions, the number of CUDA apps that stopped working because NVIDIA relaxed this restriction is zero, or close to it.

Unlike the FPCW, where a clear industry consensus seems to be driving toward more-granular control, switching CUDA contexts is sufficiently uncommon that having the context current to CPU threads still seems like the right tradeoff. If anything, with benefit of hindsight, it made sense to conflate contexts and devices, as the CUDA runtime did. In practice, CUDA now does this, with the abstraction “primary context” being preferred, and having multiple driver API contexts per device being strongly discouraged. But in the mid-2000s, introducing a new, needed abstraction would have been much more difficult than keeping two where one could be hidden at the behest of the toolchain and runtime, exposed later when it made sense (as CUDA modules were hidden, then exposed when NVIDIA decided to add runtime compilation of source code).

Reflection

There is one major decision relating to CUDA contexts that I would’ve made differently, and that is the current-context stack. When we first shipped the ability to detach CUDA contexts (CUDA 2.2, if memory serves), I added cuCtxPushCurrent() and cuCtxPopCurrent() to push and pop the current context. Additionally, the specification for cuCtxCreate() was revised to state that if successful, the newly-created context was pushed onto the calling thread’s context stack.

When I was on-site at NVIDIA headquarters at some point during this design process, Chris Lamb tried patiently to get me to change this API before it shipped, but I was invested in the policy imposition baked into the stack, and he gave me the benefit of the doubt. The intention was to codify Case 3 – an explicit save and restore of the state – into the API set. As Chris pointed out then, and as is now reflected in CUDA’s current API set, developers are accustomed to get/set semantics, not push/pop semantics, and it would have been better to adhere to the conventional wisdom. The reason conventional wisdom is conventional is that it’s usually correct!

Asynchronous Error Handling Is Hard

November 2, 2023Uncategorizedadmin

or: How I Learned To Stop Worrying And Deprecate cudaGetLastError()

Every API designer has struggled with the question of how best to propagate errors to their callers, since before the term “API” was invented. Even decades ago (say 30+ years), interface designers knew to separate the error return from the payload, in functions that return other results to their caller.

Since it is sometimes useful to know what not to do: My favorite example of an antipattern in this department is in the venerable atoi() function in the C runtime, which converts a string to an integer:

int atoi (const char * str);

Since there is no guaranteed-invalid value of int to enable the caller to see that an invalid string (one that cannot be interpreted as an integer) was passed to the function, this interface is fatally flawed. Some implementations even stipulate that the error return is 0, leading to questions about whether the string “0” is a valid input.

The C runtime is lousy with crummy interfaces like this one, and may be the best evidence that the skill sets needed for good language designers and good interface designers are anti-correlated.

Okay – now that we have an example of what not to do, let’s consider the other options for error handling. They fall roughly into three categories:

Exceptions: functions in the API may signal exceptions in the target language. Anyone in the call stack can field the exception by enclosing the API call(s) in a try clause.
Immediate error return: a unified, API-specific error type is defined, with codes that specify the error.
Get-last-error: when an error occurs, a global error state is set that may be queried later by another function in the API.

Let’s start by ruling out exceptions. Even if the target language is known, and supports exceptions, (C++ and Python come to mind), exceptions aren’t always the best choice. After 30 years of exceptions being present in the C++ language, they still are controversial because it is generally difficult to inspect code and know where and why exceptions may be signaled. Raymond Chen, a legendary developer at Microsoft who has worked on Windows for more than 30 years, wrote a few blogs on this topic almost 20 years ago. He had to follow up “Cleaner, More Elegant, and Wrong” with “Cleaner, More Elegant, and Harder To Recognize” because a blanket condemnation of exceptions is hard to support. But when building libraries like the CUDA runtime, or CUDA’s domain-specific libraries such as cuBLAS, cuSOLVER, and the like, it is easy to rationalize reverse-delegating the use of exceptions onto your caller: your client always can, at their own discretion, transmute error returns into exceptions. On the flip side, to incorporate exception handling into such a library (i.e. define an exception type and document which functions will signal which exceptions under which circumstances) is a policy imposition on your client. When you add the consideration that exception handling is language-specific, eschewing exception handling in language-independent library designs becomes an even easier decision.

As far as immediate error returns go: this was the design choice for CUDA from the beginning, with a caveat we’ll get to in the next section. From the beginning, CUDA supported immediate error returns in the form of the CUresult type for the driver API, and cudaError_t for the CUDA runtime. Pretty much all of the API functions return an error status; if they allocate a resource on behalf of the caller, they take a parameter that enables the function to pass back a handle to that resource. That is why cudaMalloc() returns cudaError_t instead of void *, like the C runtime malloc(). Even though cudaMalloc() could have emulated the C runtime’s malloc(), returning a guaranteed-invalid value in the event of failure, it was more important for the CUDA runtime to be self-consistent than to echo the decades-old API design of the C runtime. (If you are jumping up to defend the C runtime, I invite you to review the semantics of the realloc() function. I can do this all day!)

The main problem with immediate error returns is that they can only be checked at the call site. If a client calls a function and it returns an error, and the caller chooses to ignore that error, the details of the error return are lost in the mists of time. But that putative bug is often a feature – in other words, it is common to write code that explicitly ignores possible error returns. Consider a function that allocates two buffers on behalf of its caller:

cudaError_t allocateTwoBuffers( void **bufferA, size_t nA, void **buffer, size_t nB )
{
    cudaError_t err = cudaMalloc( bufferA, nA );
    if ( cudaSuccess == err ) {
        err = cudaMalloc( bufferB, nB );
        if ( cudaSuccess != err ) {
            cudaFree( bufferA );
        }
    }
    return err;
}

Here, the return value from cudaFree() is ignored, because we want to propagate the resource allocation failure, not whatever cudaFree() returns.

And now is a good opportunity to give the C runtime some grace, at least after it was worked over by standards committees, because free() returns void; NULL is specified to be a valid input to free(); and free(NULL) is defined to be a no-op. With benefit of hindsight, it might have been prudent to define similar semantics for the family of CUDA APIs that free resources – not just cudaFree(), but cudaStreamDestroy(), cudaEventDestroy(), and so on. But applications can get the same effect by ignoring the return values from those functions, as illustrated by allocateTwoBuffers() above .

allocateTwoBuffers() is not structured in a way that I’d publish in the CUDA Handbook, because a goto-based error handling idiom scales better if, say, the function is modified to allocate more resources on behalf of its caller.

With exceptions and immediate error returns dealt with, that brings us to get-last-error idioms, which may be found in APIs such as OpenGL with glGetError(), Win32’s GetLastError(), and now CUDA’s cudaGetLastError() and cudaPeekLastError(). It even may be found in the C runtime, in the form of the errno global.

The exact semantics of get-last-error patterns vary from API to API; certainly not every Win32 function updates the error that would be returned from GetLastError(), and not every C runtime function sets errno. But the general idea is that if an error occurs, a global (or on multithreaded systems, thread-local) variable is set that may subsequently be examined to see if an error occurred. The advantage to this approach is that the error is “sticky,” so the error condition can be examined in places other than the call site where an API function returned an error.

I have avoided get-last-error idioms in my APIs, for several reasons:

They are not thread-safe – the “last error” for one CPU thread obviously differs from the “last error” returned to another thread, so the “last error” must be stored in a TLS slot.
They make for a developer education challenge: How often should the last-error be checked? After every call? I do know of a prominent OpenGL application that calls glGetLastError() after every API call. But, no API designer worth their weight in salt would advocate for doubling the number of API calls by their clients. So, how often?
The semantics around setting, examining, and clearing the last-error must be clarified, as evidenced by CUDA having separate cudaGetLastError() and cudaPeekLastError() functions. cudaGetLastError() clears the error code, so it can only be called once downstream of the error. That presents issues that resemble the problem with immediate error codes, which is that they can only be checked at the call site. cudaPeekLastError() gives developers more creative control, but large bodies of code must carefully define their own protocols as to when these various error handling functions are called.
Most importantly, as evidenced by the earlier code fragment allocateTwoBuffers(), it sometimes makes sense for applications to explicitly ignore potential error codes, while passing along error codes that had been returned before.

So CUDA uses immediate error returns. All of CUDA’s functions return error codes, and well-written CUDA applications check every single return value. It is rare for CUDA functions to be unable to fail. As an API designer, I am a big proponent of designing functions that cannot fail, but those are hard to engineer. Typically they entail allocating resources preemptively, so that a resource is known to exist by dint of the code executing at all. (This is why the Windows NT kernel’s family of thread synchronization allocation functions, such as KeInitializeMutex() and KeInitializeSemaphore(), return void. They often are called on memory that was statically allocated, so if the code is running at all, we know the loader successfully allocated the memory where the mutex or semaphore will reside.) CUDA also preallocates resources to decrease the likelihood of a runtime failure; as an example, when CUDA was first built, we made sure every CUDA context preallocated all the staging buffers needed to perform any memcpy. If the allocations of those staging buffers (or any of the other innumerable resources needed by a CUDA context) failed, the CUDA context creation would fail; so anyone with a valid CUDA context handle is in possession of a CUDA context that can perform memcpy’s without ever failing due to a failure to allocate a staging buffer. But CUDA memcpy functions can fail for other reasons, like if the application passes pageable host memory to an asynchronous memcpy. Other examples of functions that are unlikely to fail include cuEventQuery() and cuStreamSynchronize(); but those functions can trigger submissions of pending work to hardware, which can result in failures per the discussion in the next section. So callers still must check their return values.

Considering CUDA is more than 15 years old, there is a surprising diversity in error-handling idioms. Even setting aside exactly what you want to do if an error occurs—print an informative message and exit, propagate the error to your caller, signal an exception, and so on—there is no consensus as to whether to call a templatized utility function or just use a preprocessor macro. Most developers do define macros, though, with names like SAFE_CUDA or CUDART_CHECK (my choice for the CUDA Handbook sample code before I refactored it to use Allan MacKinnon’s clever prefix-macro idiom).

Even the CUDA SDK samples do not have a standardized method of error handling.

Implicit Get-Last-Error

A close examination of CUDA’s immediate-error based API, of course, shows that there is a get-last-error semantic hiding in its putatively immediate error handling regime. cudaDeviceSynchronize() and many CUDA APIs are annotated with language along these lines: “Note that this function may also return error codes from previous, asynchronous launches.” The reason is because any kernel invocation can cause runtime errors that are all-but-impossible to recover from, like dereferencing an invalid pointer. Without providing for more granular error checking, asynchronous APIs (like kernel launch, which always has been asynchronous, even on CUDA 1.0) all but require a get-last-error paradigm. Checking for errors requires CPU/GPU synchronization, which hinders performance. A serviceable precedent to cite here is the way NaNs and INFs propagate in floating point computations – checking every arithmetic operation is too expensive, so having sticky errors propagate into results where they can be checked later, at a coarser grain, was deemed the right tradeoff by the IEEE standards committee.

As best I can tell, it is possible to write completely correct CUDA programs without ever calling cudaGetLastError(). If there are any CUDA functions that set the last-error, but do not return an immediate error code, then NVIDIA made every developer’s job harder for no discernible reason; but there’s no evidence that NVIDIA did that. It looks like they just added the function to more explicitly surface the last-error semantics already baked into the CUDA runtime.

A Prescriptive Remedy

More fine-grained error reporting that didn’t noticeably hinder performance would be relatively easy to add, by the way: just overload CUDA events a bit more. We already have CUDA events that may or may not use blocking waits; why not add an error reporting mechanism? That way, applications could at least bracket which kernel launch caused an error. A new function could query the event’s error status would return the get-last-error status as of the time that the event was signaled:

CUresult cuEventCheckError( CUresult *p, CUevent hEvent, uint32_t Flags );

This hypothetical API could enforce that the event has to be signaled to be a valid input, or you could specify a flag that forces the function to be synchronous (i.e. wait until the event’s wait is satisfied before returning the event’s get-last-error status). The details don’t matter. The API would still have a get-last-error vibe, because if there were multiple kernel launches or CUDA API calls before the event, the same ambiguity would still exist as to which operation caused the error to be signaled. But the granularity would be under control of the application developer, and it would run as fast as CUDA hardware can process events.

cudaGetLastError(): Just Say No

In the meantime, my general take on cudaGetLastError() is that since well-written CUDA code does not have to use it, well-written CUDA should not use it.

The Utility of Futility

October 16, 2023Uncategorizedadmin

Fatalism Can Be Useful Sometimes

Introduction

A wise man once told me: “Scientists build in order to learn. Engineers learn in order to build.” One of the most exciting experiences for an engineer is realizing that new discoveries or technologies enable new possibilities, but part of an engineer’s job also is to save work by eliminating possibilities. For example, it sometimes comes in handy to know that nothing can travel faster than the speed of light in a vacuum.

I call this principle “The Utility of Futility,” and it is an ethos that more software engineers would do well to embrace. As a species, software engineers are incorrigible optimists, and we work in a profession where 90% solutions can be deceptively easy to develop. These characteristics may lead us to overlook opportunities to exploit the utility of futility. But before exploring some costly mistakes that could have been avoided by embracing this ethos, let’s review some related areas where system design was informed by a recognition of what’s not possible.

Speed Bumps

Our first story begins with a (metaphorical) speed bump.

Work on CUDA began in earnest in early 2005, and when I joined the team, the driver consisted of a few hundred lines of code. Most of that code implemented a handle allocator that would ‘allocate’ fixed-length memory buffers (e.g., context structures) that then could be referenced through integer handles. Internally, the driver then would translate these handles into pointers through pointer arithmetic on the buffer from which the index had been allocated.

Early on, we replaced these integer handles with so-called “opaque pointers,” replacing e.g.

typedef unsigned int CUcontext;

with:

typedef CUctx *CUcontext;

Note that the typedef does not declare the context structure – it declares a pointer to same. C/C++ clients of CUDA know that there is a context structure, but they do not know what it contains.

When this minor refactoring was done, it caused a stir and prompted an internal discussion on our development team. To some, the opaque handles seemed more secure, because the driver had to run some code to translate them into pointers to the driver’s internal structures. And developers who reverse-engineered the structure’s layout and then took advantage by, say, hard-coding offsets from the structure pointer into their own applications, would be risking their own applications’ continued functionality. Such code definitely breaks compatibility with future versions of CUDA, among other things. Why not put a little speed bump in the path of such developers?

Applying the utility of futility, we decided that the benefits of the speed bump were outweighed by the additional complexity needed to implement a fixed-length allocator and handle validation. Any developers determined to reverse-engineer the layout of the CUDA context structure would be able to do so.

One can consider this particular “utility of futility” story to occupy a gray area. Most do. Intel is famed for backward compatibility, but every time they ship a new instruction set, they break compatibility in a subtle way: the new instructions—previously-invalid opcodes—execute and have architectural side effects, instead of signaling invalid-opcode exceptions! But any developer who ships software that relies on that behavior would elicit little sympathy: about the same amount as any developer who reverse-engineered the layout of an internal structure in CUDA.

The Leaking Nanny

Software that runs in data centers must be robust. If a server running hundreds of virtual machines loses power, it must resume running them after regaining power, having lost as little work as possible. Such robustness involves heroics like journaling disk traffic to solid state drives that have enough capacitance built into their power supplies to post all their pending writes in the event of a power failure. The hypervisor software has to be able to restart all those virtual machines, in as close a state as possible to whatever they were doing before the server lost power.

I once worked at a utility computing vendor that ran a management process whose memory usage would steadily increase, and no one could figure out why. There was a memory leak somewhere in the code, and the code was big and complicated and written in a garbage-collecting language that made it difficult to diagnose such issues. Eventually, the excess memory usage caused the server to fail.

The “utility of futility” solution to this problem: instead of fixing the memory leak, the vendor simply created a watchdog that monitored this process’s memory usage and, when it became too much, killed the process. Remember, this process had been built to be robust in the face of power failure, so getting summarily executed by a peer process is also a recoverable event.

If the stopgap measure is using a feature that the system must deliver in any case, it needn’t stay a stopgap.

Exit On Malloc Failure

QEMU, the hardware emulator that enables virtualization for HVM guests on Xen, features an interesting engineering compromise: its internal memory allocator, the equivalent to malloc(), exits on failure. As a result, code reviews that check the return value from this function are rejected – the function either succeeds, or does not return at all because the whole process exited. The reason: gracefully handling out-of-memory situations introduces too much possibility for (presumably rare) and difficult-to-diagnose error, and therefore security risk. Since QEMU instances and their clients have to be robust in the same ways as the preceding system (e.g. recover machines in their latest-known states before the adverse event such as power failure), the “utility of futility” favored having malloc() exit rather than doing a prohibitively expensive and error-prone security analysis.

Memory Probes

As a young software engineer who cut his teeth on microcomputers with no memory protection whatsoever (IBM PCs running MS-DOS, Macs running the original MacOS), I was excited to work on Windows, a platform with some semblance of memory protection (don’t laugh – before Apple bought NeXT for its UNIX-like operating system, the MacOS was not any more secure than MS-DOS). Running your program under a debugger, invalid memory references in your program were flagged immediately.

But the Windows API had something more: functions IsBadReadPtr() and IsBadWritePtr() that could check the validity of a memory location. As a budding API designer, this function seemed like the perfect opportunity to elevate my parameter validation game: if my caller passed an invalid memory range to one of my functions, I could return an error rather than just having the program crash in my function.

The problem with this API is that even 16-bit Windows was a multitasking operating system. Memory could become invalid as a consequence of code running elsewhere in the system. Even if you wanted to build a function that “validated” a memory range before using it, one could contrive a scenario – say, a context switch at an inopportune time – where the memory was subsequently invalidated by other code running in the system. If Microsoft had recognized the utility of this futility, they would not have built this API, since all it did was give a false sense of security.

Note: In Windows NT, the structured exception handling (SEH) feature did, in fact, enable robust memory validation. Using SEH, the memory’s validity is evaluated at the time it is referenced, and not before, as when using a memory probe API. But in the intervening years, a consensus has developed among API designers is that the costs of such memory validation outweighs the benefits. It is left as an exercise for the student to determine whether APIs that crash when you pass in NULL parameters are a manifestation of the utility of futility!

Microsoft WDDM: A Missed Opportunity

This utility of futility story is going to require more background. A lot more background.

One of the signature achievements for graphics in Windows Vista (c. 2007) was to move most of the graphics software stack into user mode. Or I should say, back into user mode. For Windows NT 4.0 (c. 1996), Microsoft had moved GDI (the Graphics Device Interface) into kernel mode, reportedly after Michael Abrash buttonholed Bill Gates at a party (never a robust process for decision-making), and presumably because Abrash wanted to continue writing optimized software rasterizers and believed overall performance would be better if they were running in kernel mode. Wrong on all counts, and fodder for another blog to be written someday.

By the time Windows XP shipped (c. 2001), it was abundantly clear that moving GDI into kernel mode had been a catastrophic mistake, because the entirety of display drivers got moved into kernel mode with it, and by the time Windows XP shipped, graphics drivers included things like pixel shader compilers that were waaay too unstable and used waaaay too much stack to run in kernel mode. (In Windows NT, kernel stacks are limited to 12K on x86, or 24K on x86-64 – big enough for reasonable kernel applications, but not for things like shader compilers that should not run in kernel mode at all.) In fact, when I ported NVIDIA’s graphics driver to x86-64, one of the things I had to do to get the compiler running was spawn another thread and delegate onto it just to buy another kernel stack. Fortunately, the shader compiler didn’t seem to need more than 2 kernel stacks, or I would’ve been tempted to build a kernel stack usage nanny that spawned threads on an as-needed basis just to emulate a larger stack!

By the time Windows XP shipped, there was widespread consensus that most of the graphics driver had to be moved back to user mode. But at Microsoft, the kernel team still harbored a deep distrust of graphics hardware vendors, fueled by a mid-1990s era of incredibly buggy hardware, operated by poorly written drivers that had to try to work around the buggy hardware. Back then, dozens of hardware vendors had been competing for OEMs’ business, and schedule slips could be fatal; as a result, many hardware bugs were left in, as long as they could be papered over by driver work (provided those driver workarounds did not have too much performance impact). The bulk of this activity was occurring on Microsoft’s Windows 95 platform, which was built on a completely separate code base from Windows NT. Cries from the NT kernel team, who wanted robust hardware and drivers, went unheard by hardware developers who were more concerned about their companies’ continued existence. The number of OEMs was daunting as well as the number of graphics IHVs: graphics companies such as S3, ATI, Cirrus Logic, Tseng Labs, Matrox, Chips and Technologies, Oak Technology, Number Nine, and Trident were selling to OEMs such as Acer, AST, Compaq, Dell, Gateway 2000, HP, IBM, NCR, NEC, and Packard Bell. Both of these businesses were competitive, with new entrants funded either by large companies seeking to diversify, or startups seeking to parlay niche expertise into market share. Consumer electronics titans Samsung and Sharp entered the PC business, for example, while startups like 3Dfx, 3Dlabs, Rendition, and NVIDIA entered the graphics chip business.

Suffice to say that in that competitive environment, graphics chip companies were in no mood to slip schedule just to make their hardware more robust for a workstation platform whose unit sales were a fraction of the consumer platform – even if the workstation platform represented the future.

By the early 2000s, the competitive landscape had shifted, for graphics IHVs at least. Companies that couldn’t deliver competitive performance or features were acquired, went out of business, or were relegated to the margins. Consolidation eventually reduced the major players to a handful: Intel, NVIDIA and ATI accounted for most unit sales. These companies all had the wherewithal to build robust hardware and drivers, but between ongoing fierce competition and vastly more complicated hardware, the vendors did little to earn back the trust of the Windows NT kernel team after losing it in the 1990s.

To understand the landscape, it’s important also to understand the organizational tension between the NT kernel team and the multimedia team that owned the Windows graphics stack. Much of the NT kernel team, led by the brilliant and cantankerous operating system architect Dave Cutler, had been recruited to Microsoft from Digital Equipment Corporation in 1987; in contrast, the multimedia team that owned the 3D graphics drivers had developed in the Windows 95 organization and been reorganized into the NT organization in 1997. So, as the multimedia team redesigned the graphics stack to move most code back into user mode, they were doing so under the watchful eye of skeptical kernel architects who did not particularly trust them or the vendors whose capabilities were being exposed by the multimedia team.[1]

Even the hardware interfaces had changed so much that they would’ve been unrecognizable to graphics chip architects of the mid-1990s. Instead of submitting work to the hardware by writing to memory-mapped registers (MMIO), the drivers allocated memory that could be read directly by the graphics chips (via direct memory access or DMA), filled those buffers with hardware commands, then dispatched that work to the graphics chips[2]. Given that the NT architecture required that hardware be accessed only from kernel mode, management of these “command buffers” presented a challenge to the multimedia team. For performance and platform security, the bulk of the code to construct these command buffers had to run in user mode; but in keeping with the NT architecture, the command buffers could only be dispatched from kernel mode.

To avoid extra copying, Microsoft designed the system so hardware-specific commands to be written directly into these buffers by the user mode driver, since one vendor’s idea of a “draw triangle” command may differ from that of another. These commands would be queued up until the command buffer was full, or had to be submitted for some other reason; the system them would do a “kernel thunk” (transition from user to kernel mode), where the kernel mode driver would validate the buffer before submitting it to the hardware.

For those familiar with the NT architecture, the flaw in this design should be obvious, and is somewhat related to the preceding memory probe “utility of futility” story: since Windows is a multitasking API, the buffer can be corrupted during validation by the kernel mode driver. No amount of validation by the kernel mode driver can prevent corruption by untrusted user mode code between when the kernel mode driver is done with the validation, and when the hardware reads and executes the commands in the buffer.

It is, frankly, incredible to me that this platform vulnerability was not identified before the WDDM design was closed. The NT kernel team may not like having to trust graphics hardware, but as long as buffers can be corrupted by user mode code before the hardware can to read it, the only way to build a robust platform is to have the hardware validate the commands.

Another way to protect from this race condition would be to unmap the buffer so user mode code wouldn’t be able to change it, but editing the page tables and propagating news of the newly-edited page tables (“TLB invalidations”) would be too costly.

Conclusion

As you explore design spaces in software architecture, if you can prove that an interface is making promises it is not in a position to keep, don’t be afraid to invoke the Utility of Futility to simplify the system.

[1] Talking with members of the NT kernel team was illuminating, because they had a different relationship with hardware than did the multimedia team; they literally had fixed the Pentium FDIV bug by trapping and emulating the FDIV (floating point divide) instruction. But since FDIV is infrequently executed, emulating it incurred a modest performance penalty that would go unnoticed by end users without the aid of measurement tools. Even if graphics hardware were designed to be trapped and emulated, like the CPU instruction set, trapping and emulating graphics functionality in the NT kernel would incur a large-enough performance penalty that the resulting product would be unusable.

[2] These are commands such as “move this 2D block of pixels from here to there,” or “draw the following set of triangles.” The WDDM (the Windows Display Driver Model) architecture had many other features, such as supporting multitasking so many applications could do 3D rendering into the Windows desktop concurrently, but those details are not relevant to this “utility of futility” discussion.

A Paean To Structure-Based Interfaces

August 16, 2023Uncategorizedadmin

When the time came for us to gather requirements and start designing CUDA 2.0 (c. 2008), I was surprised to hear from our product manager that the key Next Biggest Thing in CUDA was… drum roll… 3D textures[1]!

The reason this priority was surprising is because adding texturing in the first place had been somewhat controversial. Why have graphics features in a general purpose computing ecosystem[2]?

Early versions of CUDA hardware did not have caches to mitigate the overhead of uncoalesced memory traffic, though, so applications that really needed misaligned reads had no choice but to use the texturing hardware. As an added bonus, the cache resources were separate, so by reading inputs via the texture pipeline, you were enlisting hardware resources that otherwise would have gone unused.

But… 3D texturing? Okay. That would require us to implement extensive, backward-compatible revisions to all of the CUDA array APIs, the texturing APIs, and the memcpy APIs. CUDA 1.0 already had a plethora of memcpy functions, especially at the driver API level, where we’d opted to try to maintain type safety between CPU and GPU pointers.

Adding new functionality in a backward compatible way is a long-held tradition in API design. For example, flags words are a good way for interface designers to future-proof their designs; or in a pinch, replacement functions may be provided that enable new functionality. Since those functions did not exist when legacy applications were being developed, the interface designer knows they can only be called by newer application expecting the new behavior. Sometimes we get both, even in close quarters: the driver API function cuEventCreate() takes a flags word that later was used for innovations such as blocking synchronization, but whomever wrapped the driver API in the CUDA Runtime omitted it from cudaEventCreate(), so NVIDIA later added the replacement interface with the awkward name cudaEventCreateWithFlags()!

The reason API designers have to be careful with this type of thing is because applications can rely on existing behaviors, a lesson that most interface designers learn through hard experience. Mine was acquired at Microsoft in the multimedia division…

Aside: How To Compatible

When I was Direct3D development lead in 1998 or so, we worked with Intel to enable a new software rasterizer that benefited from the new MMX instruction set. The reason we had to expose it as a new rasterizer, instead of simply replacing the incumbent RGB rasterizer with a faster one, was because applications would subtly break when using the new rasterizer, and no matter how hard Intel tried, they couldn’t just replace the old, slower, serial RGB rasterizer with a new, faster one. So, we added the MMX rasterizer as an option; that way, we had both the new MMX rasterizer and backward compatibility (old applications would continue to work). The only downside was that newer applications had to specifically ask for it, but Intel was okay with that, as a measure of accountability for not being able to develop a drop-in replacement for the RGB rasterizer. Such compatibility measures are called “opt-ins” because the application is opting into the new behavior: if a legacy application were ported to the newest interfaces without any changes, it still would get the old RGB rasterizer. A change to the source code, not just a recompile, was needed to get the new MMX rasterizer.

But when we simply started enumerating the MMX rasterizer to applications, we found a lot of compatibility breaks. Many applications ran much more slowly, and others crashed or otherwise didn’t work at all. When we debugged the problem, we found that we had to add a few more levels of opt-in to keep legacy applications working. You see, Direct3D always had reported 2 or 3 rasterizers: the “Ramp” rasterizer that ran quickly; the “RGB” rasterizer that was more functional, but slower; and, if hardware was available, the “HAL” rasterizer made hardware-acceleration available[3]. But by adding the MMX rasterizer, we’d increased the number of rasterizers in the API from 3 to 4 if hardware acceleration was available; and several different compatibility breaks became possible:

Some applications had statically declared a 3-element array of rasterizer handles that was now being overflowed by our 4-element response, causing the application to crash;
Some applications were assuming that if 3 rasterizers were enumerated, then hardware must be available;
Some applications were further assuming that if 3 rasterizers were enumerated, then the third must be the HAL[4].

The fix was to make enumeration of the MMX rasterizer an implicit opt-in if the application was accessing Direct3D using a new interface. We also wrote better sample code – that was where the statically declared arrays had come from[5].

After being burned a few times by promising that existing applications will continue to work on new operating systems (or new versions of Direct3D) and delivering disappointing results, interface designers start to learn tactics to make everyone’s jobs easier[6], not just their own, but also the lives of developers who are using their interfaces. With that in mind, let’s take a close look at the 3D memcpy APIs that were added in CUDA 2.0.

An API Design Headache

Set aside the complexities of 3D texturing for a moment and focus on the difficulties presented by just copying the memory from here to there. For starters, CUDA has three (3) different memory types: host memory[7], device memory, and CUDA arrays. Since CUDA arrays come in 2D and 3D variants, we also needed to support 1D, 2D and 3D memory copies. And because CUDA arrays’ layout is opaque to developers, the memory copy interface must support offset values (X, Y and Z).

Designing a set of distinct memory copy functions that covered the full Cartesian product of possibilities would have been an API design and development nightmare, accompanied by a developer education headache[8]. Between 3 source memory types, 3 destination memory types, and 3 dimensionalities for the memcpy’s, twenty-seven (27) memory copy functions would have had to be designed, implemented, and documented. Just coming up with a heuristic for naming would be a chore, reminiscent of the naming problem confronted by the designers of SIMD intrinsics.

For example, a function to copy a 2D slice of device memory into a 3D CUDA array might look something like this:

CUresult cuMemcpy2D_2DDto3DA( CUarray dstArray, size_t offsetX, size_t offsetY, size_t offsetZ, CUdeviceptr src, size_t srcPitch, size_t WidthInBytes, size_t Height );

The offset parameters are needed for CUDA arrays because the layout is opaque to developers. In this interface design, callers who are copying from a subset of a 2D array in device memory are expected to do the address arithmetic to specify the correct base pointer (for type T, this could be computed with an expression such as ((int8_t *) base)+srcY*srcPitch+srcX*sizeof(T) – perfectly intuitive)[9].

An alternative design would be a single entry point, with parameters designed to be ignored if not relevant (say, the height parameter if a 1D memory copy is being requested), would be more tractable. Such a function might look like this:

CUresult cuMemcpy3D( void *dstHost, CUdeviceptr *dstDevice, size_t dstPitch, CUarray dstArray, size_t dstXInBytes, size_t dstY, size_t dstZ, const void *srcHost, size_t const CUdeviceptr dstDevice, size_t srcPitch, CUarray srcArray, size_t srcXInBytes, size_t srcY, size_t srcZ,size_t WidthInBytes, size_t Height, size_t Depth );

This function has a dizzying number of parameters (17, in fact), but it does cover all of the use cases. The memory types of the source and destination would be inferred from whichever of srcHost/srcDevice/srcArray and dstHost/dstDevice/dstArray was non-NULL. (If more than one are specified, the function should fail.) The offset parameters are needed for CUDA arrays, since the layout is opaque to developers; the API designer then would have to decide whether to respect them if they are non-zero for host and device memory, or specify that they are ignored unless the participating memory type is a CUDA array. Finally, specifying a height of 0 and/or a depth of 0 naturally describes a 1D or 2D memory copy.

The problem with this function is that a single call to it occupies at least five lines of code to specify the 17 parameters. This interface is cumbersome at best! But, a better alternative to this mega-function is available, and that’s what we wound up shipping in CUDA: all of the parameters are collected together into a single structure that enabled the API client to separately describe the source and destination memory ranges, along with the location and size of the copy itself. The resulting structure had about two dozen members:

typedef struct CUDA_MEMCPY3D_st {
    size_t srcXInBytes;         /**< Source X in bytes */
    size_t srcY;                /**< Source Y */
    size_t srcZ;                /**< Source Z */
    size_t srcLOD;              /**< Source LOD */
    CUmemorytype srcMemoryType; /**< Source memory type (host, device, array) */
    const void *srcHost;        /**< Source host pointer */
    CUdeviceptr srcDevice;      /**< Source device pointer */
    CUarray srcArray;           /**< Source array reference */
    void *reserved0;            /**< Must be NULL */
    size_t srcPitch;            /**< Source pitch (ignored when src is array) */
    size_t srcHeight;           /**< Source height (ignored when src is array; may be 0 if Depth==1) */

    size_t dstXInBytes;         /**< Destination X in bytes */
    size_t dstY;                /**< Destination Y */
    size_t dstZ;                /**< Destination Z */
    size_t dstLOD;              /**< Destination LOD */
    CUmemorytype dstMemoryType; /**< Destination memory type (host, device, array) */
    void *dstHost;              /**< Destination host pointer */
    CUdeviceptr dstDevice;      /**< Destination device pointer */
    CUarray dstArray;           /**< Destination array reference */
    void *reserved1;            /**< Must be NULL */
    size_t dstPitch;            /**< Destination pitch (ignored when dst is array) */
    size_t dstHeight;           /**< Destination height (ignored when dst is array; may be 0 if Depth==1) */

    size_t WidthInBytes;        /**< Width of 3D memory copy in bytes */
    size_t Height;              /**< Height of 3D memory copy */
    size_t Depth;               /**< Depth of 3D memory copy */
} CUDA_MEMCPY3D_v2;

To call this API, typically the developer declares one of these structures on the stack, then passes the structure to the API itself[10]:

CUresult cuMemcpy3D( const CUDA_MEMCPY3D *cp );

Soon after I checked in the first draft of this API, I got an irate email from a fellow NVIDIAn who thought it was way too verbose. The exact wording is lost in the mists of time, but I remember a lot of heat! What he didn’t understand from a superficial reading of the API was that calling this function would only be about as complicated as the operation being requested.

For example, a simple 1D memcpy from host to device memory may be implemented as follows:

CUresult
cuMemcpyHtoD_via3D( CUdeviceptr dst, const void *src, size_t bytes )
{
    CUDA_MEMCPY3D cp = {0};
    cp.dstMemoryType = CUDA_MEMORYTYPE_DEVICE;
    cp.srcMemoryType = CUDA_MEMORYTYPE_HOST;
    cp.dstDevice = dst;
    cp.srcHost = src;
    cp.WidthInBytes = bytes;
    return cuMemcpy3D( &cp );
}

The declaration of cp zero-initializes the structure using a C programming construct that was valid in the 1970s, when the First Edition of the White Bible came out. As long as the interface has defined reasonable defaults to be zero, (such as the source and device pointers and offsets), callers can use this idiom to write compact, intuitive code.

A more complicated memory copy, like the 2D memcpy function we declared earlier, could be implemented as follows:

CUresult cuMemcpy2D_2DDto3DA( CUarray dstArray, size_t offsetX, size_t offsetY, size_t offsetZ, CUdeviceptr src, size_t srcPitch, size_t WidthInBytes, size_t Height )
{
    CUDA_MEMCPY3D cp = {0};
    cp.dstMemoryType = CU_MEMORYTYPE_ARRAY;
    cp.dstArray = dstArray;
    cp.dstXInBytes = offsetX;
    cp.dstY = offsetY;
    cp.dstZ = offsetZ;
    cp.srcMemoryType = CU_MEMORYTYPE_DEVICE;
    cp.srcDevice = src;
    cp.srcPitch = srcPitch;
    cp.WidthInBytes = WidthInBytes;
    cp.Height = Height;
    return cuMemcpy3D( &cp );
}

The structure-based interface delivers the best of both worlds: the caller can express any valid operation, but the function invocation is only as complicated as the operation requested! The main downside is that when implementing such an interface, parameter validation can be Byzantine in complexity. As a result, function call latency can become a concern, as small operations may spend more time validating the call than performing the requested operation. NVIDIA resolved this issue with CUDA graphs, which can record operations for later playback and enables parameter validation to be done at the recording stage.

With benefit of hindsight, there is one change to the interface that I wish I had thought to make before we shipped. When I designed the interface, I knew that assigning zero to the structure was valid K&R C from the 1970s – handy when zero is a good default value for most parameters! But if I’d eaten my own dogfood just a bit more, I might have thought to put the memory types at the beginning of the structure. They always are needed – you can’t specify a memory copy without specifying the types of the memory participating – and the developer education would’ve been even more intuitive, with the destination and source memory types specified at the site of the structure declaration. This idiom takes advantage of another subtlety of K&R C, namely that if only the first few members of a structure are initialized, the remainder of the structure is zero-initialized. Here’s a rewrite of our 2D-device-to-3D-array memcpy function that does just that:

CUresult cuMemcpy2D_2DDto3DA( CUarray dstArray, size_t offsetX, size_t offsetY, size_t offsetZ, CUdeviceptr src, size_t srcPitch, size_t WidthInBytes, size_t Height )
{
  CUDA_MEMCPY3D cp = {CU_MEMORYTYPE_ARRAY, CU_MEMORYTYPE_DEVICE};
  cp.dstArray = dstArray;
  cp.dstXInBytes = offsetX;
  cp.dstY = offsetY;
  cp.dstZ = offsetZ;
  cp.srcDevice = src;
  cp.srcPitch = srcPitch;
  cp.WidthInBytes = WidthInBytes;
  cp.Height = Height;
  return cuMemcpy3D( &cp );
}

Named Parameters

C++20 gives us an even more intuitive way to access such interfaces, first by enabling the parameter structure to be declared inline with the function invocation (not always what the developer wants, of course), but also by enabling the structure members to be named as they are initialized. In C++20, echoing the C++ requirement that constructors of embedded members be invoked in the order of appearance of said members, the named structure members must be specified in order of appearance. The resulting wrapper function isn’t much more intuitive-looking than the 1970s-era K&R edition, but your mileage may vary.

CUresult cuMemcpy2D_2DDto3DA( CUarray dstArray, size_t offsetX, size_t offsetY, size_t offsetZ, CUdeviceptr src, size_t srcPitch, size_t WidthInBytes, size_t Height )
{
  CUDA_MEMCPY3D cp = {
    .dstMemoryType = CU_MEMORYTYPE_ARRAY,
    .dstArray = dstArray,
    .dstXInBytes = offsetX,
    .dstY = offsetY,
    .dstZ = offsetZ,
    .srcMemoryType = CU_MEMORYTYPE_DEVICE,
    .srcDevice = src,
    .srcPitch = srcPitch,
    .WidthInBytes = WidthInBytes,
    .Height = Height};
  return cuMemcpy3D( &cp );
}

Next: we’ll take a look at applying this API design methodology to BLAS.

[1] For reference, CUDA 1.1 added streams and events, which have proven to be useful and future-proof over the last 15 years.

[2] They must not have minded that much, because cubemaps and mipmapped textures were added sometime after I left NVIDIA. Maybe OptiX needed them.

[3] With benefit of hindsight, it seems ridiculous, but in the mid-1990s some game developers wanted to keep rasterization on the CPU, where they could retain complete control. They prided themselves on their ability to write optimized software renderers, and some were loath to get rid of their most valuable IP. Reportedly it took an in-person visit by 3Dfx to convince John Carmack that the novel visibility methods he’d developed for the Quake engine could be combined with hardware-accelerated rasterization to make an even better product.

[4] “Hardware abstraction layer,” a term that I believe originated with Dave Cutler’s original Windows NT kernel, whose HAL enabled the NT kernel to be portable across CPU architectures. Cutler’s HAL ran in kernel mode and abstracted operations like mapping page tables, and ours ran in user mode and abstracted operations like drawing triangles.

[5] We had shipped several version of DirectX before we realized that sample code should be written by the best developers available, since that is the code that developers copy and paste to build their own applications.

[6] Interface designers must account for their clients’ needs, or no one will use their APIs. The metric is ROI: the interface must deliver a return on the investment that justifies the cost of learning the interface. The first version of Direct3D that used “execute buffers” failed in this mission – it was too difficult to use, and the promised benefits of execute buffers were never delivered – and almost caused the API to be cancelled.

[7] The CUDA driver treats page-locked and pageable memory differently, but those differences generally are transparent to the API client.

[8] As it happens, the CUDA runtime did expose multiple entry points to cover some of the different use cases, for example, cudaMemcpy2D() and cudaMemcpy2DFromArray().

[9] When interface designers create work for their clients, you may hear that referred to as “reverse delegation.” Sometimes it is the best choice for an interface, especially if different clients may want to do the operation differently. In the case of computing offsets into 2D linear arrays in device memory, we decided to take offset parameters and do the pointer math on behalf of our client.

[10] Another decision we had to make for this function was whether to embed the CUstream parameter for asynchronous variations. For orthogonality with the existing APIs for asynchronous memcpy, I kept both the functions and the stream parameter in separate functions, cuMemcpy3D() and cuMemcpy3DAsync().

Standardized Or Proprietary? Direct3D and OpenGL Shed Light On The Tradeoffs

May 13, 2021Uncategorizedadmin

As the industry absorbs the implications of the Google v. Oracle ruling, which essentially put proprietary APIs on the same footing as standardized ones, understanding the differences between standardized and proprietary APIs will become more important. The differences between closed and open source software have been well-plumbed; the differences between proprietary and standardized APIs, less so.

I’m a veteran of two API competitions that shed light on the tensions between the philosophies of proprietary and standardized APIs and the way they intersect with market realities: OpenGL/Direct3D and CUDA/OpenCL[1]. I’ve always been on the proprietary side of these competitions: Direct3D is proprietary to Microsoft Windows, and CUDA is proprietary to NVIDIA.

This blog entry will discuss the OpenGL/Direct3D history from my (admittedly incurably biased) standpoint as the former development lead for Direct3D.

The story does not begin with Direct3D, but OpenGL: In 1994, when I hired into Microsoft, my first project was to port the flagship Softimage application from Silicon Graphics (SGI) workstations to Windows NT, a feat that was made possible because both platforms could run OpenGL. Both Microsoft and SGI had been founding members of the OpenGL Architectural Review Board. OpenGL treated the API design as a commons: hardware suppliers collaborated on a standard, recognizing the value of a standardized API for developers to write their applications, then competed for market share on price, performance, and feature set. Since SGI had achieved early market success in selling workstations that were specifically designed to run graphical applications, the decision to work with its competitors to develop OpenGL was, essentially, altruistic[2].

At Microsoft, we were surprised at SGI’s decision. SGI had a market-leading position in the graphics workstation business, and applications written for their workstations used a proprietary API called IRIS GL. IRIS GL was not particularly well-designed, but IRIS GL applications were heavily dependent on the API. From our porting work on Softimage, we knew it would have been prohibitively expensive to modify the Softimage application to run on a different API; in fact, our porting strategy involved wrapping IRIS GL in an emulator that made the Softimage application believe, for all intents and purposes, that it was running on an SGI workstation[3].

Windows itself was a set of proprietary APIs that Microsoft was evolving quickly, to address new markets like workstations and servers. In 1992, the first year that Microsoft’s new Windows NT operating system shipped, Microsoft had become a founding member of the OpenGL ARB because they had an eye on the graphics workstation market. Although proprietary to Microsoft, Windows NT had the advantage of running on workstations from many different vendors. Unlike traditional workstation vendors, who built complete systems from hardware to operating system software to APIs, Microsoft continued the model it had perfected in the PC market, partnering with hardware vendors to build computers and licensing Windows to run on them. As a result, Microsoft was able to offer a more diverse set of products[4], from inexpensive, lower-performance workstations to high-end workstations with multiple CPUs, with the added benefit that the different hardware vendors also were competing on price. When Microsoft bought Softimage, the software cost $30,000 and ran on an $50,000 workstation. But when Softimage for Windows NT launched, Intergraph bundled the software and its high-end, dual-processor workstation hardware together for $25,000. SGI was no match for the PC workstation industry; its market capitalization collapsed from $7B in 1995 to $120M in 2005, when its stock was delisted[5].

For Microsoft, in the mid-1990s, OpenGL presented both an opportunity in the workstation business and a challenge in the gaming industry. The video card market for Windows computers was in rapid flux, as companies like S3, ATI, Trident, and others competed to deliver higher-performance Windows graphics cards that implemented operations like line- and rectangle-drawing. Furthermore, dozens of startups (such as 3Dfx and Rendition) were building hardware to address not only that market, but also 3D rendering – but unlike the workstation vendors who had co-founded the OpenGL ARB in 1992, these companies were focusing on the PC gaming business.

Gaming on Windows needed especial attention. A program manager at Microsoft named Eric Engstrom identified games (such as DOOM, id Software’s then-biggest commercial success) as the principal reason that Windows customers still needed to run DOS applications. Since continuing to support DOS was very difficult and presented an ongoing support burden to Microsoft, Eric got significant funding to develop a family of proprietary Windows APIs for gaming called DirectX. DirectInput would control devices like mice and keyboards (which, for games, look very different than for office productivity applications); DirectSound would control the audio hardware; DirectDraw enabled developers to control display hardware; and Direct3D would work with DirectDraw to enable 3D rendering[6].

Microsoft’s proprietary Direct3D, which only worked on Windows, therefore competed with Microsoft’s standardized OpenGL, which was portable to platforms other than Windows. Since the DirectX team reported through Windows 95 Vice President Brad Silverberg and the OpenGL team reported through Windows NT Vice President Jim Allchin[7], this set the stage for a political battle within Microsoft that would strain relationships, undermine Microsoft’s public messaging, and change the course of many developers’ careers, including mine. As the inevitable restructuring of Windows 95 assets into the Windows NT organization unfolded, Microsoft’s OpenGL team began openly lobbying our common management structure to discontinue Direct3D development and adopt OpenGL as the standard API. The Direct3D team, for which I was the development lead, was vulnerable to this line of argument, not only because we were wracked by reorganization issues (Eric Engstrom, anticipating that there would never be a home for him in the Windows NT organization, went to work on a browser-related project called Chrome and took most of the DirectX team with him), but because we were busy rehabilitating our API to be more user-friendly. It did not help that prominent figures in the game development community, like John Carmack and Chris Hecker, joined in public calls for Microsoft to drop Direct3D and instead use OpenGL.

Over the course of the next few months, this debate with my own Microsoft colleagues prompted a deep interrogation of the tradeoffs between proprietary and standardized APIs. OpenGL was a more palatable API, having emerged from a collaborative design effort involving multiple companies; but it seemed to me that the design deficiencies in Direct3D could be addressed simply by improving on its design. Furthermore, the recently completed Softimage port had required intensive use of OpenGL, so I had an excellent understanding of my competition.

Most of the differences were superficial. OpenGL was chattier (i.e., required more API calls to accomplish the same operation) and more stateful (i.e., kept more state in its context from which API methods could make inferences). OpenGL was more strongly-typed, meaning the elements of the state vector for a drawing context had explicit types and methods that operated on those types. Direct3D used the Common Object Model (COM), while OpenGL used a more familiar handle-based API. These superficial differences were a simple by-product of the environments where the APIs were developed.

But the APIs also included more substantive differences underpinned by real software engineering and business tradeoffs, like the way OpenGL required each hardware developer to implement the entire API[8], and emulate any functionality missing from the hardware. Direct3D, in contrast, provisionally allowed partial hardware support and did not require emulation of missing functionality[9]. OpenGL also surfaced fewer implementation details to its clients, with the idea that hardware vendors would be better able to optimize the application if the developer delegated as many tasks as possible – an explicit rejection of the “Direct” in “DirectX.” Finally, OpenGL “extensions” allowed any OpenGL implementor to define new interfaces; in turn, developers could query for these extensions and use them when available[10].

Many of these differences were informed by the differing objectives of the sponsoring organizations. Hardware vendors, who collaborated on OpenGL’s design, wanted the maximum degree of control over the implementation so they could differentiate their products through software as well as hardware. Microsoft just wanted to build a stable platform with as many hardware vendors as possible.

My Direct3D code base had structural advantages from a software engineering standpoint. A key difference between Direct3D and OpenGL was the amount of common code being executed by their client applications. Because OpenGL implementors were responsible for the entire API, there was a missed opportunity to consolidate the engineering and test resources to write and validate code that any OpenGL implementor would have to build, like memory allocators. In contrast, Direct3D defined an internal interface called the Hardware Abstraction Layer (HAL) that hardware vendors could code against. For Microsoft, the goal was to enable any hardware vendor to enter the Direct3D market by writing only code for their specific hardware, not an entire OpenGL implementation. In turn, a great deal of code written by Microsoft could be interposed between the application and the HAL, resulting in less variability across hardware from different vendors[11].

Also, with the HAL, Microsoft could ship new API features without the hardware vendors doing any additional work – in fact, that is exactly what we did with the first release of the DrawPrimitive API. When we first shipped DrawPrimitive for Christmas 1997 games, it was able to run on dozens of millions of PCs that already had sold into the PC marketplace.

The amount of competition, and the pace of innovation in this space, cannot be overstated. The capabilities of graphics chips were expanding so quickly that, as API designers, we had trouble keeping pace. Between 1997 and 2001, Microsoft released four major versions of Direct3D as transistor budgets exploded from 130,000 (for the best-selling S3 ViRGE chipset, c. 1997) to some 57 million (for NVIDIA’s GeForce 3 chip, c. 2001)[12]. Direct3D 5.0 introduced the DrawPrimitive API; Direct3D 6.0 introduced support for multiple textures, optimized geometry processing (written by Intel and AMD for their new instruction sets SSE and 3DNow!, respectively), automatic texture management, stencil buffers, texture compression, and numerous other features; Direct3D 7.0 introduced geometry acceleration; and Direct3D 8.0 introduced vertex and pixel shaders. Over the four-year span, the hardware had transitioned from simple 2.5-D triangle rasterizers attached to VRAM, to full-blown parallel processors. I did not, and do not, believe that the OpenGL ARB’s deliberative, slow-paced process could keep up with the pace of innovation we were seeing in the PC graphics space.

There were attempts to reconcile the two warring camps. In 1998, Jay Torborg began the Fahrenheit project, partnering with SGI and HP to create yet another graphics API[13]. The stated objective for Fahrenheit was to deliver replacement APIs for both the low-level “immediate mode” Direct3D and OpenGL APIs and the higher-level scene graph API. But the deal took months to negotiate – valuable time that the DirectX team spent working on their code – and suffered a setback because Microsoft’s Legal Department had not been included in the negotiations. Once the terms had been finalized and Jay brought the agreement to Microsoft Legal, he was told it could not be approved because it was anticompetitive. The agreement he’d hammered out with SGI held that Microsoft would only target game applications with Direct3D and leave workstation applications to OpenGL, an anticompetitive practice known in antitrust law as “dividing the market.” At the time, the United States v. Microsoft antitrust litigation was in full swing, so the Legal Department was exceptionally sensitive to the prospect of inviting further antitrust liability.

Over those four years (1998-2001), industry consolidation reduced the number of players to a handful. By the mid-aughts, NVIDIA and ATI had become a de facto duopoly in the discrete graphics chip market (chips designed to be installed on graphics cards that get plugged into the bus), and, after AMD bought ATI in 2006, Intel and AMD quickly become a de facto duopoly in the integrated graphics chip market (with graphics functionality on the same integrated circuit as the CPU). The hardware innovations also greatly reduced the amount of common code that Microsoft could interpose between the application and the driver. In 1998, it made sense for Microsoft’s runtime to execute the geometry pipeline on the three variants of x86 instruction set then available (x87, 3DNow!, SSE); by 2001, that functionality had been subsumed into hardware. With the addition of vertex and pixel shaders, Microsoft could deliver some value in applying compiler optimizations to the shaders that developers had written; but only the hardware vendors could translate that intermediate code into instructions for their graphics chips. So as the market consolidated, the relative sizes of the engineering investments by Microsoft and the graphics chip vendors shifted in favor of the latter.

One of the side effects of this market consolidation was that the barrier to entry for an OpenGL implementation stopped being prohibitive for the market participants. None of the players were tiny startups anymore; all of them could afford to develop formidable OpenGL implementations in addition to the Direct3D driver development they were doing for Windows. And, as it happens, the natural market divide between OpenGL and gaming applications – the one that Jay got in trouble with Microsoft Legal for trying to codify in the Fahrenheit partnership – was deep enough that both APIs thrived in their respective domains. Gaming developers tended to have more nimble code bases, and did not mind that DirectX only ran on Windows; developers of workstation applications valued OpenGL’s portability to non-Windows platforms, and did not necessarily require that the API formally support the latest hardware innovations. OpenGL hardware developers could use the extensions mechanism to add features, including standardized versions of DirectX features, before the ARB formalized support in a new version of the API. OpenGL application developers could accommodate the differences between their target vendors’ implementations.

Established in the early 2000s, the détente has lasted to the present day, with Direct3D the preferred API for games on Windows (and the Xbox, once it became available) and OpenGL the preferred API for workstation applications. Direct3D remains a proprietary API, wholly owned and managed by Microsoft for Microsoft platforms, and OpenGL remains a standardized API, periodically revised by its ARB, with an extensions mechanism, and implemented in its entirety by each vendor. One recent development of interest is that in 2018, Apple transitioned from standardized APIs (OpenGL and OpenCL for graphics and GPU computing, respectively) to its proprietary Metal API.

I personally had moved on to work on GPU computing technologies even before the OpenGL-Direct3D détente had become well-established. By the early 2000s, it was clear to some industry observers that GPUs were going to be an important computational resource. The transistor counts in NVIDIA’s GPUs had crossed over Intel’s transistor counts for CPUs in 1998 at 8M (the NVIDIA RIVA TNT and Intel Pentium 2, respectively) and never looked back[14]. A whole class of applications called GPGPU (“general-purpose computing on GPUs”) had been developed that involved writing OpenGL or Direct3D code that performed parallel computing tasks instead of rendering 3D scenes. The limitations of the earliest GPU hardware actually prompted some enterprising GPGPU developers to devise ways to perform integer computations with floating point hardware! After a stint in Microsoft Research, I left Microsoft for NVIDIA in 2002, having concluded that it would not be possible to build a GPU computing platform at Microsoft[15]. Finally, in early 2005, NVIDIA began work in earnest on CUDA.

[1] We could just as easily say CUDA/OpenACC, but for purposes of this discussion, we’ll refer only to one of the competitors to CUDA.

[2] While at Microsoft, my supervisor Jay Torborg, manager of the Talisman project and previously the co-founder of graphics startup Raster Technologies, claimed that he was the one to persuade Kurt Akeley, the SGI co-founder, to support OpenGL.

[3] It’s interesting to compare and contrast this porting strategy and the implementation strategy adopted by Google for the Java SE runtime that is the subject of the Google v. Oracle case. We modified Softimage to translate IRIS GL calls into OpenGL calls and, when necessary, native Windows API calls, to implement the IRIS GL calls. Google’s approach was designed to ensure that Java applications would not need to be modified so extensively in order to run on their new platform.

[4] In fact, Windows NT originally ran on different CPU architectures! Our Softimage port may have been the only commercial application shipped by Microsoft that ran on all three CPUs supported by Windows NT: Intel, MIPS and DEC Alpha. During the 18 months or so it took to port the application, Intel leapfrogged MIPS to go from the value play (lowest price, lowest performance) to the less expensive, second-highest-performance play. Alpha support was cancelled before Intel’s CPUs could overtake them in performance.

[5] Notably, Netscape co-founder Jim Clark, who was instrumental in getting the Justice Department to pick up the antitrust investigation into Microsoft that had been abandoned by the FTC, also was a co-founder of Silicon Graphics. As someone who had at least two lunches eaten by Microsoft, it would be an understatement to say that he had an axe to grind.

[6] The prefix “Direct” stems from Eric’s observation that game developers all wanted “direct control” over the hardware. They believed performance was the main predictor of market success, and they did not trust Microsoft or anyone else to get between them and the hardware. In fact, it was difficult to convince game developers to allow hardware to perform 3D rendering tasks on behalf of their applications.

[7] Believe it or not, the organizational divisions were deep enough that Windows NT had little to no code in common with Windows 95. One could argue Microsoft built a clean-room version of their own operating system! In fairness, when Dave Cutler and his team originally came to Microsoft in 1987, they intended to build a microkernel operating system that was agnostic in its support for Windows, OS/2, or UNIX. It was after a few years of development that Microsoft publicly split with IBM and put all its chips in the Windows NT box. See e.g. https://www.nytimes.com/1991/07/27/business/microsoft-widens-its-split-with-ibm-over-software.html

[8] The term of art for this API architecture is “installable client driver” (ICD). The host operating system provides only a minimal interface to its windowing system, where the pixels rendered by the OpenGL implementation would appear. Everything relating to OpenGL – not just hardware-specific code, but also potentially general-purpose code like memory allocators and math library functions – must be implemented in the ICD.

[9] Microsoft never would have allowed hardware developers to write their own software rasterizers – we would’ve written the emulation code ourselves. In fact, a stillborn OpenGL driver model from Microsoft called the Mini Client Driver (MCD) was the OpenGL team’s attempt to interpose a HAL between the API and its drivers, but it achieved limited adoption because drivers that used MCD were so much slower than ICDs.

[10] Since Direct3D was being revised so frequently, we considered an API revision to be the extensions mechanism. A recurring theme of “OpenGL v. Direct3D” conversations of the day was for OpenGL advocates to say that OpenGL had a feature if it was available on any OpenGL implementation, via the extensions mechanism; Direct3D advocates would counter that a feature didn’t count unless it was officially available in the flagship API.

[11] A sad truth about API implementation that may be the subject of a future blog: applications depend on the behavior of an API implementation, not its adherence to the API specification. Too often, that means fixing bugs breaks backward compatibility. It has taken decades for the software industry to learn how to build APIs that evolve seamlessly, while not breaking incumbent applications.

[12] While at Microsoft Research, I built an early GPU computing application that used Direct3D card to compute Hough transforms for line detection. That was when I realized GPU computing wasn’t going to happen at Microsoft – I’d have to go somewhere with both hardware and software expertise to do that. Somewhere like NVIDIA!

[13] It’s just now that I am connecting the dots that Jay also claimed to be the one who convinced Kurt Akeley to support OpenGL in the early 1990s.

[14] By 2001, Intel’s Pentium 4 “Willamette” processor had 42M transistors, but NVIDIA’s GeForce3 had about 60M. In 2013, Intel’s first Haswell processors had about 1.4B transistors to 3.5B transistors for NVIDIA’s Kepler chip. And that was not even NVIDIA’s “win” chip for the Kepler architecture! The GK110 processor (2014) had 8B transistors.

[15] Microsoft did not yet have the requisite hardware design expertise. With the hardware team behind the Xbox, that may have changed now.

Google v. Oracle: Vast Industrywide Implications For Years To Come

April 28, 2021Uncategorizedadmin

As a professional software developer, I am not accustomed to court cases, let alone Supreme Court cases, addressing the tradecraft of my profession; and when they do, I’m usually disappointed in the outcome. But this week, the Supreme Court issued a decision that directly addresses, not just computer programming in general, but API design, which I have worked on for decades, and as it happens, I agree with the Court’s ruling and look forward to the wide-reaching impact it will have on my profession[1].

This ruling will reverberate throughout techdom for many years to come: The Supreme Court ruled that the header files defining an interface are subject to fair use as a matter of law. (I am not an attorney, but apparently that means the Supreme Court held that a reasonable jury could not find in favor of the opposing party.)

Fair use is the carveout in copyright law that enables works to be excerpted without permission and without penalty, for purposes that serve a social good like satire or political commentary. The Supreme Court ruled that the declaring code for Java SE is subject to fair use. What these files do is describe what functionality may be requested of a body of software; they deliberately omit the details of how that functionality might be delivered, because even 30 years ago there was a recognition that it was important to be able to build software that could work together despite implementation details changing.

To put numbers to it, Google copied 11,500 lines of header files that described the Java SE runtime interface, then wrote about 2,800,000 lines of code to implement the underlying functionality in such a manner that code written to utilize the interface could run in a new context (the Android operating system). I personally have written several header files that achieved widespread use. While at Microsoft in the mid-1990s, I wrote the header files for Direct3D, which almost 25 years later is still the interface that developers use to access 3D rendering functionality on Windows[2]. While at NVIDIA, I wrote the header files for the “driver API” for CUDA, the technology that enabled NVIDIA to turn their graphics chips into general-purpose supercomputers. The headers for Direct3D and CUDA are directly analogous to the ones referenced by the Supreme Court decision: they deliberately elide implementation details, outlining important abstractions and the capabilities and restrictions that apply to how those abstractions interact, without disclosing how those capabilities are delivered.

Even Proprietary APIs Are A Commons

My first reaction to the ruling is that it takes a big step toward conflating proprietary APIs and APIs that are defined by standards bodies (“standardized APIs”). For purposes of this discussion, proprietary APIs are ones developed by singular corporate entities and protected by the usual patchwork of trademarks, copyright, and patents to protect the intellectual property.

An early example of a standardized API is OpenGL. In 1992, the OpenGL Architectural Review Board (ARB) was convened by key technical contributors from various 3D workstation companies. These companies were building specialized computer hardware to accelerate realistic, three-dimensional pictures. With such hardware, such pictures could be drawn so quickly (say, 30 frames per second) that the user could manipulate 3D objects interactively. This 3D rendering hardware enabled the development of applications like Softimage, which was used to animate the realistic dinosaurs in the movie Jurassic Park.

OpenGL treats the API design as a commons – participating vendors collaborate on a standard, recognizing the value of a standardized API for developers to write their applications – then compete for market share on price, performance, and feature set. The workflow followed by OpenGL ARB members is akin to that undertaken by Google to implement the Java SE runtime: each company is responsible for building its own full-scale OpenGL implementation, complete with emulation for features not yet implemented in the hardware. Vendors of OpenGL hardware consider their OpenGL software implementations (which, like Google’s implementation of the Java SE runtime, number in the millions of lines of code) to be at once a useful barrier to entry, and also a way to differentiate their products from the competition.

Adopting the Supreme Court ruling’s terminology of referring to the code that describes the interface as “declaring code” and the code that contains actual computer instructions as “implementing code,” the ruling seems to imply that any company capable of writing the “implementing code” for a given proprietary API may consider doing so with impunity. There is no question in my mind that both the CUDA headers and Microsoft’s Direct3D headers fall under the category of “declaring code” as defined by the Supreme Court ruling. Provided an implementation meets the criteria outlined in the ruling, it seems the Supreme Court has given the go-ahead for clean room implementations of any proprietary API.

[1] The ruling is not perfect. On page 2, it states that “Computer programs differ to some extent from many other copyrightable works in because computer programs always serve a functional purpose.” They couldn’t insert the word “usually” in there somewhere? I take strong exception to the idea that all of our work serves a functional purpose. What about The International Obfuscated C Code Contest? Code can be art, too!

[2] Students of multimedia history will note that I was late to this party – the original Direct3D API was developed by a company called RenderMorphics, which had been purchased by Microsoft. But their original “execute buffer” based API was so difficult to use that Microsoft seriously considered switching to a different API called OpenGL. My contribution was to lead development of the “DrawPrimitive” API that changed the course of that discussion. The Wikipedia article on this topic has some references (full disclosure, I am a contributor). https://en.wikipedia.org/wiki/Comparison_of_OpenGL_and_Direct3D#Direct3D

CUDA Graphs, ROI, and API Adoption

September 27, 2018Uncategorizedadmin

CUDA 10 adds a new API called “CUDA Graphs” that are immediately familiar to graphics API designers: they are a scene graph API for compute. Scene graph APIs enable developers to describe geometry at a “higher”™ level, in ways that express the relationships between, say, rooms and doorways within a castle or the arms and legs of a 3D character. The idea is that with this additional information, the API implementor (in this case, NVIDIA) can write code that will traverse the scene graph (say, rendering the characters with their limbs animated) more efficiently than code written by the developer. Either that, or the scene graph API is sufficiently easier to learn than learning how to write the scene graph code that developers can achieve faster time-to-market by learning and using the scene graph API.

I am skeptical that CUDA Graphs will achieve adoption outside NVIDIA’s SDK samples.

API designers drive adoption by maximizing the return on investment, where the return is efficient, working code and the investment is developer time. APIs that are not easy to learn are disadvantaged because every developer who writes or maintains the code must invest in learning the API. APIs that don’t deliver a compelling performance advantage must be *very* easy to learn, hence conferring an expressive advantage. (i.e. faster development times.)

CUDA adoption has been driven by delivering huge performance gains (the return) despite a steep learning curve (the investment). (It makes for an interesting thought-experiment to wonder why CUDA has succeeded and other manycore platforms have not. Although this blog post does not touch on the issue, customer investments must be considered in addition to developer investments.)

An early API (in fact, it was created in the 1970s, long before the term “API” had been invented) that delivers high ROI is BLAS, the Basic Linear Algebra Subprograms. Originally written in FORTRAN, the motivations for this library were twofold: to “provide names and argument lists that might become widely used and recognized for some of the basic operations of computational linear algebra,” and “to improve efficiency of math software.” BLAS code is reasonably performance- and platform-portable. As the underlying platforms evolved, the same BLAS code benefited transparently from assembly language hand-coding to cache blocking to SIMD instruction sets. There was no need to update the API client code as the implementation changed underneath. BLAS has achieved widespread adoption in numerical code, amplifying developers’ expressive power and enabling them to leverage the development effort invested by others in its implementation. At this point, BLAS gets an inordinate amount of attention from hardware vendors, making it unlikely that developers can match its performance without exploiting a priori knowledge of their application requirements. It takes time to learn, but delivers a considerable return on that investment.

On the other end of the spectrum, an API that has high ROI by minimizing developer investment is malloc()/free(). Learning first-hand the difficulty of writing a fast, robust memory allocator has been an inflection point in many junior developers’ careers – it’s harder than it looks. Other APIs that deliver a high return with minimal investment: the thread synchronization APIs built into operating systems. They are not hard to learn and, for most developers, impossible to implement.

In the early days (DirectX 2.0-3.0), Direct3D had a scene graph API called the “retained mode,” but the last version shipped in 1996. No one was using it, despite heroic evangelism efforts by its developers. Developers could use “immediate mode” APIs to implement their own scene graphs more efficiently – both in terms of developer time and in terms of high-performance implementations of the operations they needed. As an added bonus, by writing the scene graph traversal themselves, developers kept all the IP in-house (e.g., their visibility algorithm) and, if there was a bug, they could fix it in their code on their own schedule.

Since game developers co-design their content development tools with the runtime, a great deal of intellectual property is encapsulated in the scene graph traversal. In a sense, 3D scene graph API designers were aspiring to co-opt developers’ core IP – never a winning proposition for a platform.

I suspect that CUDA developers will come to similar conclusions with the CUDA Graphs. No one will use them unless they deliver a return on investment in the form of higher performance, or greater expressiveness commensurate with the effort to learn the APIs. Higher performance will be difficult to achieve since CUDA gives developers ready access to the underlying tools used by the CUDA Graphs.

One possible opportunity for NVIDIA: perhaps CUDA Graphs will be an efficient way to enable concurrent execution of kernels that weren’t designed to run in streams? CUDA streams are like const correctness – it is difficult to retrofit code to use them because they must be plumbed into interfaces from top to bottom. An alternative to revisiting interfaces top-to-bottom is to add a “current stream” API (as CUBLAS did), but current-anything APIs interoperate poorly and tend to be inefficient at changing the current-thing. More importantly, the current-thing state must be saved and restored across interfaces.

So one path to adoption for CUDA Graphs may be an efficient way to enable concurrent execution of kernels that weren’t designed to use streams. But in general, like immediate-mode graphics APIs, most developers will be able to more quickly write their own code expressing the dependencies in their application than it would take to learn and use the CUDA Graphs APIs. And developer-authored code will run at least as fast, paying tribute to the First Law Of CUDA Development.

Unless CUDA Graphs deliver a high ROI, they will go the same way as other features that Seemed Like A Neat Idea At The Time, like dynamic parallelism and managed memory.

Don’t Move The Data!

October 6, 2017Uncategorizedadmin

NVIDIA just delivered their first Volta-enabled DGX-1 systems – great news for those who need the additional compute power of GV100 versus GP100:

	GP100	GV100
FP32 Compute	10.6 TFLOPS	15.0 TFLOPS
FP64 Compute	5.30 TFLOPS	7.50 TFLOPS
Memory Bandwidth	720 GB/s	900 GB/s

Wait, you say, that’s an interesting qualifier. Who doesn’t “need the additional compute power…?” Did someone hack into Nick’s blog account and post on his behalf? Or has he become a Luddite in his dotage?

Nope, no, I still think more compute is generally better; but it is past time to question the architecture of these systems with huge, discrete GPUs connected to the world by buses. The problem with DGX-1 is that those GPUs are hungry! They need to be fed! And they can only sip data through the tiny soda straw known as the PCI Express bus.

For perspective, let’s compare these chips to G80, the first CUDA-capable GPU. Let’s set the stage by observing that G80 was the largest ASIC NVIDIA could feasibly design and fabricate in 2006, straining the limits of contemporary fabrication technology – a classic “win” chip. It had 684M transistors, a theoretical maximum performance of 384GFLOPS for single precision, and no support at all for double precision. GP100 and GV100 respectively have 22x and 31x more transistors, and 27x and 39x more single precision performance than G80. But the bandwidth to deliver data to and from these GPUs has not been increasing commensurately with that performance.

Here’s a table for all 3 GPUs – G80, GP100 and GV100 – that highlights the FLOPS/byte of bandwidth for device memory (attached to the GPU), NVLINK (NVIDIA’s property GPU-GPU interconnect), and PCI Express:

	G80	GP100	GV100
GFLOPS (SP)	384	10600	15000
GPU↔GPU memory	84 GB/s	720 GB/s	900 GB/s
FLOP/Byte	4.5	14.7	16.67

GPU↔GPU	n/a	20 GB/s	20 GB/s
FLOP/Byte		530	750

CPU↔GPU	3.1 GB/s	3.1 GB/s	3.1 GB/s
FLOP/Byte	124	3419	4839

The 3.1GB/s figure comes from dividing the available PCIe bandwidth by the number of GPUs in the system. Two 16-lane PCIe 3.0 connections are about 25 GB/s observed, and there are 8 GPUs.

As the number of FLOPS per byte of I/O diverges, the number of workloads that benefit from more FLOPS diminishes. Googling around for literature on FLOPS/byte, I ran across this 2011 presentation by Peter Kogge entitled “Hardware Evolution Trends of Extreme Scale Computing.” For anyone in the GPU business, the first sign that something’s amiss crops up in Slide 3, which cites “1 byte/FLOP” as the “classical goal.” Even G80’s device memory fell well short of that goal with 1 byte/4.5FLOPS. I prefer this framing because it adopts the viewpoint of scarcity (bytes/FLOP – getting data in and out for processing) rather than abundance (FLOPS/byte – having lots of processing power to bring to bear on data once it is in hand).

The presentation is from 2011, but still very relevant: after reviewing Moore’s Law and the rise and fall of Dennard scaling, and the preeminent importance of power dissipation in modern computing, the concluding slide reads in part:

World has gone to multi-core to continue Moore’s Law
Pushing performance another 1000X will be tough
The major problem is in energy
And that energy is in memory & interconnect
We need to begin rearchitecting to reflect this …
DON’T MOVE THE DATA!

“DON’T MOVE THE DATA” has been good advice to everyone who’s had the data for decades (in 1992 I wrote a Dr. Dobb’s Journal article that focused on hand-coding x87 assembly to keep intermediate results in registers)… but the advice has more currency now.

Moving The Data on CPUs

The data/compute conundrum finds expression on modern multi-core CPUs, too. Each core on a modern x86 CPU has ILP (instruction level parallelism) of 5, meaning it can detect parallelism opportunities between non-dependent instructions and execute up to 5 instructions in a single clock cycle. Latency to the L3 cache is about 50 clock cycles. So a CPU core can perform dozens of FLOPS on data in registers during the time it takes for the L3 to service a load (conservatively – 2 of the 5 pipelines can do 8 FLOPS per instruction via AVX). And that’s assuming the data was in cache!

As an aside, this observation helps explain why “optimized” numerical Python code is still dead slow. Python is interpreted, so has a library called Numpy that wraps vectorized implementations of operations that do things like element-wise addition or multiplication between arrays. But for arrays that don’t fit in cache (and to some extent, even arrays that do fit in cache), it is very inefficient to do multiple passes over the data if the computation could have been fused into a single pass. The code spends all of its time moving data, and very little time processing it.

DON’T MOVE THE DATA!

A Gift From Heaven: Deep Learning

Which workloads, pray tell, require endless FLOPS per byte of I/O? Or turn it around and ask, which workloads still thrive when there is barely any I/O per FLOP? NVIDIA hasn’t been shy about trumpeting its solution to this problem: deep learning! Training a deep learning network entails refining floating point weights that roughly represent neurons that “learn” as they are trained on the data. As long as the weights can reside in device memory, only a modest amount of I/O is needed to keep the GPU busy. In retrospect, NVIDIA is extremely fortunate that deep learning cropped up. Without it, it’s not clear what workload could soak up all those FLOPS without the GPUs starving. The importance of machine learning as a workload helps explain why GV100 contains purpose-built hardware for machine learning, in the form of the TensorCore. But that hardware actually exacerbates the GPU starvation problem, by increasing FLOPS without increasing bandwidth.

NVIDIA probably isn’t comfortable betting the farm on a single workload – especially one where their main customers are enterprises that can invest in their own machine learning hardware and that is attracting VC money for application-specific hardware. How do you hedge? How can NVIDIA relieve the bottleneck? Unless some workload materializes that is as compute-intensive (per byte of I/O) as machine learning, NVIDIA must seek out ways to address their GPUs’ I/O bottleneck.

I/O: NVIDIA’s Strategic Landscape

The problem confronted by NVIDIA is that they are hindered by some business and legal challenges. According to the terms of their 2011 settlement with Intel, 1) They do not have a license to Intel’s industry-leading cache coherency protocol technology, and 2) they do not have a license to build x86 CPUs, or even x86 emulators.

NVIDIA has done what they can with the hand they were dealt – they built GPUDirect to enable fellow citizens of the bus (typically Infiniband controllers) to access GPU memory without CPU intervention; they built NVLINK, a proprietary cache coherency protocol. They have licensed NVLINK to IBM for the POWER architecture and signaled a willingness to license it to ARM licensees. The problem is that POWER and ARM64 are inferior to Intel’s x86, whose high-end CPU performance is unmatched and whose “uncore” enables fast, cache coherent access across sockets. NVIDIA itself, though an ARM licensee, has announced that they will not be building a server-class ARM chip.

I’m not sure why NVIDIA announced they would not be building their own ARM to drive their GPUs, because that seems like an obvious way for them to own their destiny. It may be that NVIDIA concluded that ARM64 cores simply will never deliver enough performance to drive their GPUs. That’s too bad, because there is a lot of low-hanging fruit in NVIDIA’s driver stack. If they made the software more efficient, it could either run faster on the same hardware or run at the same speed on lesser hardware – like ARM64 cores.

Not being able to coordinate with Intel on the cache coherency protocol has cost NVIDIA big-time in at least one area: peer-to-peer GPU traffic. Intel could, but chooses not to, service peer-to-peer traffic between NVIDIA GPUs at high performance (Intel and NVIDIA give different stories as to the reason, and these conversations happen indirectly because the two companies do not seem to have diplomatic relations). As things stand, if you have a dual-CPU server (such as NVIDIA’s own DGX-1) with cache coherency links between the CPUs, any peer-to-peer GPU traffic must be carefully routed past the CPUs, taking care not to cross the cache coherency link. If Intel could license QPI to Altera, they could license it to NVIDIA. Failing to do so is a matter of choice and a by-product of the two companies’ respective positions in the business and legal landscapes.

As things stand, NVIDIA is dependent on Intel to ship great CPUs with good bus integration, and peer-to-peer-capable GPU servers have to be designed to steer traffic around the QPI link. The announcement that NVIDIA would not build ARM64 SOCs was done in 2014, so now that the competitive landscape has evolved (and though I can remember when Intel’s market capitalization was 12x NVIDIA’s, it is now only about 1.7x), it would not surprise me if NVIDIA revisited that decision.

One Path Forward: SoCs

One partial solution to the interconnect problem is to build a System on a Chip (SoC): put the CPU and GPU on the same die. Intel and AMD have been building x86 SOCs for many years; it is Intel’s solution to the value PC market, and AMD has behaved like their life depended on it since 2006, when they acquired GPU vendor ATI. NVIDIA’s Tegra GPUs are all ARM SoCs. The biggest downside of SoCs is that the ratio of CPU/GPU performance is fixed years before the hardware becomes available, causing workloads to suffer if they are more CPU- or GPU-intensive than the SoC was designed to address. And if the device doesn’t have enough performance, scaling performance across multiple chips may be more difficult because GPUs require such high bandwidth. A conspicuous success story for big SoCs has been in the gaming console market, where the target workload is better-understood and, in any case, game developers will code against whatever hardware is in the console.

So I suspect that as workloads continue to tap out the FLOPS and balance out the bandwidth/FLOPS, big SoCs will start to make more sense. In sizing the CPU/GPU ratio, hardware designers can create a device with the biggest possible GPU that doesn’t starve with the available bandwidth.

SoCs are just a stopgap, though. As the laws of physics continue to lower the boom, the importance of system design will continue to increase, as Kogge pointed out in his 2011 presentation. The fundamental problem of the speed of light isn’t going away… ever.

Managed Memory and Segmentation

August 28, 2017Uncategorizedadmin

At the GPU Technology Conference this year, I ran into an old colleague from NVIDIA and the topic of managed memory came up. He related that earlier in the conference, the Q&A session after a Dell-sponsored presentation on managed memory had suffered a serious decline in the level of discourse. Neither of us had been in attendance, but apparently one questioner after another stood up and asked the presenters, in effect, “Where are you going with this?”

He made it sound like the presenters thought they were going to be pelted with rotten fruit!

Why the controversy? Managed memory is supposed to make CUDA programming simpler. It is intended to eliminate the need to copy data back and forth between CPU and GPU memory. If it were performance-neutral like, say, C++ lambdas, then managed memory would be a welcome addition to CUDA. The problem is that any feature that presents a risk of degrading performance will be viewed with skepticism by the CUDA programming community – because no one programs CUDA for fun.

What is it about CUDA that makes it so difficult to automatically manage residency of memory?

Reflecting on the answer to this question brought me back to a conversation I had at a different GTC, so long ago that I’m not sure which year it occurred. I had just met Daniel Moth, the Program Manager at Microsoft for C++ AMP. Once it was firmly established that we were fellow travelers, charting the technical roadmaps for competing data parallel programming environments, he had a question.

“Tell me one thing,” he asked. “Why do you need streams and events?”

I had to think for a minute. Why indeed? I’d added the feature in CUDA 1.1, to cover new hardware that could DMA host memory concurrently with kernel execution; but it was already clear that the new abstractions were future-proof to multiple kernels executing concurrently, and even coordinating execution between multiple GPUs.

“Streams are like CPU threads,” I told him stupidly, quoting from the original design document I’d written in 2007. “Operations that are done in different streams can happen concurrently. And you need events to coordinate execution between streams.”

“But we don’t need that stuff in C++ AMP. The stuff that can be done in parallel, we just do it in parallel.”

After a few minutes’ conversation, the key difference emerged and I finally had it.

“Oh,” I cried. “CUDA has a flat address space!”

C++ AMP does not.

CUDA’s address space causes more trouble than is widely appreciated. Because pointers can be stored in device memory, any CUDA kernel can attempt to access any address. In early versions of CUDA, where paging is not supported (every byte of virtual memory is backed by physical memory) and systems with multiple GPUs were rare (and certainly not for sale in the public cloud), having an address space seemed to make sense. That impression was bolstered by the hardware design community’s ideological commitment to linear address spaces, which had taken root after a divisive debate contrasting linear address spaces with segmentation.

Linear Addressing versus Segmentation

Segmentation is the idea that memory should be modeled as a set of discrete buffers with base pointers and lengths, rather than assigning an address (like a PO Box) to each byte of memory. Segmented memory is accessed via a segment/offset tuple instead of by a single address, a paradigm that is implemented at the hardware level. Intel’s x86 architecture was segmented from the beginning (c. 1976). It provided for 4 segments to be accessible at any given time: the segment registers CS, DS, SS, and ES were for code, data, stack, and “extra” data, respectively. Each segment register had a base address and a length, and most machine instructions implicitly referenced a segment that represented a sensible default. The PUSH and POP instructions that operate on the stack implicitly referenced the stack segment (SS). Loads and stores from memory implicitly used DS, the data segment, unless that default was overridden by a “segment prefix” instruction. For example, the SS: prefix could be used to operate on stack memory.

The problem with segments was that they made code difficult to compose: even simple operations like function calls were complicated by potential differences between the segment register settings needed by the caller and callee. The callee could save and restore its segment registers at the subroutine boundary, but that hurt performance. More typically, developers would select a “memory model” with fixed segmentation usage that was appropriate for their application. So-called “large” memory models would just specify a segment:offset tuple for every address; under MS-DOS, this amounted to a cheesy way to enable 20-bit addressing with 32-bit addresses, or 1M of memory with 4G worth of address width. It also hurt performance since every load and store needed a segment override.

Segmentation introduced difficult, but solvable, problems for developers of individual applications; but even 25 years ago, it was clear that plugin architectures like OLE automation would play a central role in future software development. Being able to load code and data dynamically into an application and have it “just work,” without having to worry about segments, was of paramount importance. The ability for libraries to efficiently access their callers’ data, and process it on their behalf, overrode the concerns that buggy code could corrupt data that happened to be accessible.

Segmentation and flat addressing can be reconciled by enabling large segment offsets and having the operating system map all the segments to cover the same address range. This usage was anticipated when the Intel i386 was released in 1986, and implemented in 32-bit multitasking operating systems like UNIX (or Microsoft’s long-lost Xenix), and later, OS/2 and Windows NT. This paradigm was so popular, and the need for segmentation support in hardware so unclear, that AMD mostly did away with segment registers when they revised x86 to enable 64-bit addressing in the early aughts.

When I wrote the specification for CUDA textures, with a clear separation between memory and views on the memory (CUDA arrays and texture/surface references, respectively), it quickly became clear that CUDA arrays were effectively segmentation. A CUDA kernel can’t access just any CUDA array; the CUDA driver must predeclare the CUDA arrays to be accessed by a kernel. Coupled with other per-launch parameters, such as the amount of shared memory and the number of registers needed, a CUDA kernel launch more closely resembles a container launch than a subroutine call.

A key reason segmentation was an abject failure for general-purpose computer architectures was the high cost of “switching segments” on a per-instruction basis. On x86, instructions such as LDS (load data segment) were costly; instruction prefixes to change the segments being operated on by a given instruction added complexity; and naïve systems that kept segment:offset tuples for all pointers essentially wasted addressing bits. Now that we have 64-bit addressing, it is possible to envision having page tables play the role of segments (by introducing a byte-granular limit to page table size), as argued in this blog post. For now, however, there is a decisive consensus in favor of flat address spaces.

What does all this have to do with managed memory?

By implementing segmentation on a per-kernel basis instead of a per-machine-instruction basis, GPU computing technologies get many of the benefits of segmentation, without the costs that hindered adoption on the CPU side. Kernels may take slightly longer to launch than they would otherwise, but the cost of a kernel launch is high enough that the additional cost of segmentation is negligible. And if each kernel launch predeclares the needed segments, the system can infer residency requirements, ensure coherency, and identify parallelism opportunities, much in the same manner that superscalar CPUs use real-time dependency analysis to identify which instructions can execute in parallel.

What do you mean by “infer residency requirements,” you ask? You guessed it: managed memory!

What do you mean by “identify parallelism opportunities,” you ask? You guessed it: automatic CUDA streams!

What about coherency? Not much would change here. The CUDA driver already uses software mechanisms to enforce coherency, for example, by inserting cache-invalidate instructions into the command stream before launching kernels that read from texture. In a segmented memory architecture, read-only segments can be copied where they are needed, then discarded without having to worry about propagating changes to the data. Writeable segments could be copied back wholesale, or using dirty bit optimizations.

So, it is not hard to imagine a GPU computing technology that uses segmentation to manage memory rather than a linear address space. In fact, we do not have to imagine C++ AMP – it’s already here – and for CUDA, programs that used only CUDA arrays would have the properties needed to automate residency and parallelization. As a side note, the WDDM display driver model introduced in Windows Vista embraced a segmented memory architecture for paging.

Let’s review some of the deficiencies in managed memory, as discussed in my previous blog. It attempts to infer residency requirements based on memory accesses – which hurts performance and breaks the First Law of CUDA Development. It breaks the useful ability to infer the “owner” (CPU or which GPU) of a given address in the Unified Virtual Address Space. The semantics of multi-engine and multi-GPU memory management are complicated, and require hinting. Even if we set aside warranted skepticism about whether the hinting will be future-proof (I have my doubts), it introduces enough complexity that managed memory does not compare favorably to the static, affinitized allocations like CUDA 1.0 or segment-based architectures like C++ AMP.

Managed Memory: Belated Comments on Implementation

August 6, 2017Uncategorizedadmin

Managed memory is a new-ish CUDA feature that aspires to do away with the need to explicitly copy CPU memory to and from GPU memory. Introduced in CUDA 6.0, its initial implementation was unusably slow. (For example, copying managed memory from GPU to CPU memory ran at 512MB/s, 25x slower than an asynchronous memcpy.)

When they undertook to build the managed memory feature, NVIDIA had many different implementation strategies they could have pursued. As best I can tell, here is a summary of their implementation strategy:

Upon allocation of managed memory, the CUDA driver allocates device memory, plus a pageable range of CPU memory at the same virtual address range.
The CUDA driver use dirty page bits to track which 4K pages were “touched” by the CPU.
Upon kernel launch, the CUDA driver would unmap the managed memory from the CPU and copy the dirty pages from the CPU to the GPU. Unmapping the CPU memory removes the risk of write-after-read hazards from the CPU corrupting managed memory before the GPU was able to copy it.
While CUDA kernels run, the device memory copy of the managed memory is the only valid one.
Upon CPU/GPU synchronization, the CPU buffer is made accessible again, but is not copied wholesale from GPU memory. It is possible the GPU’s hardware does not have the same dirty bit tracking facilities as the CPU, or perhaps NVIDIA just thought it would be preferable to copy device memory back to the CPU “on demand.”
Copying managed device memory back to host memory is prompted by page faults: when the CPU attempts to access a page of managed memory, the CUDA driver handles the page fault by copying the 4K of GPU memory to CPU memory.

The application I used to investigate NVIDIA’s managed memory implementation is only about 60 lines of code. The key component is a function usPerLaunch that allocates a specified amount of managed memory, launches a NULL kernel, synchronizes with the GPU, then optionally “touches” the managed memory to force the CUDA driver to copy it back to host memory. (In an earlier version of this test, I confirmed that CUDA lazily copies only “dirty” pages in the other direction, as NVIDIA claims in its documentation.)

Here is the function in question – it may be found in cudahandbook/concurrency/managedOverhead.cu:

const size_t pageSize = 4096;
template
double
usPerLaunch( int cIterations, size_t cPages=0 )
{
    cudaError_t status;
    double microseconds, ret;
    chTimerTimestamp start, stop;
    void *p = 0;

    cuda(Free(0) );
    if ( cPages ) {
        cuda(MallocManaged( &p, cPages*pageSize ) );
    }

    chTimerGetTime( &start );
    for ( int i = 0; i < cIterations; i++ ) {
        NullKernel<<<1,1>>>();
        cuda(ThreadSynchronize() );
        if ( bTouch && 0 != p ) {
            for ( int iPage = 0; iPage < cPages; iPage++ ) {
                ((volatile unsigned char *) p)[iPage*pageSize] |= 1;
            }
        }
    }
    chTimerGetTime( &stop );

    microseconds = 1e6*chTimerElapsedTime( &start, &stop );
    ret = microseconds / (float) cIterations;
    cudaFree( p );
Error:
    return (status) ? 0.0 : ret;
}

I ran this program on a Haswell-based Windows 7 machine on two NVIDIA GPU boards: the NVIDIA GeForce GTX 970 and Titan X (GM200 and GP100, respectively). Although both are large “win” chips, I would expect similar test results to hold true across all Maxwell and Pascal GPUs, since they seem to have implemented a hardware interface that improved managed memory performance.

μs

Launch time (ms)	Memory (KB)	Bandwidth (MB/s)
47	0	–
105	4	39
104	8	78
115	16	143
134	32	244
213	64	307
381	128	344
649	256	404
1247	512	420
2221	1024	472
4712	2048	445
8458	4096	496
17041	8192	492
33992	16384	494

Table 1. GM200 results.

Launch time (ms)	Memory (KB)	Bandwidth (MB/s)
39	0	0
47.15	4	7
49.86	8	164
57.84	16	283
59.04	32	555
64.73	64	1012
79.08	128	1657
98.41	256	2664
137.15	512	3823
205.56	1024	5101
391.91	2048	5351
745.81	4096	5624
1543.91	8192	5433
3114.83	16384	5386

Table 2. GP100 launch results.

“Better,” however, does not mean “good.” The most important thing to note is that these kernel launch times are VERY SLOW. You can measure synchronous and asynchronous kernel launch times with the nullKernelSync.cu and nullKernelAsync.cu programs in the same directory. On this machine, those times are 46.35 and 3.25 microseconds, respectively. (In fairness, results likely would be better under Linux, especially the synchronous kernel launch. On Windows 7, launching a CUDA kernel always requires the driver to have the operating system do a user-kernel transition or “kernel thunk.” Sadly, no amount of editing can get around the sad fact that CUDA kernels and OS kernels are completely different things and some sentences must refer to both!)

On the Maxwell machine, whatever mechanism NVIDIA is using to copy managed memory back from the GPU has a maximum performance of less than 500MB/s. That’s a nonstarter. It is more than 25x slower than the bus bandwidth. Pascal has improved things, but is still less than half the performance of a PCI Express 3.0 link. A CUDA kernel reporting results via mapped pinned memory would achieve much higher performance.

Superficially, NVIDIA’s implementation makes sense, assuming there is one CPU and one GPU and that the application isn’t doing any fancy tricks with CPU/GPU concurrency. The main mistake in their implementation was failing to speculatively copy extra pages back from the GPU to the CPU in Step 6, an oversight that seems to have been remedied in subsequent releases. The overhead of servicing the page fault is so high that it’s dominated by interrupt handling, not copying of a 4K page, so it makes sense to copy more pages on the page fault until the overhead of the additional copying becomes non-negligible.

Less clear, however, is the optimal behavior of managed memory in a system with multiple GPUs. Does a managed memory buffer get allocated for each GPU? When a kernel is launched on GPU 0, do the other GPUs get copies of the managed memory? Which memory ranges are valid for which GPUs as kernels are executing? And it seems clear that managed memory can’t possibly retain the property that the “owning” device can be inferred from a UVA address, by e.g. calling cudaPointerGetAttributes().

The paradigm also breaks for applications that perform memory copies and kernel processing concurrently.

I submit that the APIs needed to “enlighten” the managed memory subsystem to do the right thing, are at least as complicated as simply writing the CUDA code to explicitly allocate and copy memory.