or: How I Learned To Stop Worrying And Deprecate cudaGetLastError()

Every API designer has struggled with the question of how best to propagate errors to their callers, since before the term “API” was invented. Even decades ago (say 30+ years), interface designers knew to separate the error return from the payload, in functions that return other results to their caller.

Since it is sometimes useful to know what not to do: My favorite example of an antipattern in this department is in the venerable atoi() function in the C runtime, which converts a string to an integer:

int atoi (const char * str);

Since there is no guaranteed-invalid value of int to enable the caller to see that an invalid string (one that cannot be interpreted as an integer) was passed to the function, this interface is fatally flawed. Some implementations even stipulate that the error return is 0, leading to questions about whether the string “0” is a valid input.

The C runtime is lousy with crummy interfaces like this one, and may be the best evidence that the skill sets needed for good language designers and good interface designers are anti-correlated.

Okay – now that we have an example of what not to do, let’s consider the other options for error handling. They fall roughly into three categories:

Exceptions: functions in the API may signal exceptions in the target language. Anyone in the call stack can field the exception by enclosing the API call(s) in a try clause.
Immediate error return: a unified, API-specific error type is defined, with codes that specify the error.
Get-last-error: when an error occurs, a global error state is set that may be queried later by another function in the API.

Let’s start by ruling out exceptions. Even if the target language is known, and supports exceptions, (C++ and Python come to mind), exceptions aren’t always the best choice. After 30 years of exceptions being present in the C++ language, they still are controversial because it is generally difficult to inspect code and know where and why exceptions may be signaled. Raymond Chen, a legendary developer at Microsoft who has worked on Windows for more than 30 years, wrote a few blogs on this topic almost 20 years ago. He had to follow up “Cleaner, More Elegant, and Wrong” with “Cleaner, More Elegant, and Harder To Recognize” because a blanket condemnation of exceptions is hard to support. But when building libraries like the CUDA runtime, or CUDA’s domain-specific libraries such as cuBLAS, cuSOLVER, and the like, it is easy to rationalize reverse-delegating the use of exceptions onto your caller: your client always can, at their own discretion, transmute error returns into exceptions. On the flip side, to incorporate exception handling into such a library (i.e. define an exception type and document which functions will signal which exceptions under which circumstances) is a policy imposition on your client. When you add the consideration that exception handling is language-specific, eschewing exception handling in language-independent library designs becomes an even easier decision.

As far as immediate error returns go: this was the design choice for CUDA from the beginning, with a caveat we’ll get to in the next section. From the beginning, CUDA supported immediate error returns in the form of the CUresult type for the driver API, and cudaError_t for the CUDA runtime. Pretty much all of the API functions return an error status; if they allocate a resource on behalf of the caller, they take a parameter that enables the function to pass back a handle to that resource. That is why cudaMalloc() returns cudaError_t instead of void *, like the C runtime malloc(). Even though cudaMalloc() could have emulated the C runtime’s malloc(), returning a guaranteed-invalid value in the event of failure, it was more important for the CUDA runtime to be self-consistent than to echo the decades-old API design of the C runtime. (If you are jumping up to defend the C runtime, I invite you to review the semantics of the realloc() function. I can do this all day!)

The main problem with immediate error returns is that they can only be checked at the call site. If a client calls a function and it returns an error, and the caller chooses to ignore that error, the details of the error return are lost in the mists of time. But that putative bug is often a feature – in other words, it is common to write code that explicitly ignores possible error returns. Consider a function that allocates two buffers on behalf of its caller:

cudaError_t allocateTwoBuffers( void **bufferA, size_t nA, void **buffer, size_t nB )
{
    cudaError_t err = cudaMalloc( bufferA, nA );
    if ( cudaSuccess == err ) {
        err = cudaMalloc( bufferB, nB );
        if ( cudaSuccess != err ) {
            cudaFree( bufferA );
        }
    }
    return err;
}

Here, the return value from cudaFree() is ignored, because we want to propagate the resource allocation failure, not whatever cudaFree() returns.

And now is a good opportunity to give the C runtime some grace, at least after it was worked over by standards committees, because free() returns void; NULL is specified to be a valid input to free(); and free(NULL) is defined to be a no-op. With benefit of hindsight, it might have been prudent to define similar semantics for the family of CUDA APIs that free resources – not just cudaFree(), but cudaStreamDestroy(), cudaEventDestroy(), and so on. But applications can get the same effect by ignoring the return values from those functions, as illustrated by allocateTwoBuffers() above .

allocateTwoBuffers() is not structured in a way that I’d publish in the CUDA Handbook, because a goto-based error handling idiom scales better if, say, the function is modified to allocate more resources on behalf of its caller.

With exceptions and immediate error returns dealt with, that brings us to get-last-error idioms, which may be found in APIs such as OpenGL with glGetError(), Win32’s GetLastError(), and now CUDA’s cudaGetLastError() and cudaPeekLastError(). It even may be found in the C runtime, in the form of the errno global.

The exact semantics of get-last-error patterns vary from API to API; certainly not every Win32 function updates the error that would be returned from GetLastError(), and not every C runtime function sets errno. But the general idea is that if an error occurs, a global (or on multithreaded systems, thread-local) variable is set that may subsequently be examined to see if an error occurred. The advantage to this approach is that the error is “sticky,” so the error condition can be examined in places other than the call site where an API function returned an error.

I have avoided get-last-error idioms in my APIs, for several reasons:

They are not thread-safe – the “last error” for one CPU thread obviously differs from the “last error” returned to another thread, so the “last error” must be stored in a TLS slot.
They make for a developer education challenge: How often should the last-error be checked? After every call? I do know of a prominent OpenGL application that calls glGetLastError() after every API call. But, no API designer worth their weight in salt would advocate for doubling the number of API calls by their clients. So, how often?
The semantics around setting, examining, and clearing the last-error must be clarified, as evidenced by CUDA having separate cudaGetLastError() and cudaPeekLastError() functions. cudaGetLastError() clears the error code, so it can only be called once downstream of the error. That presents issues that resemble the problem with immediate error codes, which is that they can only be checked at the call site. cudaPeekLastError() gives developers more creative control, but large bodies of code must carefully define their own protocols as to when these various error handling functions are called.
Most importantly, as evidenced by the earlier code fragment allocateTwoBuffers(), it sometimes makes sense for applications to explicitly ignore potential error codes, while passing along error codes that had been returned before.

So CUDA uses immediate error returns. All of CUDA’s functions return error codes, and well-written CUDA applications check every single return value. It is rare for CUDA functions to be unable to fail. As an API designer, I am a big proponent of designing functions that cannot fail, but those are hard to engineer. Typically they entail allocating resources preemptively, so that a resource is known to exist by dint of the code executing at all. (This is why the Windows NT kernel’s family of thread synchronization allocation functions, such as KeInitializeMutex() and KeInitializeSemaphore(), return void. They often are called on memory that was statically allocated, so if the code is running at all, we know the loader successfully allocated the memory where the mutex or semaphore will reside.) CUDA also preallocates resources to decrease the likelihood of a runtime failure; as an example, when CUDA was first built, we made sure every CUDA context preallocated all the staging buffers needed to perform any memcpy. If the allocations of those staging buffers (or any of the other innumerable resources needed by a CUDA context) failed, the CUDA context creation would fail; so anyone with a valid CUDA context handle is in possession of a CUDA context that can perform memcpy’s without ever failing due to a failure to allocate a staging buffer. But CUDA memcpy functions can fail for other reasons, like if the application passes pageable host memory to an asynchronous memcpy. Other examples of functions that are unlikely to fail include cuEventQuery() and cuStreamSynchronize(); but those functions can trigger submissions of pending work to hardware, which can result in failures per the discussion in the next section. So callers still must check their return values.

Considering CUDA is more than 15 years old, there is a surprising diversity in error-handling idioms. Even setting aside exactly what you want to do if an error occurs—print an informative message and exit, propagate the error to your caller, signal an exception, and so on—there is no consensus as to whether to call a templatized utility function or just use a preprocessor macro. Most developers do define macros, though, with names like SAFE_CUDA or CUDART_CHECK (my choice for the CUDA Handbook sample code before I refactored it to use Allan MacKinnon’s clever prefix-macro idiom).

Even the CUDA SDK samples do not have a standardized method of error handling.

Implicit Get-Last-Error

A close examination of CUDA’s immediate-error based API, of course, shows that there is a get-last-error semantic hiding in its putatively immediate error handling regime. cudaDeviceSynchronize() and many CUDA APIs are annotated with language along these lines: “Note that this function may also return error codes from previous, asynchronous launches.” The reason is because any kernel invocation can cause runtime errors that are all-but-impossible to recover from, like dereferencing an invalid pointer. Without providing for more granular error checking, asynchronous APIs (like kernel launch, which always has been asynchronous, even on CUDA 1.0) all but require a get-last-error paradigm. Checking for errors requires CPU/GPU synchronization, which hinders performance. A serviceable precedent to cite here is the way NaNs and INFs propagate in floating point computations – checking every arithmetic operation is too expensive, so having sticky errors propagate into results where they can be checked later, at a coarser grain, was deemed the right tradeoff by the IEEE standards committee.

As best I can tell, it is possible to write completely correct CUDA programs without ever calling cudaGetLastError(). If there are any CUDA functions that set the last-error, but do not return an immediate error code, then NVIDIA made every developer’s job harder for no discernible reason; but there’s no evidence that NVIDIA did that. It looks like they just added the function to more explicitly surface the last-error semantics already baked into the CUDA runtime.

A Prescriptive Remedy

More fine-grained error reporting that didn’t noticeably hinder performance would be relatively easy to add, by the way: just overload CUDA events a bit more. We already have CUDA events that may or may not use blocking waits; why not add an error reporting mechanism? That way, applications could at least bracket which kernel launch caused an error. A new function could query the event’s error status would return the get-last-error status as of the time that the event was signaled:

CUresult cuEventCheckError( CUresult *p, CUevent hEvent, uint32_t Flags );

This hypothetical API could enforce that the event has to be signaled to be a valid input, or you could specify a flag that forces the function to be synchronous (i.e. wait until the event’s wait is satisfied before returning the event’s get-last-error status). The details don’t matter. The API would still have a get-last-error vibe, because if there were multiple kernel launches or CUDA API calls before the event, the same ambiguity would still exist as to which operation caused the error to be signaled. But the granularity would be under control of the application developer, and it would run as fast as CUDA hardware can process events.

cudaGetLastError(): Just Say No

In the meantime, my general take on cudaGetLastError() is that since well-written CUDA code does not have to use it, well-written CUDA should not use it.

Monthly Archives: November 2023

Asynchronous Error Handling Is Hard

or: How I Learned To Stop Worrying And Deprecate cudaGetLastError()

Implicit Get-Last-Error

A Prescriptive Remedy

cudaGetLastError(): Just Say No