Monthly Archives: August 2013

Why does CUDA CUdeviceptr use unsigned int instead of void *?

This question on StackExchange was put on hold as primarily opinion-based: “…answers to this question will tend to be almost entirely based on opinions, rather than facts, references, or specific expertise.”

The content of StackExchange is usually high quality, but in this case, while the design decision was based on opinion, the answer to the question needn’t be… you just need to ask the people who know! And the inimitable talonmies, who is poised to crack 30k on StackExchange’s points-based reputation system, compounded the problem by saying that CUdeviceptr is a handle to a device memory allocation, not a pointer.

I don’t think I have ever seen talonmies give an incorrect answer before, but in this case he’s off the mark: CUdeviceptr has always represented a pointer in the CUDA address space. In fact, though mixing driver API and CUDA runtime code was frowned upon, even in CUDA 1.0 you could convert between CUDART’s void * and the driver API’s CUdeviceptr by writing something like:

void *p;           /* CUDA runtime (CUDART) pointer */
CUdeviceptr dptr;  /* driver API device pointer */

/* going through uintptr_t (from <stdint.h>) avoids pointer/integer size warnings */
p = (void *) (uintptr_t) dptr;
dptr = (CUdeviceptr) (uintptr_t) p;

We could have made device pointers void *, but there was a desire to make it easy for compilers to distinguish between host and device pointers at compile time instead of runtime. Furthermore, SM 1.x hardware only supported 32-bit pointers, so using void * would have created a difference in pointer size on 64-bit host platforms. It’s a distant memory now, since so much great compiler work has gone into CUDA since then, but at the time “pointer-squashing” (having CUDA’s compiler transform 64-bit pointers into 32-bit pointers on 64-bit host systems) was a big issue in early versions of CUDA.

For the record, not making the driver API’s device pointer type void * is one of my bigger regrets about early CUDA development. It took months to refactor the driver to support 64-bit device pointers when hardware support for that feature became available in SM 2.x class hardware.
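
For the curious, the evolution looked roughly like this. This is a paraphrase from memory, not an exact quote of cuda.h: the original CUdeviceptr was a 32-bit unsigned integer on every host platform, and when 64-bit device pointers arrived it became wide enough to hold a 64-bit address on 64-bit hosts.

/* Paraphrased from memory, not an exact quote of cuda.h. */

/* CUDA 1.x through 3.1: device pointers were 32 bits on every host platform.
typedef unsigned int CUdeviceptr;
*/

/* CUDA 3.2 and later: wide enough for a 64-bit address space on 64-bit hosts. */
#if defined(_WIN64) || defined(__LP64__)
typedef unsigned long long CUdeviceptr;
#else
typedef unsigned int CUdeviceptr;
#endif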

In fact, some weeks before we released CUDA 1.0, we had a meeting and a serious discussion about replacing CUdeviceptr with void *, and decided not to take the schedule hit. We weren’t going to let the perfect be the enemy of the good, and we paid the price later.

While we’re on the topic of regrettable design decisions in early CUDA, I wish I had done a search-and-replace to convert CUfunction to CUkernel, and put cuLaunchKernel in the first release (in place of the stateful, chatty and not-thread-safe cuParamSet* family of functions). But we had scant engineering resources to spend on fit and finish, a constraint no less true for CUDA than for many other successful software projects in history.
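
To make that concrete, here is a rough sketch, written from memory with error checking omitted, of what launching a kernel that takes a device pointer and an int looked like with the old per-function state machine versus the cuLaunchKernel call that arrived in CUDA 4.0. The grid and block dimensions are arbitrary placeholders.

#include <cuda.h>

/* Old style: mutable per-function state, chatty and not thread-safe. */
void launch_old(CUfunction f, CUdeviceptr dptr, int n)
{
    int offset = 0;
    cuParamSetv(f, offset, &dptr, sizeof(dptr));  /* append pointer argument */
    offset += sizeof(dptr);
    cuParamSeti(f, offset, n);                    /* append int argument */
    offset += sizeof(n);
    cuParamSetSize(f, offset);                    /* total size of the argument block */
    cuFuncSetBlockShape(f, 256, 1, 1);            /* block size is per-function state too */
    cuLaunchGrid(f, 100, 1);
}

/* New style: everything needed for the launch is passed in one stateless call. */
void launch_new(CUfunction f, CUdeviceptr dptr, int n)
{
    void *params[] = { &dptr, &n };
    cuLaunchKernel(f,
                   100, 1, 1,   /* grid dimensions */
                   256, 1, 1,   /* block dimensions */
                   0,           /* shared memory bytes */
                   NULL,        /* stream */
                   params, NULL);
}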

Floating point: CPU and GPU differences

Some time ago, I wrote this in response to a StackOverflow question, but wanted to share here on the blog.

The question basically asked how you could get floating point operations to produce the same results on the CPU and the GPU, and here is an updated version of the answer:

There are many reasons why it is not realistic to expect the same results from floating point computations run on the CPU and GPU. It’s much stronger than that: you can’t assume that FP results will be the same when the same source code is compiled against a different target architecture (e.g. x86 or x64) or with different optimization levels, either.

In fact, if your code is multithreaded and the FP operations are performed in different orders from one run to the next, then the EXACT SAME EXECUTABLE running on the EXACT SAME SYSTEM may produce slightly different results from one run to the next.
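
Non-associativity is easy to see with plain C on the host; nothing GPU-specific is needed. The same three numbers summed in two different groupings give two different single-precision results:

#include <stdio.h>

int main(void)
{
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

    /* Both sums are 1 in exact arithmetic, but in the second grouping
       c is lost to rounding when it is added to b first. */
    printf("%g\n", (a + b) + c);   /* prints 1 */
    printf("%g\n", a + (b + c));   /* prints 0 */
    return 0;
}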

Some of the reasons include, but are not limited to:

  • floating point operations are not associative, so seemingly-benign reorderings (such as the race conditions from multithreading mentioned above) can change results;
  • different architectures support different levels of precision and rounding under different conditions (e.g. depending on compiler flags, and with rounding specified by a control word versus per instruction);
  • different compilers interpret the language standards differently, and
  • some architectures support FMAD (fused multiply-add) and some do not (a short demonstration follows this list).
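
To make the FMAD point concrete, here is a small C sketch. It assumes a libm whose fmaf() is correctly rounded and that the compiler is not contracting the separate multiply and add into an FMA on its own (e.g. build with -ffp-contract=off): the fused operation rounds once, the separate operations round twice, and the results differ.

#include <math.h>
#include <stdio.h>

int main(void)
{
    float a = 1.0f + 0x1.0p-12f;     /* 1 + 2^-12 */
    float c = -(1.0f + 0x1.0p-11f);  /* -(1 + 2^-11) */

    float prod = a * a;              /* exact value 1 + 2^-11 + 2^-24 rounds
                                        to 1 + 2^-11; the 2^-24 term is lost */
    float separate = prod + c;       /* 0 */
    float fused    = fmaf(a, a, c);  /* rounds once: 2^-24, about 5.96e-08 */

    printf("separate = %g, fused = %g\n", separate, fused);
    return 0;
}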

Note that for purposes of this discussion, the JIT compilers for CUDA (the magic that enables PTX code to be future-proof to GPU architectures that are not yet available) certainly should be expected to perturb FP results.

You have to write FP code that is robust despite the foregoing.
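
What “robust” means depends on the application, but one common building block is to compare results against a tolerance instead of testing for exact equality. A minimal sketch, with the tolerance values left as knobs you would choose for your own problem:

#include <math.h>
#include <stdbool.h>

/* Returns true if x and y agree to within an absolute tolerance (useful
   near zero) or a relative tolerance (useful everywhere else). The
   tolerances are application-dependent, not magic constants. */
static bool nearly_equal(float x, float y, float abs_tol, float rel_tol)
{
    float diff = fabsf(x - y);
    if (diff <= abs_tol)
        return true;
    return diff <= rel_tol * fmaxf(fabsf(x), fabsf(y));
}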

As I write this today, I believe that CUDA GPUs have a much better-designed architecture for floating point arithmetic than any contemporary CPU. GPUs include native IEEE 754 (2008) support for 16-bit floats and FMAD, have full-speed support for denormals, and enable rounding control on a per-instruction basis rather than via control words whose settings have side effects on all FP instructions and are expensive to change.
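
As an illustration of per-instruction rounding, CUDA device code can select the rounding mode of an individual operation with intrinsics such as __fadd_rd() and __fadd_ru(); the interval-style kernel below is a minimal sketch (the kernel name and launch configuration are placeholders). The CPU equivalent, shown in the trailing comment, changes thread-global state with fesetround() and affects every subsequent FP instruction until it is restored.

/* Per-instruction rounding in CUDA device code: each operation carries its
   own rounding mode, and no global state is touched. Compile with nvcc. */
__global__ void interval_add(const float *a, const float *b,
                             float *lo, float *hi, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        lo[i] = __fadd_rd(a[i], b[i]);  /* round toward -infinity */
        hi[i] = __fadd_ru(a[i], b[i]);  /* round toward +infinity */
    }
}

/* The CPU route goes through a control word that affects the whole thread:
 *
 *     #include <fenv.h>
 *     fesetround(FE_DOWNWARD);   float lo = a + b;
 *     fesetround(FE_UPWARD);     float hi = a + b;
 *     fesetround(FE_TONEAREST);  // restore the default
 */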

In contrast, CPUs have an excess of per-thread state and poor performance except when using SIMD instructions, which mainstream compilers are terrible at exploiting for performance (since vectorizing scalar C code to take advantage of such instruction sets is much more difficult than building a compiler for a pseudo-scalar architecture such as CUDA). And if the Wikipedia history page is to be believed, Intel and AMD appear to have completely botched the addition of FMAD support in a way that defies description.

You can find an excellent discussion of floating point precision and IEEE support in NVIDIA GPUs here.