A new version of the Streaming Multiprocessors chapter has been uploaded. This one merges the coverage of the math library for float and double, and improves the coverage of shared memory (especially shared memory atomics) and conditional code.
The code emitted by the compiler when performing shared memory atomics turns out to be the perfect illustration of how CUDA hardware handles conditional code. For the SM 2.0 architecture, a shared atomic add compiles to the following microcode (excerpted from Listing 8-2):
/*0040*/ SSY 0x80;
/*0048*/ BAR.RED.POPC RZ, RZ;
/*0050*/ LD R0, [R0];
/*0058*/ LDSLK P0, R2, [R3];
/*0060*/ @P0 IADD R2, R2, R0;
/*0068*/ @P0 STSUL [R3], R2;
/*0070*/ @!P0 BRA 0x58;
/*0078*/ NOP.S CC.T;
The SSY/NOP.S instructions bracket the divergence and convergence of the loop, which iterates until all threads in the warp have performed their atomic operation. Inside the loop, the LDSLK instruction attempts to lock the bank in shared memory. The actual atomic operation is predicated on whether the lock was acquired; conversely, the branch to reattempt the atomic operation is predicated so as to be taken if the lock was not acquired.
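To make the control flow of that loop concrete, here is a rough host-side model in plain C++. It is only a sketch, not CUDA code and not a description of how the hardware is built. The function name `simulateSharedAtomicAdd`, the `WarpAtomicResult` struct, and the rule that exactly one lane wins the lock per pass are all assumptions made for illustration. It also assumes every lane targets the same shared memory word, which is the worst case for contention.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: not part of CUDA or any real API.
struct WarpAtomicResult {
    std::int32_t finalValue;  // contents of the shared word once every lane is done
    int iterations;           // number of trips through the LDSLK..BRA loop
};

// Models one warp performing a shared atomic add on a single word.
// Assumption: all lanes contend for the same word and one lane wins per pass.
WarpAtomicResult simulateSharedAtomicAdd(std::int32_t initial,
                                         const std::vector<std::int32_t>& addends) {
    assert(addends.size() <= 32);  // a warp has at most 32 lanes

    std::int32_t shared = initial;
    std::vector<bool> done(addends.size(), false);
    std::size_t remaining = addends.size();
    int iterations = 0;

    while (remaining > 0) {      // loop bracketed by SSY / NOP.S in the microcode
        ++iterations;
        bool lockTaken = false;  // the previous winner's STSUL released the lock
        for (std::size_t lane = 0; lane < addends.size(); ++lane) {
            if (done[lane]) continue;     // lane already left the loop
            bool p0 = !lockTaken;         // LDSLK: P0 is true if the lock was acquired
            if (p0) {
                lockTaken = true;
                std::int32_t r2 = shared; // LDSLK also loads the current value
                r2 += addends[lane];      // @P0 IADD
                shared = r2;              // @P0 STSUL: store and unlock
                done[lane] = true;
                --remaining;
            }
            // @!P0 BRA: lanes with P0 false stay active and retry on the next pass
        }
    }
    return {shared, iterations};
}
```

Under these assumptions, a call such as `simulateSharedAtomicAdd(10, {1, 2, 3, 4})` takes one pass per contending lane, which is the serialization the predicated branch produces when every thread in the warp hits the same location.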
Enjoy the new version of the chapter, and stay tuned as we turn our attention to the other topics that the book will cover!