The following errors have been discovered in the first printing of The CUDA Handbook.
If you find a mistake that is not listed here, please write an email so it can be added to this page and possibly corrected in future printings.
Page 34-35: Figures 2.24 and 2.25 (CPU-bound and GPU-bound push buffers) are reversed.
Page 102: Table 4.8 should read: “machine [32:64]” (credit to Daniel Galvez)
Page 175: “Without it, the GPU would still be processing the last kernel invocations when the end top is recorded…” – that should be “when the end time is recorded…”
Page 252: In Table 8.11, second row from the bottom (log2x), the base of the logarithm (2) should be subscripted and x should not be subscripted.
Page 259: For the atan2() function, the expression is incorrect. There should not be an ‘x’ between the superscripted -1 and the parenthesized expression (y/x).
Pages 307-308: the next multiple of 64 above 950 is 960, not 964.
Page 310: The “width” parameter of cudaMallocArray is the number of elements, not bytes.
Page 366: Figure 12.1 – the “log-step reduction” expression should read: “(((a0+a4)+(a1+a5))+((a2+a6)+(a3+a7)))”
Page 393: Figure 13.10 contains some extra fans. Numbering the columns 0 to 15 (left to right), and the rows 0 to 4 (top to bottom), the fans originating at locations (3, 1), (7, 1) and (11, 1) should not be present. Thanks to Peter Longhurst for pointing this out!
Page 417: Line 147 of streamCompact_odd.cuh declares an int instead of T (thanks to Louise Knight for pointing this out!):
<int value = (index < N) ? in[index] : 0;
>T value = (index < N) ? in[index] : 0;
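The correction for pages 307-308 above can be checked with the usual round-up-to-a-multiple idiom. This is a minimal sketch in host-side C++. The helper name roundUp is hypothetical and is not taken from the book.

```cpp
#include <cstddef>

// Hypothetical helper: smallest multiple of `multiple` that is >= n.
// Assumes multiple > 0 and that n + multiple - 1 does not overflow.
constexpr std::size_t roundUp(std::size_t n, std::size_t multiple)
{
    return ((n + multiple - 1) / multiple) * multiple;
}

// 950 rounded up to a multiple of 64 is 15 * 64 = 960; 964 is not a multiple of 64.
static_assert(roundUp(950, 64) == 960, "next multiple of 64 above 950");
static_assert(964 % 64 != 0, "964 is not a multiple of 64");
```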
Hi, I think there is something wrong in the function “chMemcpyHtoD” of pageableMemcpyHtoD.cu, as follows:
void
chMemcpyHtoD( void *device, const void *host, size_t N )
{
    cudaError_t status;
    char *dst = (char *) device;
    const char *src = (const char *) host;
    int stagingIndex = 0;
    while ( N ) {
        size_t thisCopySize = min( N, STAGING_BUFFER_SIZE );
        CUDART_CHECK( cudaEventSynchronize( g_events[stagingIndex] ) );
        memcpy( g_hostBuffers[stagingIndex], src, thisCopySize );
        CUDART_CHECK( cudaMemcpyAsync( dst, g_hostBuffers[stagingIndex], thisCopySize,
            cudaMemcpyHostToDevice, NULL ) );
        CUDART_CHECK( cudaEventRecord( g_events[1-stagingIndex], NULL ) );
        dst += thisCopySize;
        src += thisCopySize;
        N -= thisCopySize;
        stagingIndex = 1 - stagingIndex;
    }
Error:
    return;
}
It seems odd that the event index is “1-stagingIndex” rather than “stagingIndex” in “CUDART_CHECK( cudaEventRecord( g_events[1-stagingIndex], NULL ) );”, because this causes the copies to run serially.
After I changed “1-stagingIndex” to “stagingIndex”, I still got the correct result, and the copy ran twice as fast.
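The effect of the event index can be traced with a small host-only model. This is a sketch and not code from the book: it makes no CUDA calls, and the names waitsFor and recordOnOtherEvent are hypothetical. It assumes that cudaEventSynchronize returns immediately for an event that has never been recorded. For each chunk, it reports which earlier chunk's asynchronous copy the CPU must wait for before refilling the staging buffer.

```cpp
#include <vector>

// Host-only model of the two-buffer staging loop (no CUDA calls).
// events[k] stores the index of the chunk whose async copy was issued just
// before event k was last recorded, or -1 if event k was never recorded.
// The result holds, for each chunk, the chunk whose copy must finish before
// the CPU may refill the staging buffer (-1 means no wait).
std::vector<int> waitsFor(int numChunks, bool recordOnOtherEvent)
{
    int events[2] = { -1, -1 };
    int stagingIndex = 0;
    std::vector<int> result;
    for (int chunk = 0; chunk < numChunks; ++chunk) {
        // Models cudaEventSynchronize( g_events[stagingIndex] ).
        result.push_back(events[stagingIndex]);
        // The memcpy into the staging buffer and the cudaMemcpyAsync of this
        // chunk would be issued here.
        // Models cudaEventRecord on either g_events[1-stagingIndex] (as
        // quoted) or g_events[stagingIndex] (the proposed change).
        int recordIndex = recordOnOtherEvent ? 1 - stagingIndex : stagingIndex;
        events[recordIndex] = chunk;
        stagingIndex = 1 - stagingIndex;
    }
    return result;
}
```

In this model, waitsFor(4, true) yields {-1, 0, 1, 2}: every chunk waits for the copy issued immediately before it, so the CPU memcpy cannot overlap the previous transfer. waitsFor(4, false) yields {-1, -1, 0, 1}: each chunk waits only for the copy that last used the same staging buffer. That is the overlap double buffering is meant to provide, and it is consistent with the speedup reported above.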