Decoder Ring: SM Versions

This reference summarizes the capabilities added with each version of the streaming multiprocessor (SM), along with an example chip that implemented each feature set. The SM version can be queried by calling cuDeviceComputeCapability(), or by examining the major and minor members of the cudaDeviceProp structure passed back by cudaGetDeviceProperties().
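As a minimal sketch of the runtime API route described above, the following host-only program prints the SM version of each device it finds. It assumes the CUDA toolkit is installed and that the file is built with nvcc (or with a C++ compiler linked against the CUDA runtime). The driver API route via cuDeviceComputeCapability() is not shown.

```cpp
// Host-only sketch: print the SM version of every CUDA device.
// Assumes the CUDA toolkit is installed; build with nvcc, or with a
// C++ compiler linked against the CUDA runtime library (cudart).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaError_t status = cudaGetDeviceCount(&deviceCount);
    if (status != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(status));
        return 1;
    }

    for (int device = 0; device < deviceCount; ++device) {
        cudaDeviceProp prop;
        status = cudaGetDeviceProperties(&prop, device);
        if (status != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                         cudaGetErrorString(status));
            return 1;
        }
        // prop.major and prop.minor together form the SM version, e.g. 3 and 5 for SM 3.5.
        std::printf("Device %d (%s): SM %d.%d\n",
                    device, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

The output depends on the installed hardware, so no sample output is given here.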
 Compute Level  Introduced…  Example Chip
 SM 1.0  CUDA  G80*
 SM 1.1  Global memory atomics; mapped pinned memory; debugger support (e.g. breakpoint instruction)  G84
 SM 1.2  Relaxed coalescing constraints; warp voting (__any() and __all() intrinsics); atomic operations on shared memory  MCP79
 SM 1.3  Double precision  GT200
 SM 2.0  64-bit addressing; L1 and L2 caches; concurrent kernel execution; configurable 16K or 48K shared memory; bit manipulation instructions (__clz(), __popc(), __ffs(), __brev()); directed rounding for single-precision floating point values; fused multiply-add; 64-bit clock counter; surface load/store; 64-bit global atomic add, exchange, and compare-and-swap; atomic add for single-precision floating point values; ballot warp voting instruction (__ballot()); assertions and formatted output (printf)  GF100
 SM 2.1  Stack-based ABI, enabling function calls and indirect calls within kernels  GF104
 SM 3.0  Warp shuffle; permute; faster global atomics; 32K/32K shared memory configuration; configurable shared memory (32- or 64-bit mode)  GK104
 SM 3.5 Bindless textures; 64-bit global atomic min, max, AND, OR, and XOR; 64-bit funnel shift; read global memory via texture; dynamic parallelism  GK110
* G80 is the only chip that implemented SM 1.0. All subsequent chips, including G84 and G86, included improvements such as global atomics and the ability to read/write pinned memory.
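
One way to use the table in host code is to turn a few of its rows into feature checks. The sketch below is illustrative only: SmVersion, smCode(), atLeast(), the has...() helpers, and exampleChecks() are hypothetical names, not part of the CUDA API, and the code assumes that each SM version keeps every capability introduced by the versions before it.

```cpp
// Hypothetical helpers (not part of the CUDA API) that encode a few
// rows of the table as minimum-version checks.
#include <cassert>

struct SmVersion {
    int smMajor;  // e.g. cudaDeviceProp::major
    int smMinor;  // e.g. cudaDeviceProp::minor
};

// Encode as major * 10 + minor so versions compare as integers (SM 3.5 -> 35).
constexpr int smCode(SmVersion v) { return v.smMajor * 10 + v.smMinor; }

// True if v is at least the required version.
// Assumption: capabilities are cumulative across SM versions.
constexpr bool atLeast(SmVersion v, int reqMajor, int reqMinor) {
    return smCode(v) >= reqMajor * 10 + reqMinor;
}

// Thresholds taken from the table above.
constexpr bool hasGlobalAtomics(SmVersion v)      { return atLeast(v, 1, 1); }
constexpr bool hasSharedAtomics(SmVersion v)      { return atLeast(v, 1, 2); }
constexpr bool hasDoublePrecision(SmVersion v)    { return atLeast(v, 1, 3); }
constexpr bool hasWarpShuffle(SmVersion v)        { return atLeast(v, 3, 0); }
constexpr bool hasDynamicParallelism(SmVersion v) { return atLeast(v, 3, 5); }

// Usage example: a GK104-class device (SM 3.0) versus G80 (SM 1.0).
inline void exampleChecks() {
    const SmVersion gk104{3, 0};
    assert(hasWarpShuffle(gk104));
    assert(!hasDynamicParallelism(gk104));  // dynamic parallelism arrives with SM 3.5

    const SmVersion g80{1, 0};
    assert(!hasGlobalAtomics(g80));         // global atomics arrive with SM 1.1
}
```

Comparing an encoded integer keeps each check to one line. The values passed in would typically come from the major and minor members of cudaDeviceProp.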
