This reference summarizes the capabilities added with each version of the streaming multiprocessor (SM), along with an example of a chip that implemented each feature set. The SM version can be queried by calling the driver API function cuDeviceComputeCapability(), or by examining the major and minor members of the cudaDeviceProp structure filled in by the runtime function cudaGetDeviceProperties().
Compute Level | Introduced… | Example Chip |
SM 1.0 | CUDA | G80* |
SM 1.1 | Global memory atomics; mapped pinned memory; debugger support (e.g. breakpoint instruction) | G84 |
SM 1.2 | Relaxed coalescing constraints; warp voting (__any() and __all() intrinsics); atomic operations on shared memory | MCP79 |
SM 1.3 | Double precision | GT200 |
SM 2.0 | 64-bit addressing; L1 and L2 cache; concurrent kernel execution; configurable 16K or 48K shared memory; bit manipulation instructions (__clz(), __popc(), __ffs(), __brev()); directed rounding for single-precision floating-point values; fused multiply-add; 64-bit clock counter; surface load/store; 64-bit global atomic add, exchange, and compare-and-swap; atomic add for single-precision floating-point values; additional warp voting instructions (e.g. __ballot()); assertions and formatted output (printf) | GF100 |
SM 2.1 | Stack-based ABI, enabling function calls and indirect calls within kernels | GF104 |
SM 3.0 | Warp shuffle; permute; faster global atomics; 32K/32K shared memory configuration; configurable shared memory bank width (32- or 64-bit mode) | GK104 |
SM 3.5 | Bindless textures; 64-bit global atomic min, max, AND, OR, and XOR; 64-bit funnel shift; read global memory via texture; dynamic parallelism | GK110 |
* G80 is the only chip that implemented SM 1.0. All subsequent chips, including G84 and G86, included improvements such as global atomics and the ability to read/write pinned memory.
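
The table can be turned into a simple runtime feature check. The following is a minimal sketch, not a definitive implementation: the names SmVersion, Feature, smCode(), introducedIn(), supports(), and querySmVersion() are hypothetical helpers invented for illustration, not part of the CUDA API. It covers only a handful of rows and assumes that a feature, once introduced, stays available on every later SM version. The query helper is compiled only under nvcc, so the lookup logic can also be built with an ordinary C++11 compiler.

```cpp
#include <cassert>

// Hypothetical helper type (not a CUDA type): holds the values that
// cudaDeviceProp reports in its major and minor members.
struct SmVersion {
    int major;
    int minor;
};

// A few features from the table above, chosen for illustration.
enum class Feature {
    GlobalAtomics,        // introduced in SM 1.1
    SharedMemoryAtomics,  // introduced in SM 1.2
    DoublePrecision,      // introduced in SM 1.3
    FloatAtomicAdd,       // introduced in SM 2.0
    WarpShuffle,          // introduced in SM 3.0
    DynamicParallelism    // introduced in SM 3.5
};

// Encode major.minor as a single comparable integer, e.g. 3.5 -> 35.
inline int smCode(SmVersion v) {
    assert(v.major >= 1 && v.minor >= 0 && v.minor <= 9);
    return v.major * 10 + v.minor;
}

// Encoded SM version in which each feature first appeared, per the table.
inline int introducedIn(Feature f) {
    switch (f) {
        case Feature::GlobalAtomics:       return 11;
        case Feature::SharedMemoryAtomics: return 12;
        case Feature::DoublePrecision:     return 13;
        case Feature::FloatAtomicAdd:      return 20;
        case Feature::WarpShuffle:         return 30;
        case Feature::DynamicParallelism:  return 35;
    }
    return 0;  // not reached for valid enumerators
}

// Assumption: features are cumulative, so any version at or above the
// introducing version supports the feature.
inline bool supports(SmVersion v, Feature f) {
    return smCode(v) >= introducedIn(f);
}

#ifdef __CUDACC__
#include <cuda_runtime.h>

// Fill *out with the SM version of the given device.
// Returns false if the runtime query fails.
inline bool querySmVersion(int device, SmVersion* out) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    out->major = prop.major;
    out->minor = prop.minor;
    return true;
}
#endif
```

With this sketch, a host program built with nvcc could call querySmVersion(0, &v) and then test supports(v, Feature::DynamicParallelism) before choosing a code path that launches kernels from the device. Encoding the version as major * 10 + minor keeps the comparison to a single integer test, at the cost of assuming minor versions stay below 10.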