This post will discuss SIMD instruction sets, peak floating point performance for CPUs, programming models, and ease-of-use, as applied to the flagship N-body application for CUDA.

My involvement in SIMD instruction sets dates back to the mid-1990s, when x86 vendors were adding MMX (all vendors), SSE (Intel) and 3DNow! (AMD, Cyrix et al.) to the x86 architecture. These SIMD instruction sets reflect a trend that had been developing in CPUs for some time: that because most of the die area in a CPU is dedicated to cache, building more-capable execution units incurs design and validation costs, but does not greatly increase manufacturing costs.

I was the development lead for Direct3D 6.0, and one of our goals was not only to optimize the geometry pipeline, but to incorporate SSE and 3DNow! optimizations from Intel and AMD, respectively. By closely collaborating with the CPU vendors, we gave developers a strong incentive to use Direct3D’s built-in geometry pipeline, and the CPU vendors got an easy way to deliver the benefits of their new instructions to end users.

SSE (Streaming SIMD Extensions) and 3DNow! are SIMD (single instruction, multiple data) instruction sets that can perform multiple floating point operations in a single instruction. SSE added a new set of 128-bit registers, plus instructions that could perform 4×32-bit floating point operations per instruction. (I am not going to further describe 3DNow! because all x86 vendors implemented SSE many years ago.)

Programming languages do not contain native language support for this type of packed data, so when they designed SSE, the x86 vendors were relying on compiler developers to build vectorizing compilers that could automatically identify parallelism opportunities in scalar code and compile it into high-quality, SIMD-optimized code. As a stopgap, they worked with compiler vendors to meet developers halfway and identify special data types and “intrinsics” (special functions that correspond to machine instructions) to operate on them. Developers using intrinsics must be intimately familiar with the instruction set, but the tasks of register allocation or instruction scheduling are offloaded onto the compiler. I don’t think intrinsics were ever intended to be anything other than a stopgap, but unfortunately, almost 15 years later, it looks like they are still the easiest way to write high-performance floating point code.

When I decided to focus on gravitational simulation for my N-body chapter, I also decided to see how fast a modern CPU can perform this computation. Many CUDA-versus-CPU performance comparisons are heavily biased toward CUDA, by dint of the CPU code being poorly optimized.

NVIDIA’s CPU implementation of N-body, used as a reference for correctness, is no exception. Unless a smart vectorizing compiler can translate the C code, the CPU reference implementation makes no use of the SSE or AVX instruction sets. That’s about a 4x (for SSE) or 8x (for AVX) performance disadvantage.

The body-body interaction code (which NVIDIA estimates to be 20 FLOPS) is as follows:

template <**typename** T>
__host__ __device__ void bodyBodyInteraction(
T accel[3],
T x0, T y0, T z0, T mass0,
T x1, T y1, T z1, T mass1,
T softeningSquared)
{
T dx = x1 - x0;
T dy = y1 - y0;
T dz = z1 - z0;
T distSqr = dx*dx + dy*dy + dz*dz;
distSqr += softeningSquared;
T invDist = (T)1.0 / (T)sqrt((double)distSqr);
T invDistCube = invDist * invDist * invDist;
T s = mass1 * invDistCube;
accel[0] += dx * s;
accel[1] += dy * s;
accel[2] += dz * s;
}

This function (templated to provide for both float and double precision) is invoked between every pair of bodies. For each body, the gravitational influence of every other body is added up, then these forces are integrated to change the bodies’ positions and velocities for the next timestep.

Now, if you think a bit about how to accelerate this function using a SIMD instruction set, you realize that what you really want to do is execute it four times in parallel: instead of

T dx = x1 - x0;

you want to write:

__m128 dx = _mm_sub_ps( x1, x0 );

where four bodies’ X positions are subtracted from your body’s X position, and so on. But the memory layout used by NVIDIA’s sample isn’t amenable to this formulation: it must be modified to use an SOA (structure of arrays) representation instead of the default AOS (array of structures) representation. In the case of N-body, where each body has a 3D position (x, y, z) and a mass, the N bodies can be represented as:

typedef struct _body {
float x, y, z;
float mass;
} body;

An array of structures (AOS):

struct _body bodies[N];

then contains the input data. The structure of arrays (SOA) representation splits each member of the structure into its own array:

float *bodiesX;
float *bodiesY;
float *bodiesZ;
float *bodiesMass;

It’s easy enough to transform between the two representations such that these identities hold true:

bodies[i].x == bodiesX[i]
bodies[i].y == bodiesY[i]
bodies[i].z == bodiesZ[i]
bodies[i].mass == bodiesMass[i]

Since programmers think of the X/Y/Z/Mass tuples together, the AOS representation seems more intuitive. But for SSE and AVX, the advantage of the SOA representation quickly becomes obvious; since instructions such as SUBPS (subtract packed single-precision float) operate on corresponding floats in the 128-bit registers, it works much better if the input data is streaming into the CPU in the form of packed X’s, Y’s and Z’s as opposed to structures that have to be rearranged in registers. Intel, in fact, was actively evangelizing SOA representations in the early days of SSE, since it enabled 4×4 matrix multiplication of vertices (needed for the geometry pipeline) without rearranging them. As an additional bonus, if the vertices were 3D, there was no need to deal with the inconvenient 12-byte vertex size. Truly general-purpose applications require extra work due to SSE’s alignment constraints (it is fastest to load and store on 16-byte boundaries), and its predilection for doing pretty much 4 of anything at a time necessitates careful handling of boundary conditions.

Once the data is rearranged as SOA, you can reformulate bodybodyInteraction() using SSE intrinsics to operate on four partial sums (gravitational force contributions to a given body) as follows:

inline void
bodyBodyInteraction(
__m128& a0, __m128& a1, __m128& a2,
const __m128& x0, const __m128& y0, const __m128& z0,
const __m128& x1, const __m128& y1, const __m128& z1, const __m128& mass1,
const __m128& softeningSquared)
{
__m128 dx = _mm_sub_ps( x1, x0 );
__m128 dy = _mm_sub_ps( y1, y0 );
__m128 dz = _mm_sub_ps( z1, z0 );
__m128 distSq = _mm_add_ps( _mm_add_ps( _mm_mul_ps( dx, dx ), _mm_mul_ps( dy, dy ) ), _mm_mul_ps( dz, dz ) );
distSq = _mm_add_ps( distSq, softeningSquared );
__m128 invDist = rcp_sqrt_nr_ps( distSq );
__m128 invDistCube = _mm_mul_ps( invDist, _mm_mul_ps( invDist, invDist ) );
__m128 s = _mm_mul_ps( mass1, invDistCube );
a0 = _mm_add_ps( a0, _mm_mul_ps( dx, s ) );
a1 = _mm_add_ps( a1, _mm_mul_ps( dy, s ) );
a2 = _mm_add_ps( a2, _mm_mul_ps( dz, s ) );
}

This routine uses the <xmmintrin.h> header file (available both on Linux for gcc and Windows via MSVC), which declares the __m128 data type and the intrinsic functions that correspond to the SSE instruction set.

This routine uses two subroutines, both of which I was disappointed not to find in <xmmintrin.h> (in fact, <xmmintrin.h> seems not to have changed since 1998 or so when it was introduced): a subroutine to compute the sum of the four 32-bit floats in an SSE register, and a subroutine to compute the reciprocal square root. Because the RSQRTPS instruction only returns a 12-bit approximation, the *rcp_sqrt_nr_ps()* function, adapted from here, performs one Newton-Raphson iteration to achieve almost-full single precision on the four 32-bit floats in the SSE register. NOTE TO CPU ARCHITECTS: the SFU (“special function unit) in CUDA hardware implements a much higher-precision (22-23 bits) approximation to reciprocal square root that often can be used directly. The instruction sequence generated by *rcp_sqrt_nr_ps()* is 6 FLOPS but as many as 10-12 instructions (depending on how the constants are handled).

static inline __m128
rcp_sqrt_nr_ps(const __m128 x)
{
const __m128
nr = _mm_rsqrt_ps(x),
muls = _mm_mul_ps(_mm_mul_ps(nr, nr), x),
beta = _mm_mul_ps(_mm_set_ps1(0.5f), nr),
gamma = _mm_sub_ps(_mm_set_ps1(3.0f), muls);
return _mm_mul_ps(beta, gamma);
}

The other subroutine, adapted from a stackoverflow answer, computes the “horizontal” sum of the four 32-bit floats in an SSE register:

static inline __m128
horizontal_sum_ps( const __m128 x )
{
const __m128 t = _mm_add_ps(x, _mm_movehl_ps(x, x));
return _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));
}

I don’t know about you guys, but I don’t think this intrinsics-based SSE code comes close to approaching the readability of the CUDA code, which “looks” scalar and is templatized to support double precision.

It does run fast, however; the resulting N-Body implementation, which is still single-threaded, takes 54 milliseconds to process 4096 bodies (4096^{2}=16777216 interactions) as opposed to 583 milliseconds for the naïve (presumably x87) implementation. I haven’t yet analyzed why it is 10x faster instead of 4x faster; I suspect the SSE reciprocal square root sequence is quite a bit faster than the scalar equivalent. Note, however, that recompiling the application with SSE2 optimizations enabled does not improve performance noticeably (not with Visual Studio, anyway).

Further performance opportunities exist in multithreading (it should scale almost-linearly in the number of cores) and porting to AVX, which should almost-double performance.

The not-yet-optimized CUDA implementation takes about 4.5 ms to do the same computation on a GK104: not 100x faster, but 10x faster and the code is much more readable. A side-by-side comparison of the two bodyBodyInteraction() routines illustrates the point made by Dr. Vincent Natoli, in his *Kudos for CUDA* article in hpcwire:

In a recent project we reduced 3,500 lines of highly-optimized C code to a CUDA kernel of about 800 lines. The optimized C was peppered with inline assembly, SSE macros, unrolled loops and special cases, making it difficult to read, extract algorithmic meaning and extend in the future. By comparison the CUDA code was cleaner and more readable. Ultimately it will be easier to maintain.

Which would you rather write?

T distSqr = dx*dx + dy*dy + dz*dz;

or

__m128 distSq = _mm_add_ps( _mm_add_ps( _mm_mul_ps( dx, dx ), _mm_mul_ps( dy, dy ) ), _mm_mul_ps( dz, dz ) );

?

The story’s still out on exactly how much of a performance advantage CUDA will have over a truly optimized CPU implementation of this application, but until vectorized compilers go mainstream, for this type of code, CUDA will have a big advantage in programmability.