C++ Accelerate Massive Parallelism

by Call me dave... 25. June 2011 22:43

I have been an avid follower of Herb Sutter’s writings for years, from his magazine articles in the C++ Report and the C/C++ Users Journal, his books on C++ development and recently his series of posts on parallel programming and concurrency for Dr Dobbs.  Herb is the chair of the C++ standards committee where he and others are near to finalising C++0x standard.  In essence Herb knows his stuff.

Herb recently announced C++ AMP at the AMD Fusion conference and I was blown away.  Microsoft have modified their C++ compiler so that it can target both the CPU and a DirectX compatible graphics card within the same program.  This enables a developer to write code that utilises the hundreds of cores available on graphics cards, to perform blazingly fast parallel computations, without resorting to the current practice of writing code in a language designed for graphics processing (High Level Shader Language [HLSL] or GL Shading Language [GLSL]) or within a C like language (CUDA or OpenCL) where the OO benefits of C++ are missing.

Instead the keyword restrict is used to mark a method as being able to be executed on the graphics card.  The developer then uses a subset of the C++ language (enforced by the compiler) to implement the computation.  The complexity of moving data between the CPU, RAM and the GPU is also simplified by the introduction of a set of classes that automatically handle the marshalling of data between the devices.  The compiler converts the C++ code into HLSL code which is then embedded into the executable.  At runtime the HLSL code is sent to the DirectX driver which in turn converts it into the appropriate machine code (for a particular device) and executes it on the graphics card.

The example below (from Daniel Moth’s presentation - URL below) is a simple matrix by matrix multiplication.  Its clear that the outer for loop can be parallelised.  An interesting point is the complex indexing to pull out values from vA and vb which are both one dimensional arrays but actually store a two dimensional structure.

void MatrixMultiply( vector<float>& C,
const vector<float>& vA,
const vector<float>& vB, int M, int N, int W )
  for (int y = 0; y < M; y++) {
    for (int x = 0; x < N; x++) {
      float sum = 0;
      for(int i = 0; i < W; i++)
      sum += vA[y * W + i] * vB[i * N + x];
      vC[y * N + x] = sum;
Using C++ AMP the matrices are marshalled across to the GPU by wrapping the vectors in array_views.  Further the array_views are used to project the two dimensional matrix onto the one dimensional vector and so the complex indexing code isn’t present in the C++ AMP version.  The code that executes on the GPGPU are the lines within the lambda expression that marked with the restrict(direct3d) keyword.  On the graphics card the texture that is getting the result of computation is write only hence variable c is of type array_view<writeonly<>>.
void MatrixMultiply( vector<float>& vC,
const vector<float>& vA,
const vector<float>& vB, int M, int N, int W )
  array_view<const float,2> a(M,W,vA),b(W,N,vB);
  array_view<writeonly<float>,2> c(M,N,vC);
  parallel_for_each(c.grid, [=](index<2> idx) restrict(direct3d) {
   float sum = 0;
   for(int i = 0; i < a.extent.x; i++)
     sum += a(idx.y, i) * b(i, idx.x);
     c[idx] = sum;

There are millions of lines of C++ code used in finance industry for modelling derivatives and many of the larger institutions are now looking at GPGPU technology to give them an edge in the competitive world of algorithmic trading.  CUDA is currently the default choice for GPGPU development it will be very interesting in twelve months to see if this trend reverses due to both the simplicity of parallelising a function using AMP C++.

Herb’s keynote.

Daniel Moth: Blazing-fast code using GPUs and more, with C++ AMP

NVidia will also support C++ AMP but points out CUDA is available for Linux and Mac as well as Windows

Finally Microsoft have indicated that they will take their C++ extensions to a standards board thus enabling other compiler vendors to implement the restrict keyword in their products.  Hopefully this will mean that in a couple of years it will be possible to have C++ code target other devices such as FPGAs, PS3’s cell processor and Intel’s Many Integrated Cores (MIC) daughter board.


Side Note: 

Eventually all good ideas come back again. When C with Classes was being created it had a readonly and a writeonly keyword and when C++ was created the C standard committee liked the concept so much they asked that the readonly keyword by renamed to const and added it to the language. Meanwhile writeonly was dropped entirely. Instead of adding writeonly as a keyword Microsoft have gone down the library root i.e. writeonly<T> instead of writeonly T. Personally I think a keyword would have been far cleaner but then again I don’t work on the standards committee and have no idea how complex it would be to add into the official ISO C++ standard…  But considering how long its taken to complete this version of the standard I cant say I am terribly surprised...



26/06/2011 7:17:06 AM #


Pingback from cardprocessing.funwit.us

C++ Accelerate Massive Parallelism | Card ProcessingCard Processing

cardprocessing.funwit.us |

Comments are closed