ECRYPT-EU: Learn You a GPU For Great Good! (Part 1?)

Side note: I stole the title from the most famous, most awesome Haskell book I know.

If you are reading this blog you are most likely interested in cryptography. Today I want to convince you that GPUs are also, well, pretty awesome. I have personally done a few crypto-related projects using GPUs and this post is my attempt at crystallizing the knowledge and experience I built up during that time.

The purpose of this post is to provide a simple, meaningful introduction to developing GPU-accelerated programs. We will discuss setup, the two primary frameworks, basic code examples and development workflow, as well as some optimization tips. In the end, I want to show that developing this type of application is not hard at all. If the post is successful I may do a follow-up with a few more detailed and tricky examples. Throughout this post I will assume you are familiar with basic C and/or C++, as the code examples will be in those languages. I will not focus too much on developing complicated kernels or on how to exploit multi-dimensional parallelism; I will leave that for a later post. Instead, I will focus on a few things that may make your first steps in GPU programming easier, as well as a few things that may help your code scale a bit better.

The Why & When

GPU programming was originally designed for, and should be used for, large-scale parallel computation problems. The more parallelism you can utilize, the better GPUs will fit your problem. The simplest example is probably a loop over a very large collection of elements that performs a simple, independent operation on each one.

For large-scale parallel computation problems I tend to think of three different architectural setups that you can use (they also mix). The simplest is utilizing multi-core CPUs (possibly over many machines). This has the shortest development time due to its familiarity and ease of use, and is suitable for many applications. CPUs are of course trivially available. On the other end of the spectrum is the development of custom hardware clusters, utilizing many FPGAs or even ASICs. Development time is fairly long, even for experienced hardware designers; the upside is that this very likely gives you optimal performance.

GPUs fall somewhere in the middle. Development time is very close to that for CPUs; the primary constraint is availability. It is simply easier to get access to CPU clusters. However, these days you can also rent all the GPU power you need from Amazon EC2 instances, as was done for the recent SHA1 collision. If you solve the availability issue, you can get a lot of bang for your buck performance-wise.

The How

First, you need to get your hands on a machine with a GPU, preferably a remote machine or otherwise a machine with more than one GPU. The reason is that if your GPU is also driving your desktop environment, programming errors may cause your computer to hang or crash. It also allows you to more easily run long-lasting kernels as well as giving you more reliable performance.

CUDA vs OpenCL

Assuming you have a GPU in your system, your next choice is between CUDA and OpenCL, two programming environments for GPU programming. If you do not plan to use an NVIDIA GPU you are stuck with OpenCL; with an NVIDIA GPU you can choose either.
Having used both for different projects at different times, I can say that both are perfectly usable and that the differences are mostly superficial. OpenCL is more portable and integrates more easily into existing projects; CUDA has the superior tool-chain.

The examples in this post will be for CUDA, as it typically involves less boilerplate. Also, we will use the more basic CUDA C++ implementation, as it provides a better basis for understanding than special-purpose libraries. This is particularly relevant if you want to perform computations that are not a native part of these libraries, which is definitely true if you want to, for instance, compute CPA-like correlations in parallel.

Hello World

I am not one to break tradition and thus we start with the "Hello world" of classic parallel programming, namely SAXPY. More formally, given input vectors $\textbf{x}, \textbf{y}$ of length $n$ and a scalar $a$, compute the output vector $\textbf{z}$ where $\textbf{z} = a\textbf{x} + \textbf{y}$.
First let us consider the basic C implementation of this function, where $z = y$, i.e., we update $y$ using the scalar $a$ and a vector $x$.

void saxpy(int n, float a, float * __restrict__ x, float * __restrict__ y) {
  for (int i = 0; i < n; ++i) {
    y[i] = a*x[i] + y[i];
  }
}

// ...
int n = 1 << 20;
// allocate vectors x,y of n elements each.
// ...

saxpy(n, 3.14, x, y);

Nothing too special going on here. We simply iterate over every element and perform our update with the scalar $a=3.14$. Note the use of the __restrict__ keyword to indicate that x and y point to different objects in memory. Just giving the compiler a helping hand, which is generally a useful thing to do. Anything that makes it behave less like a random function, I say.

Conversion to CUDA is straightforward. In GPU programming you are always defining what a single parallel unit of computation is doing; this is called a kernel. When programming such a kernel, you write the computation from the point of view of a single thread. Before delving in too deep, let us see what the CUDA-equivalent code looks like.

 1: __global__
 2: void saxpy(int n, float a, float * __restrict__ x, float * __restrict__ y) {
 3:   int i = blockIdx.x*blockDim.x + threadIdx.x;
 4:   if (i < n) {
 5:     y[i] = a*x[i] + y[i];
 6:   }
 7: }
 8: 
 9: // ...
10: const int n = 1<<20;
11: 
12: // allocate and initialize host-side buffers x,y
13: // ...
14: 
15: float *d_x, *d_y; // allocate device-side buffers x,y
16: cudaMalloc((void **)&d_x, sizeof(float) * n);
17: cudaMalloc((void **)&d_y, sizeof(float) * n);
18: 
19: // copy host-side buffers to the device
20: cudaMemcpy(d_x, x, sizeof(float) * n, cudaMemcpyHostToDevice);
21: cudaMemcpy(d_y, y, sizeof(float) * n, cudaMemcpyHostToDevice);
22: 
23: // compute the saxpy
24: const int threads_per_block = 256;
25: const int number_of_blocks = (n + threads_per_block - 1) / threads_per_block; // round up
26: saxpy<<<number_of_blocks, threads_per_block>>>(n, 3.14f, d_x, d_y);
27: 
28: // copy the output buffer from the device to the host
29: cudaMemcpy(y, d_y, sizeof(float) * n, cudaMemcpyDeviceToHost);
30: 
31: // free the device buffers
32: cudaFree(d_x);
33: cudaFree(d_y);
34: 
35: // clean up the x, y host buffers
36: // ...

Let us consider the kernel first, which is marked as such by the __global__ qualifier at the start of the function definition. The parameters to the function are the same as before, nothing special there. Line 3 is a key first step in any kernel: we need to figure out the correct offset into our buffers x and y. To understand this, we need to understand CUDA's notion of threads and blocks (work items and work groups, respectively, in OpenCL).

The Grid

The CUDA threading model is fairly straightforward to imagine. A thread essentially computes a single instance of a kernel. These threads form groups called blocks that have somewhat-more-efficient inter-thread communication primitives. The blocks together form what is known as the grid. The grid can have up to three dimensions, i.e., the blocks can be ordered into $(x, y, z)$ coordinates. The same goes for threads inside a block: they can be addressed with $(x, y, z)$ coordinates as well.

Mostly though, I have tended to stick to 1-dimensional grids. This simply divides a vector of $n$ elements into $\lceil n/m \rceil$ sequential blocks of $m$ elements each (even better if $n$ is a multiple of $m$, so that the last block is full).
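
As a small illustration of this division, the block count can be computed with a rounding-up integer division. This is a plain C++ sketch that runs on the host; `blocks_for` is a name made up for this example and is not part of CUDA.

```cpp
#include <cassert>

// Hypothetical helper (not a CUDA function): number of blocks of size m
// needed to cover n elements, rounding up so that a partial last block
// still gets scheduled.
inline int blocks_for(int n, int m) {
  assert(n >= 0 && m > 0);
  return (n + m - 1) / m;
}
```

With $n = 2^{20}$ and $m = 256$ this gives 4096 blocks. With $n = 1000$ it gives 4 blocks, and the last 24 threads of the last block have no element to process, which is why kernels usually guard their work with a bounds check.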

A quick note about warps, which is a related concept (AMD hardware calls them wavefronts). A warp is the unit of scheduling: it determines the number of threads that actually execute in lockstep, which is 32 on current NVIDIA GPUs. It is good practice to make your block size a multiple of the warp size, but other than that you should not worry overly much about warps.

In this case we find the global index of our thread by multiplying the block id with the size of the block and then adding the offset of the thread within the block. The rest of the kernel is straightforward: we simply perform the same computation as in the original code, but we omit the for-loop. The conditional at line 4 makes sure we do not write outside the bounds of our vector, though that should not happen if we choose our grid carefully.
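
To make the index computation concrete, here is a rough host-side emulation of a 1-dimensional launch in plain C++. It is only a sketch: the two loops stand in for what the GPU runs in parallel, and `saxpy_emulated` and its parameter names are invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a 1-dimensional launch, for illustration only.
// The loops over block_idx and thread_idx stand in for the hardware,
// which would run these iterations in parallel.
inline void saxpy_emulated(int number_of_blocks, int threads_per_block,
                           int n, float a,
                           const std::vector<float> &x, std::vector<float> &y) {
  assert(threads_per_block > 0 && n >= 0);
  assert(x.size() >= static_cast<std::size_t>(n));
  assert(y.size() >= static_cast<std::size_t>(n));
  for (int block_idx = 0; block_idx < number_of_blocks; ++block_idx) {
    for (int thread_idx = 0; thread_idx < threads_per_block; ++thread_idx) {
      // Same index computation as in the kernel.
      const int i = block_idx * threads_per_block + thread_idx;
      if (i < n) {  // threads past the end of the vector do nothing
        y[i] = a * x[i] + y[i];
      }
    }
  }
}
```

Calling it with 4 blocks of 256 threads and $n = 1000$ updates every element once. With only 3 blocks, elements 768 through 999 are left untouched, which is the kind of silent error a badly chosen grid produces.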

The rest of the code is the standard boilerplate that you will find in most CUDA programs. A key notion is that there is a distinction between buffers allocated on the device (the GPU) and buffers allocated on the host. Note that on line 26 we schedule the kernel for execution. The first two weird-looking parameters (within angle brackets) are the number of blocks and the block size respectively.

Improving & Testing "Hello World"

To showcase a few things that I found helpful we are going to improve this simple code example. And because this is my blog post and I decide what is in it, I get to talk to you about how to test your code. GPU code tends to be a bit flaky: it breaks easily. Thus, I argue that creating simple tests for your code is essential. These do not have to be very complicated but I recommend that you use a proper framework for writing unit tests. For C++ I have had success with Catch and doctest, both single-header libraries that you include into your project.

Before we include these tests however, I propose that we make two more changes to the program. First of all, we are going to add better error checking. Most of the cudaFoo functions return a cudaError_t value: cudaSuccess if the operation was successful, and otherwise an error code that cudaGetErrorString can turn into a readable message.

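#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Wrap any call that returns a cudaError_t, e.g. check(cudaMalloc(...)).
#define check(e) do { check_impl((e), __FILE__, __LINE__); } while (0)

inline cudaError_t check_impl(cudaError_t result, const char *file, int line) {
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s (%s:%d)\n", cudaGetErrorString(result), file, line);
    // Note: assert compiles to nothing when NDEBUG is defined.
    assert(result == cudaSuccess);
  }
  return result;
}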

And then simply wrap the cudaFoo functions with this check macro. Alternatively, you may want to rewrite this to use exceptions instead of asserts. Pick your poison.
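
As a rough sketch of the exception-based alternative, the same wrapper could throw instead of asserting. So that the example compiles without the CUDA toolkit, `status_t`, `status_success`, `status_failure` and `status_string` are stand-ins invented here; in real code they would be cudaError_t, cudaSuccess and cudaGetErrorString. The names `check_throw` and `check_throw_impl` are likewise hypothetical.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-ins so this sketch compiles without the CUDA toolkit. In real code
// these would be cudaError_t, cudaSuccess and cudaGetErrorString.
enum status_t { status_success = 0, status_failure = 1 };
inline const char *status_string(status_t s) {
  return s == status_success ? "no error" : "some error";
}

#define check_throw(e) check_throw_impl((e), __FILE__, __LINE__)

// Throws instead of asserting, so the check also fires when NDEBUG is
// defined and the caller can decide how to recover.
inline void check_throw_impl(status_t result, const char *file, int line) {
  assert(file != nullptr);
  if (result != status_success) {
    throw std::runtime_error(std::string("Runtime error: ") + status_string(result) +
                             " (" + file + ":" + std::to_string(line) + ")");
  }
}
```

The trade-off is the usual one: an assert stops the program at the failing call, which is convenient while debugging, whereas an exception lets a test framework report the failure and carry on with the next test.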

Another thing I would recommend adding if you are doing CUDA in C++ is wrapping most of the allocation and de-allocation logic in a class. I generally take a more utilitarian view of classes for simple pieces of code and thus the following is not necessarily idiomatic or good C++ code.

class Saxpy {
public:
  const int n;
  float *d_x;
  float *d_y;
  float *x;
  float *y;

  Saxpy(const int n) : n(n) {
    x = new float[n];
    y = new float[n];

    check(cudaMalloc((void **)&d_x, sizeof(float) * n));
    check(cudaMalloc((void **)&d_y, sizeof(float) * n));
  }

  // The class owns raw pointers, so forbid copies to avoid double frees.
  Saxpy(const Saxpy &) = delete;

  ~Saxpy() {
    check(cudaFree(d_x));
    check(cudaFree(d_y));

    delete[] x;
    delete[] y;
  }

  // Fill the host buffers with test data and copy them to the device.
  Saxpy& fill() {
    for (int i = 0; i < n; ++i) {
      x[i] = i / 12.34;
      y[i] = i / 56.78;
    }

    check(cudaMemcpy(d_x, x, sizeof(float) * n, cudaMemcpyHostToDevice));
    check(cudaMemcpy(d_y, y, sizeof(float) * n, cudaMemcpyHostToDevice));

    return *this;
  }

  Saxpy& run(float a) {
    const int threads_per_block = 256;
    // Round up so that a partial last block is still launched.
    const int number_of_blocks = (n + threads_per_block - 1) / threads_per_block;
    saxpy<<<number_of_blocks, threads_per_block>>>(n, a, d_x, d_y);
    // A kernel launch returns nothing; fetch launch errors explicitly.
    check(cudaGetLastError());

    return *this;
  }

  // Wait for the kernel to finish and copy the result back to the host.
  Saxpy& load() {
    check(cudaDeviceSynchronize());
    check(cudaMemcpy(y, d_y, sizeof(float) * n, cudaMemcpyDeviceToHost));
    return *this;
  }
};

Why we went through all this trouble becomes clear if we put this in a test (I am using doctest syntax as an example).

// doctest is a single header; this define makes it generate main() for us.
#define DOCTEST_CONFIG_IMPLEMENT_WITH_MAIN
#include "doctest.h"

TEST_CASE("testing saxpy") {
  float a = 3.14;
  const int n = 1024;
  Saxpy s(n);

  s.fill().run(a).load();

  for (int i = 0; i < n; ++i) {
    // we didn't keep the old y values so we recompute them here
    float y_i = i / 56.78;
    // the approx is because floating point comparison is wacky
    CHECK(s.y[i] == doctest::Approx(a * s.x[i] + y_i));
  }
}

That is, by C++ standards, a pretty concise test. And indeed, our tests succeed. Yay.

===============================================================================
[doctest] test cases:    1 |    1 passed |    0 failed |    0 skipped
[doctest] assertions: 1024 | 1024 passed |    0 failed |
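
A short aside on the floating-point comparison in the test above. The GPU and the CPU may round $a x_i + y_i$ slightly differently (for instance, the GPU compiler may fuse the multiply and the add into a single instruction), so exact equality is fragile. The following is a hand-rolled sketch of a tolerance-based comparison in plain C++. The name `approx_equal` and the default tolerance are arbitrary choices for illustration, and this is not a description of how doctest::Approx is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hand-rolled comparison with a relative tolerance, for illustration only.
// The scale is clamped to 1 so that values near zero are compared with an
// absolute tolerance instead.
inline bool approx_equal(float a, float b, float rel_tol = 1e-5f) {
  assert(rel_tol >= 0.0f);
  const float scale = std::max(1.0f, std::max(std::fabs(a), std::fabs(b)));
  return std::fabs(a - b) <= rel_tol * scale;
}
```

In practice it is usually better to rely on the comparison helper that your test framework provides, as the test above does.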

A Final Improvement

Because this post is already too long I will conclude with one last really nice tip that I absolutely did not steal from here. Actually, the NVIDIA developer blogs contain a lot of really good CUDA tips.
Our current kernel is perfectly capable of adapting to a situation where we give it less data than the grid can support. However, if we give it more data than the grid covers, the extra elements are simply never processed. This is where grid-stride loops come in. They work by looping over the data one grid at a time while maintaining coalesced memory access (which is something I will write about next time).

Here's our new kernel using these kinds of loops.

__global__
void saxpy(int n, float a, float *x, float *y) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {
    y[i] = a * x[i] + y[i];
  }
}
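
To see why this loop covers every element regardless of the grid size, here is a host-side emulation in plain C++ that counts how often each index is visited. It is a sketch for illustration only: the nested loops stand in for the parallel hardware, and `grid_stride_visits` is a made-up name.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of a grid-stride loop, for illustration only.
// Returns how many times each of the n elements was visited.
inline std::vector<int> grid_stride_visits(int number_of_blocks,
                                           int threads_per_block, int n) {
  assert(number_of_blocks > 0 && threads_per_block > 0 && n >= 0);
  std::vector<int> visits(n, 0);
  // Total number of threads in the grid: blockDim.x * gridDim.x in the kernel.
  const int grid_size = number_of_blocks * threads_per_block;
  for (int block_idx = 0; block_idx < number_of_blocks; ++block_idx) {
    for (int thread_idx = 0; thread_idx < threads_per_block; ++thread_idx) {
      // Each emulated thread starts at its global index and then jumps
      // one whole grid ahead per iteration.
      for (int i = block_idx * threads_per_block + thread_idx; i < n; i += grid_size) {
        ++visits[i];
      }
    }
  }
  return visits;
}
```

With 2 blocks of 64 threads and $n = 1000$, the first thread handles elements 0, 128, 256 and so on, and every element is visited exactly once. The same holds when the grid is larger than the data.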

Conclusion

I hope this convinces you that GPU programming is actually pretty simple. The kernel here is pretty trivial, but as long as you understand that within the kernel you can basically write C/C++, you are going to do just fine.

If there is a next post I will write more about memory in GPUs, a very important topic if you want your code to actually run fast. If you want to skip ahead you should read about the different types of memory (global, local, shared, texture, etc.) and what memory coalescing entails.

Until next time.

Tuesday, March 21, 2017