Tuesday, March 21, 2017

Learn You a GPU For Great Good! (Part 1?)

Side note: I stole the title from the most famous, most awesome Haskell book I know.

If you are reading this blog you are most likely interested in cryptography. Today I want to convince you that GPUs are also, well, pretty awesome. I have personally done a few crypto-related projects using GPUs and this post is my attempt at crystallizing the knowledge and experience I built up during that time.

The purpose of this post is to provide a simple, meaningful introduction to developing GPU-accelerated programs. We will discuss setup, the two primary frameworks, basic code examples and development workflow as well as some optimization tips. In the end, I want to show that developing this type of application is not hard at all. If the post is successful I may do a follow-up with a few more detailed and tricky examples. Throughout this post I will assume you are familiar with basic C and/or C++, as the code examples will be in that language. I will not focus too much on developing complicated kernels or on how to exploit multi-dimensional parallelism; I will leave that for a later post. Instead, I will focus on a few things that may make your first steps towards GPU programming easier, as well as a few things that may help it scale a bit better.

The Why & When

GPU programming was originally designed for, and should be used for, large-scale parallel computation problems. The more parallelism you can utilize, the better GPUs will fit your problem. The simplest example is probably a loop over a very large collection of elements, performing a simple operation on each one independently.

For large-scale parallel computation problems I tend to think of three different architectural setups that you can use (they also mix). The simplest is utilizing multi-core CPUs (possibly over many machines). This has the shortest development time due to its familiarity and ease of use and is suitable for many applications. CPUs are of course trivially available. On the other end of the spectrum is the development of custom hardware clusters, utilizing many FPGAs or even ASICs. Development time is fairly long, even for experienced hardware designers; the upside is that this very likely gives you optimal performance.

GPUs fall somewhere in the middle. Development time is very close to that for CPUs; the primary constraint is availability. It is simply easier to get access to CPU clusters. However, these days you can also rent all the GPU power you need from Amazon EC2 instances, as was done for the recent SHA1 collision. If you solve the availability issue, you can get a lot of bang for your buck performance-wise.

The How

First, you need to get your hands on a machine with a GPU, preferably a remote machine or otherwise a machine with more than one GPU. The reason is that if your GPU is also driving your desktop environment, programming errors may cause your computer to hang or crash. A dedicated GPU also allows you to run long-lasting kernels more easily and gives you more reliable performance.

CUDA vs OpenCL

Assuming you have a GPU in your system, your next choice is between CUDA and OpenCL, two programming environments for GPU programming. If you do not plan to use an NVIDIA GPU you are stuck with OpenCL, whereas you otherwise have the choice of using CUDA.
Having used both for different projects at different times, I can say that both are perfectly usable and that the differences are mostly superficial. OpenCL is more portable and integrates more easily into existing projects; CUDA has the superior tool-chain.

The examples in this post will be for CUDA, as it typically involves less boilerplate. Also, we will use the more basic CUDA C++ implementation, as it provides a better basis for understanding than special-purpose libraries. This is particularly relevant if you want to perform computations that are not a native part of these libraries, which is definitely true if you want to, for instance, compute CPA-like correlations in parallel.

Hello World

I am not one to break tradition and thus we start with the "Hello world" of classic parallel programming, namely SAXPY. Or, more formally, given input vectors \(\textbf{x}, \textbf{y}\) of length \(n\) and a scalar \(a\), compute the output vector \(\textbf{z}\) where \(\textbf{z} = a\textbf{x} + \textbf{y}\).
First let us consider the basic C implementation of this function, where \(z = y\), i.e., we update \(y\) using the scalar \(a\) and a vector \(x\).

void saxpy(int n, float a, float * __restrict__ x, float * __restrict__ y) {
  for (int i = 0; i < n; ++i) {
    y[i] = a*x[i] + y[i];
  }
}

// ...
int n = 1 << 20;
// allocate vectors x,y of n elements each.
// ...
saxpy(n, 3.14, x, y);

Nothing too special going on here. We simply iterate over every element and perform our update with the scalar \(a=3.14\). Note the use of the __restrict__ keyword to indicate that x and y point to different objects in memory. Just giving the compiler a helping hand, which is generally a useful thing to do. Anything that makes it behave less like a random function, I say.

Conversion to CUDA is straightforward. In GPU programming you are always defining what a single parallel unit of computation is doing; this is called a kernel. When programming such a kernel, you are computing from the point of view of the thread. Before delving in too deep, let us see what the CUDA-equivalent code looks like.

 1: __global__
 2: void saxpy(int n, float a, float * __restrict__ x, float * __restrict__ y) {
 3:   int i = blockIdx.x*blockDim.x + threadIdx.x;
 4:   if (i < n) {
 5:     y[i] = a*x[i] + y[i];
 6:   }
 7: }
 9: // ...
10: const int n = 1<<20;
12: // allocate and initialize host-side buffers x,y
13: // ...
15: // allocate device-side buffers x,y
16: cudaMalloc((void **)&d_x, sizeof(float) * n);
17: cudaMalloc((void **)&d_y, sizeof(float) * n);
19: // copy host-side buffers to the device
20: cudaMemcpy(d_x, x, sizeof(float) * n, cudaMemcpyHostToDevice);
21: cudaMemcpy(d_y, y, sizeof(float) * n, cudaMemcpyHostToDevice);
23: // compute the saxpy
24: const int threads_per_block = 256;
25: const int number_of_blocks = n / threads_per_block;
26: saxpy<<<number_of_blocks, threads_per_block>>>(n, 3.14, d_x, d_y);
28: // copy the output buffer from the device to the host
29: cudaMemcpy(y, d_y, sizeof(float) * n, cudaMemcpyDeviceToHost);
31: // free the device buffers
32: cudaFree(d_x);
33: cudaFree(d_y);
35: // clean up the x, y host buffers
36: // ...

Let us consider the kernel first, denoted by the simple fact of the function definition starting with __global__. The parameters to the function are the same as before, nothing special there. Line 3 is a key first step in any kernel: we need to figure out the correct offset into our buffers x and y. To understand this, we need to understand CUDA's notion of threads and blocks (or work groups and work items in OpenCL).
The Grid
The CUDA threading model is fairly straightforward to imagine. A thread essentially computes a single instance of a kernel. These threads form groups called blocks that have somewhat-more-efficient inter-thread communication primitives. The blocks together form what is known as the grid. The grid can have up to three dimensions, i.e., the blocks can be ordered into \((x,y, z)\) coordinates. The same goes for threads inside a block, they can be addressed with \((x, y, z)\) coordinates as well.
Mostly though, I have tended to stick to 1-dimensional grids. This simply means dividing a vector of \(n\) elements into \(n/m\) sequential blocks of \(m\) elements each (even better if \(n\) is a multiple of \(m\)).
A quick note about warps (or wavefronts in OpenCL), which are a related concept. A warp is a unit of scheduling: it determines the number of threads that actually execute in lockstep. It is good practice to make your block size a multiple of the warp size, but other than that you should not worry overly much about warps.
In this case we find our thread's offset by multiplying the block id by the block size and then adding the offset of the thread within the block. The rest of the kernel is straightforward: we perform the same computation as in the original code but omit the for-loop. The conditional at line 4 makes sure we do not write outside the bounds of our vector, though that should not happen if we choose our grid carefully.

The rest of the code is the standard boilerplate that you will find in most CUDA programs. A key notion is that there is a distinction between buffers allocated on the device (the GPU) and buffers allocated on the host. Note that on line 26 we schedule the kernel for execution. The first two weird-looking parameters (within angle brackets) are the number of blocks and the block size respectively.

Improving & Testing "Hello World"

To showcase a few things that I found helpful we are going to improve this simple code example. And because this is my blog post and I decide what is in it, I get to talk to you about how to test your code. GPU code tends to be a bit flaky: it breaks easily. Thus, I argue that creating simple tests for your code is essential. These do not have to be very complicated, but I recommend that you use a proper framework for writing unit tests. For C++ I have had success with Catch and doctest, both single-header libraries that you include in your project.

Before we include these tests however, I propose that we make two more changes to the program. First of all, we are going to add better error checking. Most of the cudaFoo functions return a value indicating whether the operation was successful; on failure, that value tells us which error occurred.

#define check(e) { _check((e), __FILE__, __LINE__); }

inline cudaError_t _check(cudaError_t result, const char *file, int line) {
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s (%s:%d)\n", cudaGetErrorString(result), file, line);
    assert(result == cudaSuccess);
  }
  return result;
}

And then simply wrap the cudaFoo functions with this check macro. Alternatively, you may want to rewrite this to use exceptions instead of asserts. Pick your poison.

Another thing I would recommend adding if you are doing CUDA in C++ is wrapping most of the allocation and de-allocation logic in a class. I generally take a more utilitarian view of classes for simple pieces of code and thus the following is not necessarily idiomatic or good C++ code.

class Saxpy {
public:
  const int n;
  float *d_x;
  float *d_y;
  float *x;
  float *y;

  Saxpy(const int n) : n(n) {
    x = new float[n];
    y = new float[n];

    check(cudaMalloc((void **)&d_x, sizeof(float) * n));
    check(cudaMalloc((void **)&d_y, sizeof(float) * n));
  }

  ~Saxpy() {
    check(cudaFree(d_x));
    check(cudaFree(d_y));

    delete[] x;
    delete[] y;
  }

  Saxpy& fill() {
    for (int i = 0; i < n; ++i) {
      x[i] = i / 12.34;
      y[i] = i / 56.78;
    }

    check(cudaMemcpy(d_x, x, sizeof(float) * n, cudaMemcpyHostToDevice));
    check(cudaMemcpy(d_y, y, sizeof(float) * n, cudaMemcpyHostToDevice));

    return *this;
  }

  Saxpy& run(float a) {
    const int threads_per_block = 256;
    const int number_of_blocks = n / threads_per_block;
    saxpy<<<number_of_blocks, threads_per_block>>>(n, a, d_x, d_y);

    return *this;
  }

  Saxpy& load() {
    check(cudaDeviceSynchronize());
    check(cudaMemcpy(y, d_y, sizeof(float) * n, cudaMemcpyDeviceToHost));
    return *this;
  }
};

Why we went through all this trouble becomes clear if we put this in a test (I am using doctest syntax as an example).

TEST_CASE("testing saxpy") {
  float a = 3.14;
  const int n = 1024;
  Saxpy s(n);

  s.fill().run(a).load();

  for (int i = 0; i < n; ++i) {
    // we didn't keep the old y values so we recompute them here
    float y_i = i / 56.78;
    // the approx is because floating point comparison is wacky
    CHECK(s.y[i] == doctest::Approx(a * s.x[i] + y_i));
  }
}

That is a, for C++ standards, pretty concise test. And indeed, our tests succeed. Yay.

[doctest] test cases:    1 |    1 passed |    0 failed |    0 skipped
[doctest] assertions: 1024 | 1024 passed |    0 failed |

A Final Improvement

Because this post is already too long I will conclude with one last really nice tip that I absolutely did not steal from here. Actually, the NVIDIA developer blogs contain a lot of really good CUDA tips.
Our current kernel is perfectly capable of adapting to a situation where we give it less data than the grid can support. However, if we give it more data, things will break. This is where grid-stride loops come in. The idea is to loop over the data one grid at a time while maintaining coalesced memory access (which is something I will write about next time).

Here's our new kernel using these kinds of loops.

__global__
void saxpy(int n, float a, float *x, float *y) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {
    y[i] = a * x[i] + y[i];
  }
}


I hope this convinces you that GPU programming is actually pretty simple. The kernel here is pretty trivial, but as long as you understand that within the kernel you can basically write C/C++, you are going to do just fine.

If there is a next post I will write more about memory in GPUs, a very important topic if you want your code to actually run fast. If you want to skip ahead you should read about the different types of memory (global, local, shared, texture, etc.) and what memory coalescing entails.

Until next time.

Tuesday, March 14, 2017

What are lattices?

Do all the cool post-quantum cryptographers around you talk about lattices and all you can think about is a good salad?

Don't worry! You can be part of the hip crowd at the next cocktail party in just three easy steps!

First, I will tell you what a lattice is. It is actually really easy.
Second, we will prove a tiny fact about lattices so that those poor neurons that did all the work in the first step have some friends.
Third, nothing. There isn't even a third step. That's how easy it is!

So, let's tackle the first step. What, actually, is a lattice?
The answer could not be easier. It is simply a grid of points. Like this:

Easy, right?
The formal definition is just as simple:

Given $n$ linearly independent vectors $b_1, b_2, \ldots, b_n \in \mathbb{R}^m$, the lattice generated by them is defined as
$$\mathcal{L}(b_1, b_2, \ldots, b_n) = \left\{ \sum_{i=1}^{n} x_i b_i \;\middle|\; x_i \in \mathbb{Z} \right\}.$$

We can add those so-called basis vectors $b_i$ to our diagram and will readily see how each point is constructed.

And now you already know what a lattice is! In the next step we will talk about some more definitions and show the proof of a theorem.

We will be discussing something related to the question of what the length of the shortest non-zero vector in a given lattice $\mathcal{L}$ is.

$\lambda_1(\mathcal{L})$ is the length of the shortest non-zero vector in $\mathcal{L}$. We use the Euclidean norm for the definition of length.
We also need the volume of a lattice, which we are not going to define formally; we will just take it to be the volume of the $n$-dimensional parallelepiped spanned by the basis vectors, as in the picture:

Now we can already prove Blichfeldt's theorem! It makes a statement about where we can find a lattice point and roughly goes as follows:
For a lattice $\mathcal{L}\subseteq \mathbb{R}^n$ and a set $S \subseteq \mathbb{R}^n$ where the volume of $S$ is bigger than the volume of the lattice, there exist two distinct points $z_1, z_2 \in S$ such that $z_1 - z_2 \in \mathcal{L}$.
And here is the idea of the proof in a graphical way! First we simply draw an example for the set $S$ in blue:

Now we cut up our set and move all the parts into our first parallelogram!

As you can see there is quite a bit of overlap, and the fact that the full volume of the set is bigger than the volume of the lattice (aka the parallelepiped) guarantees there will be at least some overlap somewhere. But if there is overlap, then we have two points that have been moved by multiples of the basis vectors and which are now at the same position!

And therefore their difference must be a multiple of base vectors as well and we have found a difference which is a lattice point! Hooray!

Now we will use a little trick to expand this result and make a statement about an actual point from the set lying in the lattice, not just the difference of two points!

What we will show is called Minkowski's Convex Body Theorem and it states roughly
Let $\mathcal{L}$ be a lattice. Then any centrally-symmetric convex set $S$ with volume bigger than $2^n$ times the volume of the lattice contains a nonzero lattice point.
So after this we will know that such a set contains a lattice point, and using a simple sphere as the set allows us to put a bound on $\lambda_1(\mathcal{L})$.

Let's get to it then!
First we blow up our lattice to twice its size along every dimension!

Now we add our centrally-symmetric convex set $S$. Again in blue.

And because we picked the volume of $S$ to be bigger than $2^n$ times the volume of the lattice we still get the colliding points from the last theorem EVEN in the double size lattice!

But since our set $S$ is symmetric about the origin, if $z_2 \in S$ it follows that $-z_2 \in S$, and because it is convex, $z_1, -z_2 \in S$ implies $\frac{z_1 - z_2}{2} \in S$. And because we've doubled the size of the lattice and $z_1 - z_2$ is in the bigger lattice, it follows that $\frac{z_1 - z_2}{2} \in \mathcal{L}$, and we have shown that a point in our set is also in the lattice.

You can see it quite clearly in the next picture.

As already stated above if we use a simple sphere for this set we can give a bound on $\lambda_1(\mathcal{L})$ based on the volume of $\mathcal{L}$. It is known as Minkowski's First Theorem and states:
$$\lambda_1(\mathcal{L}) \leq \sqrt{n} \cdot \mathrm{vol}(\mathcal{L})^{1/n}.$$
Isn't that great?!

If you have now completely fallen in love with lattices but would appreciate real math instead of pictures, worry not!
You will find steps 3 - 14 on Oded Regev's course website! Enjoy!

And in case you accidentally run into him at a conference here is the picture from his website so you don't miss the opportunity to say hello.

Saturday, March 4, 2017

GMP - Arithmetic without limitations

Theoretical research in cryptography can be extremely exciting but, on the other hand, a lot of effort should be devoted to correctly and securely implementing cryptosystems that help protect people's privacy. The presence of a bug in software can be very annoying, but if that software deals with crypto and security, the consequences can be catastrophic.

For many cryptosystems based on number theoretic problems, one of the first issues that a developer faces is the size of the numbers involved in the computations. These numbers are composed of thousands of bits (or hundreds of digits) and they need to be added, multiplied, raised to some powers, etc. Performing these operations is complicated, and doing it efficiently is even more complicated. But thankfully there are tools (libraries) that come to the rescue.

In this post, I will talk about GMP, The GNU Multiple Precision Arithmetic Library. Quoting from the library's homepage, "GMP is a free library for arbitrary precision arithmetic, operating on signed integers, rational numbers, and floating-point numbers. There is no practical limit to the precision except the ones implied by the available memory in the machine GMP runs on. GMP has a rich set of functions, and the functions have a regular interface." This library is written in C but also offers an interface for C++ code, and we are going to focus on this particular one. The final goal of this post will be to implement a simple but complete example of an RSA cryptosystem, from key generation to encryption and decryption of messages.
But, before moving on, a little disclaimer: implementing cryptographic solutions is not an easy task, and using homemade solutions is generally considered a bad idea because of bugs and lack of scrutiny by other members of the community. The sole goal of this post is to show some of the potential of the GMP library.

  • Data type -- the basic data type that we will deal with is mpz_class. We can think of it as an arbitrary-sized integer. Some of the functions we will talk about only accept parameters of type mpz_t, which is the corresponding C type. Luckily, we can always call the function get_mpz_t on an mpz_class object to obtain the corresponding mpz_t object;
  • Generating random numbers -- first of all, we consider the problem of generating random numbers of a specific size (measured in bits). GMP's random number class offers a function called get_z_bits, which takes as input a number of bits and returns a random number of that size. But be careful! This means that the returned number takes up at most that many bits. So calling it with argument '3' means that the returned value will be a number between 0 and 7. Instead, what we want is a number that is represented by exactly 3 bits, so 4, 5, 6, or 7. Doing that is quite straightforward: given a number of bits k, we want a random number in the range $\left[2^{k-1}, 2^k\right)$. In order to handle this exponentiation (since $k$ can be very big), we are going to use the function mpz_ui_pow_ui, which raises an unsigned int base to an unsigned int power and writes the result to an mpz_t object. In order to draw a random number, we are going to initialize a random number generator with this code:
    gmp_randclass rng(gmp_randinit_mt);
    Note: this example uses the Mersenne Twister PRNG, which is known to be insecure for cryptographic applications. After this is done, our function for generating a random number with a specific number of bits can look like the following
    mpz_class generate_rand_exact_bits (const unsigned int size) {
        // 2^(size-1)
        mpz_class out;
        mpz_ui_pow_ui(out.get_mpz_t(), 2, size-1);

        // 2^(size)
        mpz_class bound;
        mpz_ui_pow_ui(bound.get_mpz_t(), 2, size);

        // random offset in [0, 2^(size-1))
        mpz_class rand = rng.get_z_range(bound-out);

        return (out + rand);
    }
  • Generating prime numbers -- generating random numbers is nice but often not sufficient; we need to generate random-looking prime numbers. This task is obviously harder because there are more constraints that have to be satisfied. Thankfully, GMP comes to the rescue once again: we can take advantage of the function mpz_probab_prime_p, which takes as input an mpz_t object that represents the candidate prime and a number that specifies how many times the Miller-Rabin primality test has to be executed, and returns 0 if the candidate is not prime, 1 if it is probably prime and 2 if it is surely prime. Repeating the test 50 times gives a very high degree of confidence in the primality of the candidate, so we are going to generate a random number and test it until we find a candidate that passes all the tests. The code for a function that performs this task can be the following
    mpz_class generate_prime_exact_bits (const unsigned int size) {
        mpz_class candidate;
        do {
            candidate = generate_rand_exact_bits(size);
        } while (mpz_probab_prime_p(candidate.get_mpz_t(), REPEAT_MILLER_RABIN) == 0);
        return candidate;
    }
  • Modular exponentiation -- the goal of this step is to take a base $b$, an exponent $e$ and a modulus $m$ and compute $b^e \bmod m$. GMP already offers a function called mpz_powm that does what we need, but we can "beautify" it a little bit by wrapping it in another function like the following:
    mpz_class mod_exp (const mpz_class base, const mpz_class exp, const mpz_class mod) {
        mpz_class out;
        mpz_powm (out.get_mpz_t(), base.get_mpz_t(), exp.get_mpz_t(), mod.get_mpz_t());
        return out;
    }
  • Encryption and decryption -- once we have the function mod_exp for modular exponentiation, RSA encryption and decryption are trivial:
    mpz_class rsa_encrypt (const mpz_class pk, const mpz_class m, const mpz_class mod) {
        return mod_exp (m, pk, mod);
    }

    mpz_class rsa_decrypt (const mpz_class sk, const mpz_class c, const mpz_class mod) {
        return mod_exp (c, sk, mod);
    }
Now, we will put everything together and in the main function we will do the following:
  1. generate two numbers rsa_p and rsa_q which are prime and composed of an arbitrary (here 2048) number of bits;
  2. multiply together rsa_p and rsa_q to obtain the modulus rsa_mod;
  3. calculate Euler's totient function of the modulus, rsa_totient, by multiplying (rsa_p - 1) and (rsa_q - 1);
  4. choose a public exponent rsa_pk: here we are going to use 65537;
  5. compute the secret exponent rsa_sk by calculating the multiplicative inverse of rsa_pk modulo rsa_totient. For this task, we can use the function mpz_invert offered by GMP;
  6. generate a random 32-bit message;
  7. encrypt it and decrypt it to show that the message is recovered correctly
The full code is available here. Assuming that it is saved in a file called main.cpp, it can be compiled with the following command:
g++ -Wall -o main main.cpp -lgmp -lgmpxx

Finally, here is a possible output of the program:

p: 23792449500522212951496314702038682121347194317959904876008718825662914075875899470581320800653437713536069408253411676817620646430212407540403369694134781759107125478056393908946127803961090931948553568336876469309822272887045095561051090700123699901361196085633529272849875353866567000752182158071024473204684307004694765510326847369963753315267894351158697231078788807690227802109570252088084877486698404206015111529669790467025281249212356222609928349702116676081721161733842190421613247070310376995600752548098115546539048072846637620536666595226297844398938704443953343779702008604938853732596401295126782941987

q: 29112805945439121371217246384724375309040486330485414315973357532238818896560861590487525988011993439077520550311524035984480970497774071062272700839286098068182559422588276697743896386738257138746524032717815962171954684796109782069096410172414189261142167684350607916360394902788649692558546060677368837462937413290555419196322614840219369972120722061013641236619226520826654042088554099043918184746705620357166310881550460726728460067235575345180739677700783783242815237362918214982959525838694224404066952951953880663045068810920848575666730532567285210572753317549500155052830379131918899635413209365556429526661

Modulus: 692664965275363134868164310308043386530889977137116066460956916982191344093600913377382983586757778122293014614580541617686926517513948142377215870615527742653194700493457784251321237254925462486660424748287684799960005285232140262663908204076330275297015032277805385557802277182249107190208144962072627103333702166712562912412455374055377823609059561406201023023953754548932692770545184532804691011451566435370564027281954154745605054059335653411252856827668944053539165109486066934327992280592181117681540307774187890207024741310917968286855939570200842820225004987124097822236170460663883584621535528990367892742357001025439624939219952367312627679661863266398010079250545044069312072321631423895274591504669908947731669841676290437961150479342566664751624244101084250756141019398564481030304316812644414619975245318775213481846709606949913834082310037698177108248900032520946078124914227912956069612730720184002438971635446900598624230868985779777067182408067527262398836477646176180149265564531633256045959807480886640631812690951578507074608608075924577570127277599939833799849839997452314926854023853061044555560190656563707831534988797823714162801693363417381512465247240619960211197273430707134124687895683306648515432815407

Totient: 692664965275363134868164310308043386530889977137116066460956916982191344093600913377382983586757778122293014614580541617686926517513948142377215870615527742653194700493457784251321237254925462486660424748287684799960005285232140262663908204076330275297015032277805385557802277182249107190208144962072627103333702166712562912412455374055377823609059561406201023023953754548932692770545184532804691011451566435370564027281954154745605054059335653411252856827668944053539165109486066934327992280592181117681540307774187890207024741310917968286855939570200842820225004987124097822236170460663883584621535528990367892742304095769993663604897238806225864622231475585749564760058562967711410339349194662834205744716004477795118079883111354725159048862414580186148948173567663370928851334497919810423614292621945066549280167717720521050364932649266758956452162536825639219086396668750961940935703957656300852919419991965254045660967825180303374046162336317566884059120678910850226498009948160851632383720333508904913956745247482616068631268540358255880854866759476646002336609572536933340525303598355554521449451080152039954160522951063655835325404680939946676605489966289587929410275548597966757698440898319397266934527673695987832220346760

Message: 2699077189

Ciphertext: 681377152725050864187889498396285584066530595533108416812276566730393956984527765118715661493472361295191987186484833223806930407591255557303290975217436524165953995313143542640632193731686512007342253568013661854913604989704338096439627987512846172004801196181232822308088685943243400345975426883909096147339045273787325984512823244957423398055174828695879849942493712502158108081782163024167101957978799078692054106581518638832961868215165854308314877242675212188837955873975174727647247990393606326403876490500815929414980002030276198625866328310973421532151059071628038306926281825502507926893439321472535762059281448571817796845306517579637703981694846774635322811199431875663174162751403156127612351378340405058549628085390124779083962094777489309709766752481986327599769351343828713992159924065615106068412838813886565270542810886921193120790895042994364186375596015716254331672407180666306582326345460867181975252903332881845979999145141098787286539117576791635044116865176716309063466576180971588107298159543491564320012605624880056428982408094160945322909734375254637560779872935691072801422766106105016077161030768880774211449289008906840655161792691129673110760644874000744631018536155622276779954270891913007605935768473

Decrypted: 2699077189

Monday, February 27, 2017

Side-channel analysis

For many years, cryptographers and industry implemented crypto algorithms on microcontrollers and ASICs simply by translating the algorithms into assembly or RTL code. The only thing that mattered was the correct functionality of the design. However, in the mid-to-late nineties some interesting attacks on cryptographic implementations were shown that were able to completely break their security and reveal secret keys, while the underlying cryptographic algorithm remained mathematically unbroken. First, the timing attack was introduced: the attacker was able to discover information based on the time a server took to respond to a query, since this time depended on the secret key that was used. The simplest application of this targets the square-and-multiply algorithm used for RSA exponentiation. For every bit of the exponent, either a square, or a square and a multiply operation is performed. This results in different execution times for different exponents.

Kocher et al. introduced a new type of side-channel attack by revealing that information obtained from power measurements can also be used to extract secret values from devices such as smart cards. We start by assuming that the power signal consists of random noise plus a deterministic component dependent on the processed value $x$:

$P = L(x) + R$

The random noise $R$ is modeled as a Gaussian distribution with zero mean, and the deterministic, key-dependent value $L(x)$ is usually the Hamming weight or Hamming distance of the processed value. Next, we choose an intermediate value to target. A good target for symmetric designs is the S-box output, since a small Hamming-weight difference in the input of the S-box leads to a not-so-small Hamming-weight difference in the output. In the original work the target was simply a one-bit value that could be either one or zero. The simplest example thus uses one output bit of the S-box, with the assumption that the power consumption differs between a one and a zero value. For example, take one S-box operation of the first round of AES. It is calculated using the equation below

$SO_i = S(pt_i \oplus key_i)$

Since we do not know the correct key value, we construct key guesses and calculate the targeted value, e.g. the first bit of the S-box output. For every key guess, we separate the power traces into two groups: one containing the traces for which the key guess produces target value one, and one where the target value is zero. Lastly, we compute the average of each group, reducing the influence of the noise, and subtract the two averages. If our key guess is wrong, the difference of means will be negligible, since the power traces in the two groups do not correspond to the correct values and are thus uncorrelated. The correct key guess, however, will produce a noticeable spike at the point in time where the first bit of the S-box output is computed.
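
The whole attack can be simulated in a few lines. The sketch below builds the AES S-box from its textbook definition (field inverse plus affine map, rather than hard-coding the table), models each power sample as a noisy Hamming weight, and ranks key guesses by the difference of means; the noise level, trace count, and leakage model are simplifying assumptions:

```python
import random

def gf_mul(a, b):
    # Multiplication in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def gf_inv(a):
    # Brute-force field inverse (fine for a demo over 256 elements)
    return 0 if a == 0 else next(x for x in range(256) if gf_mul(a, x) == 1)

def affine(a):
    # The AES affine transformation applied to the field inverse
    res = 0
    for i in range(8):
        bit = ((a >> i) ^ (a >> ((i + 4) % 8)) ^ (a >> ((i + 5) % 8))
               ^ (a >> ((i + 6) % 8)) ^ (a >> ((i + 7) % 8))) & 1
        res |= (bit ^ ((0x63 >> i) & 1)) << i
    return res

SBOX = [affine(gf_inv(x)) for x in range(256)]  # e.g. SBOX[0x53] == 0xED

def hw(x):
    return bin(x).count('1')

random.seed(1)
secret_key = 0x2B
plaintexts = [random.randrange(256) for _ in range(3000)]
# One simulated power sample per trace: Hamming weight of the true
# S-box output plus Gaussian noise
traces = [hw(SBOX[pt ^ secret_key]) + random.gauss(0, 0.5) for pt in plaintexts]

def difference_of_means(guess):
    # Partition traces by the predicted first bit of the S-box output
    ones = [t for pt, t in zip(plaintexts, traces) if SBOX[pt ^ guess] & 1]
    zeros = [t for pt, t in zip(plaintexts, traces) if not SBOX[pt ^ guess] & 1]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)

recovered = max(range(256), key=difference_of_means)
print(hex(recovered))  # the correct key byte produces the largest spike
```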

This very simple yet very effective attack opened a new area of research. Soon, more elaborate attacks were devised, attacking bytes instead of bits using correlation. More advanced attacks correlate with the variance instead of the mean, or with even higher statistical moments; these are usually impractical for orders higher than two, since the noise makes it much harder for the attacker to retrieve the secret value. Profiling of the device is sometimes performed in order to derive a more accurate model of the deterministic leakage function; such attacks are known as template attacks. Designers also came up with various ways of trying to thwart these attacks, such as increasing the noise level by adding redundant circuitry or operations, and shuffling the order of operations. Other methods include secure logic design styles such as WDDL, where logic cells are designed to consume the same amount of power regardless of their output. Methods on the algorithmic level, e.g. Threshold Implementations, try to ensure that no information leaks during the execution of the algorithm, regardless of the underlying leakage model of the implementation technology.

Since classical DPA and correlation analysis involve constructing guesses for all possible key bytes or nibbles, they can be very time consuming. This is where leakage tests became prominent. They tell whether a set of traces acquired with randomly chosen inputs can be distinguished from a set of traces acquired with a fixed input value, and work in a very similar way to one-bit DPA. Two sets of traces are obtained, one with random inputs and one with fixed inputs, and Welch's t-test is used to measure whether the two sets can be distinguished from one another. If $\mu,\ s$, and $n$ are the mean, standard deviation, and cardinality of a set, the t-value is calculated as follows:

              $  t = \frac{\mu_0 - \mu_1}{\sqrt{\frac{s_0^2}{n_0} + \frac{s_1^2}{n_1}}}$

Approximating the statistical degrees of freedom with $n = n_0 + n_1$, we can claim with confidence 0.99999 that the two sets are distinguishable when $|t| > 4.5$. This type of leakage test allows designers to be more efficient, since it reduces the testing time of the design. The downside of the approach is that leakage tests do not reveal how the secret values can be extracted, only that an attacker may be able to do so.
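
The t-statistic above is straightforward to compute. Here is a minimal sketch on simulated fixed-vs-random trace sets; the leakage offset and noise model are illustrative assumptions:

```python
# A direct implementation of the Welch t-statistic from the formula
# above, applied to a simulated fixed-vs-random leakage test.
import random
from statistics import mean, variance

def welch_t(set0, set1):
    m0, m1 = mean(set0), mean(set1)
    v0, v1 = variance(set0), variance(set1)  # sample variances s^2
    return (m0 - m1) / ((v0 / len(set0) + v1 / len(set1)) ** 0.5)

random.seed(0)
# Fixed-input traces leak a small constant offset; random-input traces don't.
fixed = [0.3 + random.gauss(0, 1) for _ in range(5000)]
rand = [random.gauss(0, 1) for _ in range(5000)]
t = welch_t(fixed, rand)
print(abs(t) > 4.5)  # |t| above the threshold -> leakage detected
```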

In the CHES 2016 best paper, Differential Computation Analysis: Hiding Your White-Box Designs is Not Enough, side-channel analysis was successfully used to break many white-box implementations. This time the analysis was performed on software execution traces (recorded memory accesses) instead of power traces, but the principle remained the same. Considering that white-box cryptography is currently being widely considered for many software-based designs, together with the plethora of IoT devices that require protection, side-channel analysis will remain an important aspect of the design process of cryptographic implementations. Researchers will thus continue to improve leakage detection and key recovery methods, further reducing the cost and time to market of new devices.

Sunday, February 19, 2017

Differential privacy

These days one reads a lot about the right to privacy. But what is it, and how does it differ between the real world and its digital description? Briefly, it is a person's right to have and, more importantly, maintain control over information about oneself, which is fundamental to a human's freedom of self-determination. In the digital context one has seemingly lost this right in exchange for free services: handy tools that satisfy people's urge to communicate with each other and stay in touch and up-to-date at all times. Since more and more people realize that behind such apps and web pages there are, naturally, business models, society increasingly demands to reclaim control over collected personal data that has been provided, unsolicited or involuntarily, to online merchants or telephony service providers at the time of use, for example. On the other hand, the corporations offering these services see collecting user information in databases as crucial: it lets them build good products and tailored recommendations and, especially in the age of big data and machine learning, make precise predictions by evaluating various functions on the data.

Statistical database queries have been studied for quite some time now, and in fact it often turns out to be sufficient to allow query access only to a population's aggregate data, not individual records, to derive useful statistics and achieve the desired functionality! A common approach is to allow only aggregate queries (i.e. range, histogram, average, standard deviation, ...) and, rather than returning exact answers about sensitive data fields, to specify intervals or give imprecise, statistically noisy counts.

You might have asked yourself, don't cryptographic techniques that have been omnipresent in this blog such as FHE (Fully Homomorphic Encryption) or secure MPC (Multi-Party Computation) solve this problem? Wouldn't it be possible to encrypt the user data yet for the service provider to compute useful statistics on it?
Indeed, it could be realized with the general FHE/MPC toolkit, but it is currently too inefficient to operate at the scale where statistics over huge quantities of data yield useful statements about a given database. Hence specific, more slender tools have been designed to close this gap. Whereas FHE avoids a trusted third party to compute (arbitrary functions or sophisticated statistics) on users' sensitive data, here one typically explicitly allows a trusted third party to collect and aggregate data in a privacy-preserving fashion. Users might do so, e.g., when installing an app, arguing that it benefits them through good default settings or an overall performance gain; or sharing information might simply be a requirement for using the service for free in the first place.

Differential privacy (often abbreviated DP) is a framework for formalizing privacy in statistical databases. It can protect against so-called de-anonymization techniques that try to identify an individual record by linking two separately released databases that have been stripped of (quasi-)identifiers and look innocuous. In particular, a priori knowledge or known partial history can be leveraged to derive more information from a released "anonymized dataset" than the purpose it was originally intended to serve.

Let's look at a mathematical definition that captures and formalizes the notion of privacy and has been studied in cryptography over the past 10 years. Let $d, n$ be positive integers and $f: X^n \rightarrow \mathbb{R}^{d}$ some statistic on a database comprised of $n$ records.

An algorithm $\mathcal{A}: X^n \rightarrow \mathbb{R}^{d}$ that computes $f$ is said to be $(\epsilon, 0)$-differentially private ($\mathcal{A}$ is $\epsilon$-DP, for short) if for all neighboring databases $x_{1} \sim x_{2}$ (i.e. $x_{1} \neq x_{2}$ differing in just one record) and all subsets $S \subseteq \mathbb{R}^{d}:$
$$\mathbb{P}[\mathcal{A}(x_{1})\in S]\leq e^{\epsilon } \cdot \mathbb{P}[\mathcal{A}(x_{2})\in S]$$
holds. Looking more closely at this definition $\forall x_{1}, x_{2} \in X^n, x_{1} \sim x_{2}:$
$$\mathbb{P}[{\mathcal{A}}(x_{1})\in S]
\leq e^{\epsilon } \mathbb{P}[{\mathcal{A}}(x_{2})\in S] \Leftrightarrow \frac{\mathbb{P}[{\mathcal{A}}(x_{1})\in S]}{\mathbb{P}[\mathcal{A}(x_{2})\in S]} \leq e^{\epsilon }\\ \Leftrightarrow \log \left(\frac{\mathbb P[\mathcal{A}(x_{1})\in S]}{\mathbb{P}[{\mathcal{A}}(x_{2})\in S]}\right) \leq {\epsilon}$$
we can identify the so-called "privacy loss" of an algorithm (in this context often called a mechanism) $\mathcal{A}$. In this setting $\epsilon$ can be called the privacy budget. In less exact terms it captures the following: by specifying the privacy budget, it is possible to control the level of privacy and make an algorithm respect this additional constraint via the techniques introduced below.
For those familiar with the concept of max-divergence, the definition of privacy loss is in fact the definition of $$D_\infty( A(x_1) || A(x_2)) := \max_{S\subseteq {\textbf supp}(A(x_1))} \log \left(\frac{\mathbb P[\mathcal{A}(x_{1})\in S]}{\mathbb{P}[{\mathcal{A}}(x_{2})\in S]} \right).$$
Furthermore, the multiplicative factor $e^\epsilon$ can be -- using a common approximation for small $\epsilon<1$ -- viewed as $1+\epsilon$:
$$e^\epsilon = \exp(\epsilon) = 1 + \epsilon + \frac{\epsilon^2}{2!} + \dots \approx 1 + \epsilon.$$
In less formal terms this definition says that a given result is approximately the same whether it is computed using the first or the second, neighboring database.
A more general definition, which adds flexibility -- but also makes the proofs less elegant and more technical -- is $(\epsilon, \delta)$-differential privacy, which holds when
$$\mathbb P[{\mathcal {A}}(x_{1})\in S]\leq e^{\epsilon } \cdot \mathbb P[{\mathcal {A}}(x_{2})\in S] + \delta.$$
Interpreting the definition, the goal of DP is that the risk of violating one's privacy should not substantially increase as a result of either appearing in a statistical database or not. Thus an analyst should not be able to learn any information about a record (e.g. from participating in an online questionnaire) that could not have been learned if one had opted not to participate, or had answered the questions randomly by rolling a die or flipping a coin rather than answering truthfully.
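
The coin-flipping idea in the last sentence is in fact a differentially private mechanism in its own right, known as randomized response. A short sketch (the population and seed below are arbitrary illustrative choices):

```python
# Classic randomized response: each respondent flips a coin, answers
# truthfully on heads, and otherwise answers with a second fair coin
# flip.  A "yes" is then reported with probability 3/4 by true
# yes-respondents and 1/4 by true no-respondents, so the mechanism is
# ln(3)-DP -- yet the analyst can still unbias the aggregate count.
import random

def randomized_response(truth):
    if random.random() < 0.5:     # first coin: heads -> answer truthfully
        return truth
    return random.random() < 0.5  # tails -> answer with a second coin

def estimate_fraction(reports):
    # E[report] = 1/4 + truth_fraction/2, so invert that relation
    return 2 * (sum(reports) / len(reports)) - 0.5

random.seed(42)
population = [True] * 300 + [False] * 700   # 30% true "yes"
reports = [randomized_response(t) for t in population]
print(estimate_fraction(reports))  # an unbiased estimate, typically near 0.30
```

No individual report reveals much about its respondent, yet the group statistic survives -- exactly the utility/privacy trade-off DP formalizes.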

To overcome the fundamental challenge -- the trade-off between utility of the data (or accuracy of returned answers) and privacy of the records -- the goal is to learn as much as possible about a group's data while revealing as little as possible about any individual within the group. Transforming an algorithm into a DP-algorithm requires probabilistic tools. The sensitivity of a function $f$ is a good measure of how much statistical noise is needed to mask an answer: $$\Delta f=\max_{x_1 \sim x_2} ||f(x_{1})-f(x_{2})||_{1} = \max_{x_1 \sim x_2} \sum_{i=1}^d |f(x_{1})_i-f(x_{2})_i|.$$ Low sensitivity of a function, i.e. a small change of output given two neighboring inputs, allows adding statistical noise to achieve privacy without destroying utility. The Laplace mechanism adds noise from the Laplace distribution $\mathcal{L}(\lambda)$, i.e. noise $\eta(x)\propto \exp(-|x|/\lambda)$, which has zero mean and scale $\lambda$. Substituting the Laplace noise with other probability distributions, such as a zero-mean Gaussian, is possible but changes the proof details.
A typical construction now is, instead of computing $f(x)$ directly, to compute $\mathcal{A}(x) = f(x) + \eta(x)$ and obtain an $\epsilon = \frac{\Delta f}{\lambda}$-DP algorithm, since even in the worst case the privacy loss between two neighboring databases does not exceed this $\epsilon$. In terms of composability, sequential composition of differentially private steps leads to the sum of all occurring $\epsilon_i$, while parallel composition leads to the maximum of all occurring $\epsilon_i$. This makes it possible to turn algorithms into DP-algorithms efficiently.
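
As a small illustration of the Laplace mechanism, consider a counting query, whose sensitivity is $\Delta f = 1$; the records and $\epsilon$ below are arbitrary illustrative choices:

```python
# Sketch of the Laplace mechanism for a counting query.  A count has
# sensitivity Delta f = 1 (adding or removing one record changes it by
# at most one), so adding Laplace noise of scale 1/epsilon yields an
# epsilon-DP answer.
import random

def laplace_noise(scale):
    # Sample Laplace(scale) as the difference of two exponentials
    # (standard library only, no numpy needed)
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def dp_count(records, predicate, epsilon):
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)  # sensitivity 1

random.seed(7)
ages = [23, 35, 41, 29, 52, 60, 31, 47]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=0.5)
print(noisy)  # the true count 4, masked by Laplace noise of scale 2
```

Asking the same query repeatedly averages the noise away -- which is exactly why sequential composition charges the budget $\epsilon$ once per query.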

These basic constructions, including detailed proofs, and much more were covered during the 7th BIU Winter School on Cryptography, "Differential Privacy: From Theory to Practice", featuring speakers who defined the field and have contributed to it for roughly 10 years. The slides are already online and video recordings are about to appear on the webpage.
Furthermore, relationships of DP to various, related scientific fields ranging from statistics to machine learning and finally game theory were explored.

Concluding, in the wake of growing awareness of privacy issues in the digital domain, together with stricter interpretation of legislation and the possibility of satisfying most interests with anonymized data anyway, several big players strive to provide differentially private collection of data. Some companies market themselves as quasi-pioneers in privacy topics, and for good reasons: it pays off to be perceived as the first one; they would face various problems in the near future anyway if they did not respect these issues; and most importantly, they can continue their business model of creating value from users' data. The more information is queried from the database, the more statistical noise has to mask the correct answer in order to meet a predefined privacy budget bound. This total allowed, justifiable privacy leakage can be specified in the number of admissible queries or in the answer accuracy.

Provable cryptography avoids the situation of mere obfuscation that can be undone by a clever enough attacker or strategy -- given that the security assumption holds -- and provides bounds, and thus a guideline, on how to choose parameters to guarantee a desired level of privacy. Algorithms invest a given privacy budget at privacy-critical steps. With this in mind, differential privacy is an additional design paradigm to keep in mind for cryptographic algorithms and protocols.

I'd like to end this cryptologic point of view on achieving privacy goals on the internet as I started: with a fundamental sociological question. One thought that stands out is: shall we collect so much data in the first place? Is it really necessary for online merchants to predict individuals? Do we want this ubiquitous tracking? As with other advanced technologies whose long-term effects cannot be predicted, perhaps with aggregating big data and tracking, too, the only winning move is not to collect the data in the first place.

Friday, January 27, 2017

Symmetric Key Ciphers for FHE

In recent years, small devices with limited storage and computing power have acquired more and more popularity. This phenomenon, together with the appearance of cloud services that offer extensive storage and high computing power, has made computation outsourcing a key application to study. This setting, where a user with limited resources stores data on a remote server and also asks this server to make some computation on his behalf, comes together with new challenges and concerns related to privacy and security.

A simple scenario

The user Alice encrypts her data under a homomorphic encryption scheme and sends the ciphertexts to the server (e.g. the cloud). The server then computes some function (e.g. the mean value) on Alice's encrypted data and sends back an encryption of the result, which Alice can decrypt with the appropriate key. Despite its simplicity, this model still requires a lot of effort from the user: all the FHE schemes we know of have very large keys and a heavy encryption overhead. The most natural approach is then to free the user from the burden of complex homomorphic computations by asking her to do only the key generation, the symmetric encryption, and the final decryption, while leaving all the rest to the powerful server.
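
To make the flow concrete, here is a deliberately insecure toy sketch: the "scheme" below (message plus a random multiple of a secret modulus) happens to be additively homomorphic and only illustrates the message flow of the scenario; a real deployment would use an actual FHE scheme such as BGV or BFV.

```python
# Toy sketch of the outsourcing flow: Alice encrypts, the server computes
# on ciphertexts only, Alice decrypts the result.  The scheme is NOT
# secure -- it exists purely to show who does what.
import random

class ToyHE:
    def __init__(self, modulus=2**61 - 1):
        self.q = modulus                      # Alice's secret
    def encrypt(self, m):
        return m + random.randrange(1, 2**20) * self.q
    def decrypt(self, c):
        return c % self.q                     # valid while messages stay < q

# Alice encrypts her data and ships it to the server
alice = ToyHE()
data = [12, 7, 30, 51]
ciphertexts = [alice.encrypt(m) for m in data]

# The server computes on ciphertexts only: an encrypted sum
encrypted_sum = sum(ciphertexts)  # homomorphic addition = integer addition

# Alice decrypts the result and finishes the (non-homomorphic) division
avg = alice.decrypt(encrypted_sum) / len(data)
print(avg)  # -> 25.0
```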

A hybrid framework

The user Alice generates a secret key $sk^S$ for a symmetric encryption scheme, a pair $(pk^H, sk^H)$ for a homomorphic public key encryption scheme, and sends $pk^H$ to the server together with an encryption of $sk^S$ under $pk^H$. Then, Alice encrypts her data under the symmetric encryption scheme and sends the resulting ciphertexts to the server, which homomorphically evaluates the decryption circuit of the symmetric encryption scheme, thus obtaining ciphertexts under the homomorphic scheme. This last operation can be seen as a sort of bootstrapping (from one scheme to another one), but does not involve any circular security assumption. At this point, the server can homomorphically compute a certain function $f$ on the resulting ciphertexts and send back the result, encrypted under the homomorphic encryption scheme. Alice can finally recover the result thanks to her $sk^H$.

Despite numerous improvements in recent years, FHE remains highly impractical, and one of the main problems is its limited homomorphic capacity. All the FHE schemes we know of contain a noise term that grows with homomorphic operations; the homomorphic capacity corresponds to the amount of computation that can be performed before having to "refresh" the ciphertext via the expensive operation of bootstrapping. It is thus of vital importance to manage the noise growth during homomorphic operations.

State-of-the-art ciphers

AES is a natural benchmark for FHE implementations. The homomorphic evaluations of some lightweight block ciphers such as SIMON have been investigated by Lepoint and Naehrig as well. It turns out that there is much room for improvement in optimizing the multiplicative depth. In this direction, many dedicated symmetric key ciphers with low-noise ciphertexts have been proposed.

  • In 2015, Albrecht et al. presented a flexible block cipher, LowMC, which is based on an SPN structure and comes with $80$-, $128$-, and $256$-bit key variants. The design philosophy is to minimize the multiplicative complexity, which is most relevant in the multi-party computation context; it also has a very low multiplicative depth compared with existing block ciphers.

    Later, Dinur et al. mounted optimized interpolation attacks against some instances of LowMC. More precisely, some $80$-bit key instances could be broken $2^{23}$ times faster than exhaustive search, and all instances claimed to provide $128$-bit security could be broken about $1000$ times faster than exhaustive search. These attacks show that those instances do not provide the expected security level; the cryptanalysis also led to a revised version of LowMC.
  • Kreyvium, designed by Canteaut et al., is a stream cipher aiming at $128$-bit security. It is based on the lightweight stream cipher Trivium, which has $80$-bit security. Trivium and Kreyvium both achieve a multiplicative depth similar to that of the LowMC instances; they have a smaller latency than LowMC, but a slightly smaller throughput.
  • The FLIP family of stream ciphers proposed by Méaux et al. has the lowest depth of the ciphers above. Several versions are provided, including $80$-bit and $128$-bit security instantiations. The main design principle is to filter a constant key register with a time-varying public bit permutation, which allows for small and constant noise growth. However, FLIP requires a much higher number of ANDs per encrypted bit than the above ciphers.

To sum up, blending symmetric ciphers and FHE is definitely an interesting idea that could prove fundamental for bringing homomorphic encryption to the real world.

The other fellows also made contributions to this blog post.

Monday, January 16, 2017

WhatsApp backdoor? Should I be worried?

WhatsApp Backdoor!!! 

Let's start this blog entry with that title: it is very easy to get readers' attention (especially in the security community) with big bold letters claiming that there is a backdoor. I say this because that is exactly what happened on January 13th (Friday the 13th -- maybe it is correlated) when The Guardian released the news. Afterwards, security websites and the wider community started a big debate about it. In fact, some hours after the Guardian's release, other websites such as Gizmodo and the Open Whisper Systems blog started saying that it isn't a backdoor.

Well, we will draw some conclusions about the subject later. First, let's explore it in more detail.

What is a backdoor?

If we talk about computer systems or cryptography, a common definition is: "A backdoor is a method, often secret, of bypassing normal authentication in a product, computer system, cryptosystem, algorithm, etc. Backdoors are often used for securing unauthorized remote access to a computer, or obtaining access to plaintext in cryptographic systems." In other words, it is an easy way to obtain access.

Are backdoors common?

Unfortunately, yes -- they appear in the news and in computer systems more often than we would like. Moreover, after the Snowden revelations the term backdoor has become very common outside academia. However, that does not mean that people have become more worried.

What is the real problem?

Perfect, we have learned about backdoors. Now, let's explain the problem with WhatsApp. Consider a hypothetical case: imagine that for some reason (it fell into a lake, you were abducted by E.T. and forgot your phone in the spaceship) you no longer have access to your smartphone. Fortunately, you can recover your phone number, buy a new smartphone, and reinstall WhatsApp. When you reinstall WhatsApp, however, it generates a new key, and now there is the question of what the app should do with the messages you received while you were offline. The app could simply drop those messages and never deliver them, or it could ask your contacts to re-encrypt them with the new key. According to the Guardian, this second approach is the backdoor:
"WhatsApp’s implementation automatically resends an undelivered message with a new key without warning the user in advance or giving them the ability to prevent it."

I would personally say that it is a mistake that WhatsApp merely shows a notification that the key changed and resends automatically.

Is it exploitable?

I will say: yes, it is.
Let's see how it can be exploited.

In fact, it consists in exploiting another vulnerability, present in the SS7 protocol. Signalling System No. 7 (SS7) is a set of telephony signaling protocols developed in 1975, used to set up and tear down most of the world's public switched telephone network (PSTN) telephone calls. It also performs number translation, local number portability, prepaid billing, Short Message Service (SMS), and other mass-market services.

The SS7 vulnerabilities have been known since 2008; however, they only got media attention in 2014, when Tobias Engel showed at CCC how to exploit them. You can see exactly how in this link. In the video he shows that it is possible to take over another cellphone's number. In the same congress, right after Tobias, Karsten Nohl showed how to read other people's SMS; the talk can be followed in this link.

What is the SS7 vulnerability?

Let's look very quickly at the structure of a GSM network, shown in the next figure.

So, if you are inside the SS7 network, you can simply start listening to the traffic and, as we saw in the videos, it is possible to change things such as your number.

Since it is possible to change your number, and WhatsApp's verification can be done by a call or an SMS, what happens if I gain access to the SS7 network, change my number to any other number, reinstall WhatsApp, and pass the verification?

According to WhatsApp's policy, my smartphone will generate a new key and all new messages will be encrypted with this new key. I did not create a new attack here; it has been done before, as this YouTube video shows:

It is fair to say that this attack can be used against other messengers as well. However, other apps have a different policy; e.g. Signal drops the messages until you verify the new key.

If you are interested in how different messengers handle this key problem, there is a nice blog entry on Medium.


The news from the Guardian is not a real backdoor; it is a known problem of key handling, and the story was more like click bait. However, the good thing about this whole story is the discussion it sparked in different communities, and the alert to people to take care of their privacy.

All of this prompts some social and technical questions, such as:
Should we accept news from the media as the truth?
Why on earth do we continue to use a system from 1975?
Can we create a better way to handle keys?

In addition, if you are looking for more privacy in your life, you should check Privacy Tools, in particular the section about "Encrypted Instant Messengers".