ECRYPT-EU: December 2015

Friday, December 18, 2015

Codes and Coding

Dear readers, hi fellows,
as my master's thesis was about Linear Codes and Applications in Cryptography I'd like to provide some content of it in a nutshell and to give some recommendations of software that I used in the course of writing it.

multiple scientific fields are involved (created using TikZ)

Modern cryptography covers among others the three areas privacy, authentication and integrity of messages. The aim of coding theory on the other hand described by one word is reliability - adding redundancy to messages in order to detect transmission errors. Both fields cover seemingly different areas in the field of digital information processing.

The motivation of my thesis was to showcase the combination of multiple scientific fields within coding theory and modern cryptography and to explore how they interact in the domain mentioned above and why they are a good choice.

In the center of attention stood the McEliece public-key cryptosystem, it's variation (Niederreiter's cryptosystem) and generalizations to other code classes after those code classes of interest were introduced and discussed.

The McEliece public-key cryptosystem was published in 1978, is based on disguising a special code as a general code and it is of interest back then and today because it seems to stand solid in the so-called post-quantum era, when a powerful quantum computer is available to attackers. More than 35 years ago the original code parameters: codeword length $n = 1024$, code dimension (or message bits) $k = 524$ and error-correcting capability $t = 50$ were proposed for the binary code underlying the McEliece public-key cryptosystem for $2^64$ bit security. To achieve the a reasonable security level today they have been adjusted to $n = 6960, k = 5413, t = 119$ because of attacks that were explored since then.

Given the number $t$ of erroneous positions and a general code $C$, the task of decoding $y = c + e$ is to find a codeword $c \in C$ of Hamming weight or number of non-zero entries: $w(y-c) = w(e) = t$. In the complexity theoretic literature the result is that this problem is NP-complete - there is no algorithm known that fulfills this task in polynomial time depending on the input size (length of $y$ respectively $c$).

The choice to use a Goppa codes as private key, which is multiplied by permutation matrices to obtain the public key looking as if it was a general code, was due to performance reasons when decoding messages thanks to Patterson's algorithm.

Interestingly many attacks are not applicable to the class of binary Goppa codes that McEliece's scheme was based on but on similar schemes and generalizations allowing other alphabets or other structures - binary Goppa codes were seemingly proposed with foresight.

In spite of some advantages and the - so far - quantum computer resistant attacks the biggest drawback of public-key cryptosystems based on codes are rather long public keys compared to other schemes. If one adds a security margin, under the assumption of a working quantum computer performing the general attack through the application of Grover's algorithm, the public key size is as big as 1 MB.

In the thesis I implemented the fast algorithm (by N. J. Patterson) for algebraically decoding binary Goppa codes in Sage and demonstrated the full functionality of the system in yet another chapter giving an example. Since I was writing it all up in $\LaTeX$, using sagetex even allowed the computation and insertion of the results into the $\verb!.tex!$ file one the fly when the document is being compiled!

My recommendation (to my fellows) for researching, analyzing, developing & testing of algorithms in a structured manner with the ability to create nice plots and documents rather easily is to give Sage or Spyder a try. Even though it might not be a necessity in your research to actually code something, playing around with these tools already available is instructive and worth a try.

Sage is available online or offline via installation and it is a good choice if you need access to i.e. number theory. Maybe the aforementioned Spyder is preferable if a somewhat more automated handling even closer to the programming language Python of the implemented primitive at hand is needed in your case.

Autumnal Bochum around Oct 31

As usual, a little plurivalent:
Happy Coding and Happy Halloween - because $Dec(25) = Oct(31)$, right?!

Tuesday, December 15, 2015

Spring School on Symmetric Cryptography

We are happy to announce the ECRPYT-CSA

Spring School on Symmetric Cryptography

The Spring school will be held in Bochum, Germany. Bochum is located within the famous "Ruhr Area", Germany's former primary location for coal mining and steel industry. Although most coal mines have been closed, the hidden beauty of these old days has been preserved in many places.

The Spring school takes place right before FSE 2016.

The goals of this school is to provide academia and industry with some of the necessary background to understand symmetric cryptography. On top, several recent developments in symmetric cryptography design and implementations will be presented.

Sunday, December 13, 2015

Post-Snowden Cryptography (PSC - Brussels)

On December 9 and 10, the Post-Snowden Cryptography Workshop took place in Brussels. The main topic of the discussion was cryptography and privacy after the revelations from Edward Snowden.

Day one

In the first day of the workshop, we learned about Qubes OS and how it works from Joanna Rutkowska. In the second talk, Kenny Paterson presented some disturbing facts about the NSA's capabilities from Snowden's revelations such as a rumoured "real-time breakage of RC4" and "ability to break certain unnamed VPN products". Also, he gave some solutions for the new problems of surveillance such as building cryptographic implementations that are "strong, fast and developer-friendly." Next, Phil Zimmermann gave a speech and answered questions. In the last talk of the first day, Jon A. Solworth presented the Ethos Operating System and how his group built the OS. He explained some important features of the OS such as how its networking is "encrypted, cryptographically authenticated, and authorized."

Day two

On the second day, Christopher Soghoian spoke about the revelations of the NSA and how the FBI uses this information, i.e. how law enforcement is the "little brother" that does the actions. Claudia Diaz presented some results about website fingerprinting on Tor, i.e. how to collect information from a client that is using Tor. Website fingerprinting consists of recognizing the traffic patterns of specific web pages without other information. Ian Goldberg presented DP5. DP5 allows clients to register and interact with friends without leaking metadata.

"DP5 is the Dagstuhl Privacy-Preserving Presence Protocol P â€” the extra P is for extra privacy" #PSC pic.twitter.com/eAgyLpxkB1
— GF(¯\_(ツ)_/¯) (@hdevalence) December 10, 2015

Next, Christian Grothoff talked about the development of GNUnet and how it is built. One little part of the work, i.e. the complexity of the development of any library he presented a dependency graph of GNUnet. Then, in the last part of the workshop, Jacob Appelbaum presented a "modest post-Snowden proposal." In fact, the correct term is post-Snowden revelations, since Snowden is still alive. In this talk, he presented a different view of the situation, that is, we did not need to hide the information. In the last talk of the day, Nathan Freitas presented an overview of the Guardian Project and his experience with development of secure apps for Android. He showed that the Guardian Project has SDKs for encryption in mobile applications such as SQLCipher to encrypt databases.
The event presented an ethical view of cryptography and which path cryptography is following. The moment for this discussion could be reached in better time, after the essay from Phillip Rogaway called "The Moral Character of Cryptographic Work," cryptographers need to pay more attention in their work. In simple words, this means that we should care about what we are doing and how this will impact society. We need to ask ourselves what kind of legacy we want to leave behind or what benefits we want to bring to the world in general.

Slides for some speakers' talks are available on the Post-Snowden Cryptography Workshop website.

No backdoors at #PSC pic.twitter.com/IbOTAMrgBx
—GF(¯\_(ツ)_/¯) (@hdevalence) December 9, 2015

Videos

The videos of the talks are available: https://psc2015videos.projectbullrun.org/

Friday, December 4, 2015

Workshop on Lattice Cryptography

It is the day after AsiaCrypt 2015 and there are two workshops being held in Auckland. The one which is most relevant for my research is that on Lattice Based Cryptography; which consists of four talks. One by Jung Hee Cheon on "Multilinear Maps and Their Cryptanalysis", one by Amit Sahai on "Obfuscation", one by Fre Vercauteren on "Weak Instances of RLWE" and one by Martin Albrecht on "Small Secret LWE".

Cheon first described a very naive version of multi-linear maps and then went on to show how this can be attacked by creating non-trivial encodings of zero, and then taking greatest common divisors. Then he went on to generalise this naive scheme to the CLT scheme (which is a bit like the DGHV FHE scheme). The naive attack does not apply to CLT as the dimension increased, meaning taking naive greatest common divisors would not work. Cheon then showed how to extend the naive attack to the CLT case by turning the gcd extraction into an eigenvalue extraction problem. This done by building quadratic forms which represent encodings of zero. The result is that for the CLT scheme one can break the equivalent of the DLP problem.

Cheon then went on to present the GGH scheme, which is a bit like the NTRU FHE scheme; except the instead of encrypting via c=[(m+r*p)/z] for an integer p, one encodes via c=[(m+r*g)/z] for a polynomial g which generates the ideal lattice <g>. Modifying the prior attack in this situation allows us to recover a basis of this ideal. But finding a short vector in this lattice can be hard. However, by utilizing encodings of zero one can actually solve the equivalent of the CDH problem.

Both attacks rely heavily on the presence of encodings of zero. So the attacks do not apply to situations in which one does not publish such encodings; i.e. applications such as indistinguishability Obfuscation (iO).

Amit Sahai then gave an introduction to iO; he motivated it via an analogy of an attacker who captures your brain and is able to read and tamper with every neuron, yet we still do not want the attacker to know what we are thinking about. This is the problem which obfuscation tries to solve in the computing realm. Martin pointed out that this would be a great way to produce malware!

Amit then put Multi-Party Computation within this analogy. He suggested we can think of MPC as protecting our brain against the tampering adversary, by dividing the brain up into portions. As long as one portion is kept out of the adversaries control we can use MPC to protect our thoughts. Obfuscation tries to do the same, without there needing to be an honest part of the brain.

Any program which is suitable for obfuscation must be unlearnable from query access to the program. Since otherwise the adversary could learn the program from the input/output behaviour. However, black-box obfuscation has been shown to be impossible; essentially because their are contrived programs which are unlearnable but for which one cannot produce an obfuscation, since any obfuscated program has an explicit attack against it.

This is why iO as a concept was presented; since it at least seems possible to achieve. The idea is that if you have two equivalent programs and we obfuscate one of them, then the adversary cannot tell which one we obfuscated. One way of thinking of this is as a psuedo-canonicalizer. The question is what useful can one do if we could create an obfuscator which satisfied the iO definition. Amit gave the application of building demo versions of software, without needing to re-engineer the software.

Fre Vercauteren then discussed a more in depth analysis of a paper from CRYPTO this year on Weak Instances of Ring-LWE. The CRYPTO paper gave instances where decision Ring-LWE was easy, but search appeared to be hard. However, Fre's talk showed that the search problem was in fact easy from the start, and thus the CRYPTO paper was less surprising than it at first seemed to be. As with all things on Ring-LWE the question arises as to how to choose the error distributions.

Fre spend the first part of his talk discussing the geometry of number fields, and in particular the Minkowski embedding. The Ring-LWE problem generates errors according to a discrete Gaussian distribution in the Minkowski embedding, Poly-LWE is to generate the errors according to a discrete Gaussian in the polynomial embedding.

Eisentrager et al discussed cases for which Poly-LWE was easy, these were then extended by Elias et al to special cases of decision Ring-LWE. They did this by mapping the special Ring-LWE instance to a special Poly-LWE instance.This is done by pulling back the problem from Ring-LWE to Poly-LWE via the matrix which defines the Minkowski embedding. The Poly-LWE attack requires that q is larger than f(1), and hence q will "kind of show up" in the coefficients of the defining polynomial f. So the fields being attacked, are very special indeed.

(This post originally appeared in the BristolCrypto blog)

Tuesday, December 1, 2015

Optimality of frequency analysis on deterministically-encrypted database columns

This fall at CCS '15, Naveed, Kamara, and Wright published a paper about their evaluation of some passive attacks on encrypted databases (EDBs). (There are posts about their paper on Kamara's blog and on the Bristol Cryptography Blog.) Their paper is important because it illustrates the power of simple attacks on columns encrypted with property-preserving encryption. All the attacker needs is a snapshot of the encrypted database and some relevant, publicly-available auxiliary data. In this post, I briefly explain an observation that Kenny Paterson and I made about the optimality of frequency analysis.

About the attacks

Two of the attacks Naveed, Kamara, and Wright evaluated, frequency analysis and ℓ_p-optimization, operate on deterministically-encrypted columns. (Of course, the appeal of deterministic encryption is that it naturally allows equality testing on encrypted data.) Both of these attacks are passive—they do not rely on seeing any queries to the EDB, and they require only a copy of it at rest. They do, however, require an auxiliary data set. In essence, an auxiliary data set is one that should have the same distribution as the target data (i.e., the data underlying the encrypted columns). Here's the main idea of the two attacks:

frequency analysis decrypts the column by matching the most frequent ciphertext with the most frequent plaintext from the auxiliary data (and so on for the other less frequent ciphertexts).
ℓ_p-optimization decrypts the column by matching the frequencies of the ciphertexts with the frequencies of the auxiliary plaintexts in a way that minimizes the ℓ_p-distance of their histograms. (See Naveed et al.'s paper for details.)

Attacking a deterministically-encrypted database column using this type of auxiliary data is like attacking a monoalphabetic substitution cipher given plaintext letter frequencies. Consider the following example: suppose you intercept the following ciphertext of some English message that was encrypted with a keyword cipher.

XKSREJJKQBKLPQKCSDHYECPQQPNVKNHYVDQBKSQDILNKUDJAQBPDJYDUDYSEHOQKQBEQPJYPERBKTSOISOQVKNGTKNBDOKVJDILNKUPIPJQEJYEQQBPOEIPQDIPOBENPEAPJPNEHNPOLKJODCDHDQXTKNEHHBSIEJDQXKSNLENQDRSHENYSQXCPDJAQKEDYQBKOPQKVBKIVPQBDJGVPREJCPIKOQSOPTSH

The simplest attack is well-known: just look up English letter frequencies and you can probably crack the cipher by matching the most frequent letter in the ciphertext with the most frequent letter in the alphabet (E in English). A slightly more powerful attack would involve looking at which pairs of letters occur most often next to each other, and using bigram statistics (most common bigram in English: TH) to crack the cipher. The question of whether frequency analysis is an optimal attack on a keyword cipher is just not relevant since there's other information to exploit.

In the EDB context, though, the letters (i.e., individual data values) are not ordered! The ciphertext would look more like this, with one "letter" per record:

Attribute 1
0x523e
0x0852
0xa8ef
0x0852
0x772a
0x523e

Even if you knew the alphabet (say, EU country names), higher-order frequency statistics (like of bigrams) just can't apply since there is no concept of order or adjacency.

Our observation

So, given a copy of an EDB column and an auxiliary data set that is correlated with it, what is the best we can do in terms of attacking a deterministically-encrypted column? What should we use—frequency analysis, ℓ_p-optimization, or something else? We propose applying maximum likelihood estimation (MLE) to answer this question. MLE is a statistical technique for finding the value of a parameter of a probability distribution function that maximizes the likelihood¹ of seeing some particular sample data having that distribution.

Why does it apply here? Well, we have samples of data that should have the distribution of the auxiliary data, but they've been relabelled (i.e., encrypted). The attacker wants to find the mapping between plaintext values (from the auxiliary data) and the relabelled values (from the encrypted target data). In other words, the distribution's parameter that we want MLE to find is a permutation.

Which permutation makes the target data most likely? The cool thing—and with hindsight, the unsurprising thing—is that according to the maximum likelihood principle, frequency analysis is exactly the optimal technique² that arises from maximizing the likelihood function!

You can read the details in the short note we posted on ePrint.

¹ What's the difference between probability and likelihood? Informally, probability matters when we're given a certain model and want to know about samples from it. Likelihood, on the other hand, matters when we're given the data, and want to know about the model (or distribution) from which it arises. Here's a simple example to illustrate the difference. Given a fair coin, we could ask (and compute) what is the probability of getting 2 heads when tossing it 10 times. On the other hand, suppose we have some coin of questionable origin. We toss it 10 times and observe 2 heads. Then, we could ask (and compute, using MLE) whether the coin is fair.

² MLE may be optimal, but it is not infallible. If the sample size is very small (i.e., if there are few entries in the column of target data), then MLE won't perform well.