Tuesday, March 21, 2017

Learn You a GPU For Great Good! (Part 1?)

Side note: I stole the title from the most famous, most awesome Haskell book I know.

If you are reading this blog you are most likely interested in cryptography. Today I want to convince you that GPUs are also, well, pretty awesome. I have personally done a few crypto-related projects using GPUs and this post is my attempt at crystallizing the knowledge and experience I built up during that time.

The purpose of this post is to provide a simple, meaningful introduction to developing GPU-accelerated programs. We will discuss setup, the two primary frameworks, basic code examples and development workflow as well as some optimization tips. In the end, I want to show that developing this type of application is not hard at all. If the post is successful I may do a follow-up with a few more detailed and tricky examples. Throughout this post I will assume you are familiar with basic C and/or C++, as the code examples will be in that language. I will not focus too much on developing complicated kernels or on how to exploit multi-dimensional parallelism; I will leave that for a later post. Instead, I will focus on a few things that may make your first steps towards GPU programming easier, as well as a few things that may help it scale a bit better.

The Why & When

GPU programming was originally designed for, and should be used for, large-scale parallel computation problems. The more parallelism you can utilize, the better GPUs will fit your problem. The simplest example is probably a loop over a very large collection of elements, performing a simple operation on each one independently.

For large-scale parallel computation problems I tend to think of three different architectural setups that you can use (they also mix). The simplest is utilizing multi-core CPUs (possibly over many machines). This has the shortest development time due to its familiarity and ease of use and is suitable for many applications. CPUs are of course trivially available. On the other end of the spectrum is the development of custom hardware clusters, utilizing many FPGAs or even ASICs. Development time is fairly long, even for experienced hardware designers; the upside is that this very likely gives you optimal performance.

GPUs fall somewhere in the middle. Development time is very close to that for CPUs; the primary constraint is availability. It is simply easier to get access to CPU clusters. However, these days you can also rent all the GPU power you need from Amazon EC2 instances, as was done for the recent SHA1 collision. If you solve the availability issue, you can get a lot of bang for your buck performance-wise.

The How

First, you need to get your hands on a machine with a GPU, preferably a remote machine or otherwise a machine with more than one GPU. The reason is that if your GPU is also driving your desktop environment, programming errors may cause your computer to hang or crash. A dedicated GPU also allows you to run long-lasting kernels more easily and gives you more reliable performance.

CUDA vs OpenCL

Assuming you have a GPU in your system, your next choice is between CUDA and OpenCL, two programming environments for GPU programming. If you do not plan to use an NVIDIA GPU you are stuck with OpenCL, whereas you otherwise have the choice of using CUDA.
Having used both for different projects at different times, I can say that both are perfectly usable and that the differences are mostly superficial. OpenCL is more portable and integrates more easily into existing projects; CUDA has the superior tool-chain.

The examples in this post will be for CUDA, as it typically involves less boilerplate. Also, we will use the more basic CUDA C++ implementation, as it provides a better basis for understanding than special-purpose libraries. This is particularly relevant if you want to perform computations that are not a native part of these libraries, which is definitely true if you want to, for instance, compute CPA-like correlations in parallel.

Hello World

I am not one to break tradition and thus we start with the "Hello world" of classic parallel programming, namely SAXPY. Or, more formally, given input vectors \(\textbf{x}, \textbf{y}\) of length \(n\) and a scalar \(a\), compute the output vector \(\textbf{z}\) where \(\textbf{z} = a\textbf{x} + \textbf{y}\).
First let us consider the basic C implementation of this function, where \(z = y\), i.e., we update \(y\) using the scalar \(a\) and a vector \(x\).

void saxpy(int n, float a, float * __restrict__ x, float * __restrict__ y) {
  for (int i = 0; i < n; ++i) {
    y[i] = a*x[i] + y[i];
  }
}

// ...
int n = 1 << 20;
// allocate vectors x,y of n elements each.
// ...
saxpy(n, 3.14, x, y);

Nothing too special going on here. We simply iterate over every element and perform our update with the scalar \(a=3.14\). Note the use of the __restrict__ keyword to indicate that x and y point to different objects in memory. Just giving the compiler a helping hand, which is generally a useful thing to do. Anything that makes it behave less like a random function, I say.

Conversion to CUDA is straightforward. In GPU programming you are always defining what a single parallel unit of computation is doing; this is called a kernel. When programming such a kernel, you are computing from the point of view of the thread. Before delving in too deep, let us see what the CUDA-equivalent code looks like.

 1: __global__
 2: void saxpy(int n, float a, float * __restrict__ x, float * __restrict__ y) {
 3:   int i = blockIdx.x*blockDim.x + threadIdx.x;
 4:   if (i < n) {
 5:     y[i] = a*x[i] + y[i];
 6:   }
 7: }
 9: // ...
10: const int n = 1<<20;
12: // allocate and initialize host-side buffers x,y
13: // ...
15: // allocate device-side buffers x,y
16: cudaMalloc((void **)&d_x, sizeof(float) * n);
17: cudaMalloc((void **)&d_y, sizeof(float) * n);
19: // copy host-side buffers to the device
20: cudaMemcpy(d_x, x, sizeof(float) * n, cudaMemcpyHostToDevice);
21: cudaMemcpy(d_y, y, sizeof(float) * n, cudaMemcpyHostToDevice);
23: // compute the saxpy
24: const int threads_per_block = 256;
25: const int number_of_blocks = n / threads_per_block;
26: saxpy<<<number_of_blocks, threads_per_block>>>(n, 3.14, d_x, d_y);
28: // copy the output buffer from the device to the host
29: cudaMemcpy(y, d_y, sizeof(float) * n, cudaMemcpyDeviceToHost);
31: // free the device buffers
32: cudaFree(d_x);
33: cudaFree(d_y);
35: // clean up the x, y host buffers
36: // ...

Let us consider the kernel first, denoted by the simple fact of the function definition starting with __global__. The parameters to the function are the same as before, nothing special there. Line 3 is a key first step in any kernel: we need to figure out the correct offset into our buffers x and y. To understand this, we need to understand CUDA's notion of threads and blocks (or work groups and work items in OpenCL).
The Grid
The CUDA threading model is fairly straightforward to imagine. A thread essentially computes a single instance of a kernel. These threads form groups called blocks that have somewhat-more-efficient inter-thread communication primitives. The blocks together form what is known as the grid. The grid can have up to three dimensions, i.e., the blocks can be ordered into \((x,y, z)\) coordinates. The same goes for threads inside a block, they can be addressed with \((x, y, z)\) coordinates as well.
Mostly though, I have tended to stick to 1-dimensional grids. This simply means dividing a vector of \(n\) elements into \(n/m\) sequential blocks of \(m\) elements each (even better if \(n\) is a multiple of \(m\)).
A quick note about warps (or wavefronts in OpenCL), which are a related concept. A warp is a unit of scheduling: it determines the number of threads that actually execute in lockstep. It is good practice to make your block size a multiple of the warp size, but other than that you should not worry overly much about warps.
In this case we find our thread's offset by multiplying the block id by the block size and then adding the offset of the thread within the block. The rest of the kernel is straightforward: we perform the same computation as in the original code but omit the for-loop. The conditional at line 4 makes sure we do not write outside the bounds of our vector, though that should not happen if we choose our grid carefully.

The rest of the code is the standard boilerplate that you will find in most CUDA programs. A key notion is that there is a distinction between buffers allocated on the device (the GPU) and buffers allocated on the host. Note that on line 26 we schedule the kernel for execution. The first two weird-looking parameters (within angle brackets) are the number of blocks and the block size respectively.

Improving & Testing "Hello World"

To showcase a few things that I found helpful we are going to improve this simple code example. And because this is my blog post and I decide what is in it, I get to talk to you about how to test your code. GPU code tends to be a bit flaky: it breaks easily. Thus, I argue that creating simple tests for your code is essential. These do not have to be very complicated, but I recommend that you use a proper framework for writing unit tests. For C++ I have had success with Catch and doctest, both single-header libraries that you include in your project.

Before we include these tests however, I propose that we make two more changes to the program. First of all, we are going to add better error checking. Most of the cudaFoo functions return a value indicating whether the operation was successful; on failure, that value tells us which error occurred.

#define check(e) { _check((e), __FILE__, __LINE__); }

inline cudaError_t _check(cudaError_t result, const char *file, int line) {
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s (%s:%d)\n", cudaGetErrorString(result), file, line);
    assert(result == cudaSuccess);
  }
  return result;
}

And then simply wrap the cudaFoo functions with this check macro. Alternatively, you may want to rewrite this to use exceptions instead of asserts. Pick your poison.

Another thing I would recommend adding if you are doing CUDA in C++ is wrapping most of the allocation and de-allocation logic in a class. I generally take a more utilitarian view of classes for simple pieces of code and thus the following is not necessarily idiomatic or good C++ code.

class Saxpy {
public:
  const int n;
  float *d_x;
  float *d_y;
  float *x;
  float *y;

  Saxpy(const int n) : n(n) {
    x = new float[n];
    y = new float[n];

    check(cudaMalloc((void **)&d_x, sizeof(float) * n));
    check(cudaMalloc((void **)&d_y, sizeof(float) * n));
  }

  ~Saxpy() {
    check(cudaFree(d_x));
    check(cudaFree(d_y));

    delete[] x;
    delete[] y;
  }

  Saxpy& fill() {
    for (int i = 0; i < n; ++i) {
      x[i] = i / 12.34;
      y[i] = i / 56.78;
    }

    check(cudaMemcpy(d_x, x, sizeof(float) * n, cudaMemcpyHostToDevice));
    check(cudaMemcpy(d_y, y, sizeof(float) * n, cudaMemcpyHostToDevice));

    return *this;
  }

  Saxpy& run(float a) {
    const int threads_per_block = 256;
    const int number_of_blocks = n / threads_per_block;
    saxpy<<<number_of_blocks, threads_per_block>>>(n, a, d_x, d_y);

    return *this;
  }

  Saxpy& load() {
    check(cudaDeviceSynchronize());
    check(cudaMemcpy(y, d_y, sizeof(float) * n, cudaMemcpyDeviceToHost));
    return *this;
  }
};

Why we went through all this trouble becomes clear if we put this in a test (I am using doctest syntax as an example).

TEST_CASE("testing saxpy") {
  float a = 3.14;
  const int n = 1024;
  Saxpy s(n);

  s.fill().run(a).load();

  for (int i = 0; i < n; ++i) {
    // we didn't keep the old y values so we recompute them here
    float y_i = i / 56.78;
    // the approx is because floating point comparison is wacky
    CHECK(s.y[i] == doctest::Approx(a * s.x[i] + y_i));
  }
}

That is a, for C++ standards, pretty concise test. And indeed, our tests succeed. Yay.

[doctest] test cases:    1 |    1 passed |    0 failed |    0 skipped
[doctest] assertions: 1024 | 1024 passed |    0 failed |

A Final Improvement

Because this post is already too long I will conclude with one last really nice tip that I absolutely did not steal from here. Actually, the NVIDIA developer blogs contain a lot of really good CUDA tips.
Our current kernel is perfectly capable of adapting to a situation where we give it less data than the grid can support. However, if we give it more data, things will break. This is where grid-stride loops come in. The idea is to loop over the data one grid at a time while maintaining coalesced memory access (which is something I will write about next time).

Here's our new kernel using these kinds of loops.

__global__
void saxpy(int n, float a, float *x, float *y) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {
    y[i] = a * x[i] + y[i];
  }
}


I hope this convinces you that GPU programming is actually pretty simple. The kernel here is pretty trivial, but as long as you understand that within the kernel you can basically write C/C++, you are going to do just fine.

If there is a next post I will write more about memory in GPUs, a very important topic if you want your code to actually run fast. If you want to skip ahead you should read about the different types of memory (global, local, shared, texture, etc.) and what memory coalescing entails.

Until next time.

Tuesday, March 14, 2017

What are lattices?

Do all the cool post-quantum cryptographers around you talk about lattices and all you can think about is a good salad?

Don't worry! You can be part of the hip crowd at the next cocktail party in just three easy steps!

First, I will tell you what a lattice is. It is actually really easy.
Second, we will prove a tiny fact about lattices so that those poor neurons that did all the work in the first step have some friends.
Third, nothing. There isn't even a third step. That's how easy it is!

So, let's tackle the first step. What, actually, is a lattice?
The answer could not be easier. It is simply a grid of points. Like this:

Easy, right?
The formal definition is just as simple:

Given $n$ linearly independent vectors $b_1, b_2, \ldots, b_n \in \mathbb{R}^m$, the lattice generated by them is defined as
$$\mathcal{L}(b_1, b_2, \ldots, b_n) = \left\{ \sum_{i=1}^{n} x_i b_i \;\middle|\; x_i \in \mathbb{Z} \right\}.$$

We can add those so-called basis vectors $b_i$ to our diagram and will readily see how each point is constructed.

And now you already know what a lattice is! In the next step we will talk about some more definitions and show the proof of a theorem.

We will be discussing something related to the question of what the length of the shortest non-zero vector in a given lattice $\mathcal{L}$ is.

$\lambda_1(\mathcal{L})$ is the length of the shortest non-zero vector in $\mathcal{L}$. We use the Euclidean norm for the definition of length.
We also need the volume of a lattice, which we are not going to define formally; we will just take it to be the volume of the $n$-dimensional parallelepiped spanned by the basis vectors, as in the picture:

Now we can already prove Blichfeldt's theorem! It makes a statement about where we can find a lattice point and roughly goes as follows:
For a lattice $\mathcal{L}\subseteq \mathbb{R}^n$ and a set $S \subseteq \mathbb{R}^n$ where the volume of $S$ is bigger than the volume of the lattice, there exist two distinct points $z_1, z_2 \in S$ such that $z_1 - z_2 \in \mathcal{L}$.
And here is the idea of the proof in a graphical way! First we simply draw an example for the set $S$ in blue:

Now we cut up our set and move all the parts into our first parallelogram!

As you can see there is quite a bit of overlap, and the fact that the full volume of the set is bigger than the volume of the lattice (aka the parallelepiped) guarantees there will be at least some overlap somewhere. But if there is overlap, then we have two points that have been moved by multiples of the basis vectors and which are now at the same position!

And therefore their difference must be a multiple of base vectors as well and we have found a difference which is a lattice point! Hooray!

Now we will use a little trick to expand this result and make a statement about an actual point from the set lying in the lattice, not just the difference of two points!

What we will show is called Minkowski's Convex Body Theorem and it states roughly
Let $\mathcal{L}$ be a lattice. Then any centrally-symmetric convex set $S$ with volume bigger than $2^n$ times the volume of the lattice contains a nonzero lattice point.
So after this we will know that such a set contains a lattice point, and using a simple sphere as the set allows us to put a bound on $\lambda_1(\mathcal{L})$.

Let's get to it then!
First we blow up our lattice to twice its size along every dimension!

Now we add our centrally-symmetric convex set $S$. Again in blue.

And because we picked the volume of $S$ to be bigger than $2^n$ times the volume of the lattice we still get the colliding points from the last theorem EVEN in the double size lattice!

But since our set $S$ is symmetric about the origin, if $z_2 \in S$ it follows that $-z_2 \in S$, and because it is convex, $z_1, -z_2 \in S$ implies $\frac{z_1 - z_2}{2} \in S$. And because we've doubled the size of the lattice and $z_1 - z_2$ is in the bigger lattice, it follows that $\frac{z_1 - z_2}{2} \in \mathcal{L}$, and we have shown that a point in our set is also in the lattice.

You can see it quite clearly in the next picture.

As already stated above if we use a simple sphere for this set we can give a bound on $\lambda_1(\mathcal{L})$ based on the volume of $\mathcal{L}$. It is known as Minkowski's First Theorem and states:
$$\lambda_1(\mathcal{L}) \leq \sqrt{n} \cdot \mathrm{vol}(\mathcal{L})^{1/n}.$$
Isn't that great?!

If you have now completely fallen in love with lattices but would appreciate real math instead of pictures, worry not!
You will find steps 3 - 14 on Oded Regev's course website! Enjoy!

And in case you accidentally run into him at a conference here is the picture from his website so you don't miss the opportunity to say hello.

Saturday, March 4, 2017

GMP - Arithmetic without limitations

Theoretical research in cryptography can be extremely exciting but, on the other hand, a lot of effort should be devoted to correctly and securely implementing cryptosystems that help protect people's privacy. The presence of a bug in software can be very annoying, but if that software deals with crypto and security, the consequences can be catastrophic.

For many cryptosystems based on number theoretic problems, one of the first issues that a developer faces is the size of the numbers involved in the computations. These numbers are composed of thousands of bits (or hundreds of digits) and they need to be added, multiplied, raised to some powers, etc. Performing these operations is complicated, and doing it efficiently is even more complicated. But thankfully there are tools (libraries) that come to the rescue.

In this post, I will talk about GMP, The GNU Multiple Precision Arithmetic Library. Quoting from the library's homepage, "GMP is a free library for arbitrary precision arithmetic, operating on signed integers, rational numbers, and floating-point numbers. There is no practical limit to the precision except the ones implied by the available memory in the machine GMP runs on. GMP has a rich set of functions, and the functions have a regular interface." This library is written in C but also offers an interface for C++ code, and we are going to focus on this particular one. The final goal of this post will be to implement a simple but complete example of an RSA cryptosystem, from key generation to encryption and decryption of messages.
But, before moving on, a little disclaimer: implementing cryptographic solutions is not an easy task, and using homemade solutions is generally considered a bad idea because of bugs and lack of scrutiny by other members of the community. The sole goal of this post is to show some of the potential of the GMP library.

  • Data type -- the basic data type that we will deal with is mpz_class. We can think of it as an arbitrary-sized integer. Some of the functions we will talk about only accept parameters of type mpz_t, which is the corresponding C type. Luckily, we can always call the function get_mpz_t on an mpz_class object to obtain the corresponding mpz_t object;
  • Generating random numbers -- first of all, we consider the problem of generating random numbers of a specific size (measured in bits). GMP's random number class offers a function called get_z_bits, which takes as input a number of bits and returns a random number of that size. But be careful! This means that the returned number takes up at most that many bits. So calling it with argument '3' means that the returned value will be a number between 0 and 7. Instead, what we want is a number that is represented by exactly 3 bits, so 4, 5, 6, or 7. Doing that is quite straightforward: given a number of bits k, we want a random number in the range $\left[2^{k-1}, 2^k\right)$. In order to handle this exponentiation (since $k$ can be very big), we are going to use the function mpz_ui_pow_ui, which raises an unsigned int base to an unsigned int power and writes the result to an mpz_t object. In order to draw a random number, we are going to initialize a random number generator with this code:
    gmp_randclass rng(gmp_randinit_mt);
    Note: this example uses the Mersenne Twister PRNG, which is known to be insecure for cryptographic applications. After this is done, our function for generating a random number with a specific number of bits can look like the following
    mpz_class generate_rand_exact_bits (const unsigned int size) {
        // 2^(size-1)
        mpz_class out;
        mpz_ui_pow_ui(out.get_mpz_t(), 2, size-1);

        // 2^(size)
        mpz_class bound;
        mpz_ui_pow_ui(bound.get_mpz_t(), 2, size);

        // random offset in [0, 2^(size-1))
        mpz_class rand = rng.get_z_range(bound-out);

        return (out + rand);
    }
  • Generating prime numbers -- generating random numbers is nice but often not sufficient; we need to generate random-looking prime numbers. This task is obviously harder because there are more constraints that have to be satisfied. Thankfully, GMP comes to the rescue once again: we can take advantage of the function mpz_probab_prime_p, which takes as input an mpz_t object that represents the candidate prime and a number that specifies how many times the Miller-Rabin primality test has to be executed, and returns 0 if the candidate is not prime, 1 if it is probably prime and 2 if it is surely prime. Repeating the test 50 times gives a very high degree of confidence in the primality of the candidate, so we are going to generate a random number and test it until we find a candidate that passes all the tests. The code for a function that performs this task can be the following
    mpz_class generate_prime_exact_bits (const unsigned int size) {
        mpz_class candidate;
        do {
            candidate = generate_rand_exact_bits(size);
        } while (mpz_probab_prime_p(candidate.get_mpz_t(), REPEAT_MILLER_RABIN) == 0);
        return candidate;
    }
  • Modular exponentiation -- the goal of this step is to take a base $b$, an exponent $e$ and a modulus $m$ and compute $b^e \bmod m$. GMP already offers a function called mpz_powm that does what we need, but we can "beautify" it a little bit by wrapping it in another function like the following:
    mpz_class mod_exp (const mpz_class base, const mpz_class exp, const mpz_class mod) {
        mpz_class out;
        mpz_powm (out.get_mpz_t(), base.get_mpz_t(), exp.get_mpz_t(), mod.get_mpz_t());
        return out;
    }
  • Encryption and decryption -- once we have the function mod_exp for modular exponentiation, RSA encryption and decryption are trivial:
    mpz_class rsa_encrypt (const mpz_class pk, const mpz_class m, const mpz_class mod) {
        return mod_exp (m, pk, mod);
    }

    mpz_class rsa_decrypt (const mpz_class sk, const mpz_class c, const mpz_class mod) {
        return mod_exp (c, sk, mod);
    }
Now, we will put everything together and in the main function we will do the following:
  1. generate two numbers rsa_p and rsa_q which are prime and composed of an arbitrary (here 2048) number of bits;
  2. multiply together rsa_p and rsa_q to obtain the modulus rsa_mod;
  3. calculate Euler's totient function of the modulus, rsa_totient, by multiplying (rsa_p - 1) and (rsa_q - 1);
  4. choose a public exponent rsa_pk: here we are going to use 65537;
  5. compute the secret exponent rsa_sk by calculating the multiplicative inverse of rsa_pk modulo rsa_totient. For this task, we can use the function mpz_invert offered by GMP;
  6. generate a random 32-bit message;
  7. encrypt it and decrypt it to show that the message is recovered correctly
The full code is available here. Assuming that it is saved in a file called main.cpp, it can be compiled with the following command:
g++ -Wall -o main main.cpp -lgmp -lgmpxx

Finally, here is a possible output of the program:

p: 23792449500522212951496314702038682121347194317959904876008718825662914075875899470581320800653437713536069408253411676817620646430212407540403369694134781759107125478056393908946127803961090931948553568336876469309822272887045095561051090700123699901361196085633529272849875353866567000752182158071024473204684307004694765510326847369963753315267894351158697231078788807690227802109570252088084877486698404206015111529669790467025281249212356222609928349702116676081721161733842190421613247070310376995600752548098115546539048072846637620536666595226297844398938704443953343779702008604938853732596401295126782941987

q: 29112805945439121371217246384724375309040486330485414315973357532238818896560861590487525988011993439077520550311524035984480970497774071062272700839286098068182559422588276697743896386738257138746524032717815962171954684796109782069096410172414189261142167684350607916360394902788649692558546060677368837462937413290555419196322614840219369972120722061013641236619226520826654042088554099043918184746705620357166310881550460726728460067235575345180739677700783783242815237362918214982959525838694224404066952951953880663045068810920848575666730532567285210572753317549500155052830379131918899635413209365556429526661

Modulus: 692664965275363134868164310308043386530889977137116066460956916982191344093600913377382983586757778122293014614580541617686926517513948142377215870615527742653194700493457784251321237254925462486660424748287684799960005285232140262663908204076330275297015032277805385557802277182249107190208144962072627103333702166712562912412455374055377823609059561406201023023953754548932692770545184532804691011451566435370564027281954154745605054059335653411252856827668944053539165109486066934327992280592181117681540307774187890207024741310917968286855939570200842820225004987124097822236170460663883584621535528990367892742357001025439624939219952367312627679661863266398010079250545044069312072321631423895274591504669908947731669841676290437961150479342566664751624244101084250756141019398564481030304316812644414619975245318775213481846709606949913834082310037698177108248900032520946078124914227912956069612730720184002438971635446900598624230868985779777067182408067527262398836477646176180149265564531633256045959807480886640631812690951578507074608608075924577570127277599939833799849839997452314926854023853061044555560190656563707831534988797823714162801693363417381512465247240619960211197273430707134124687895683306648515432815407

Totient: 692664965275363134868164310308043386530889977137116066460956916982191344093600913377382983586757778122293014614580541617686926517513948142377215870615527742653194700493457784251321237254925462486660424748287684799960005285232140262663908204076330275297015032277805385557802277182249107190208144962072627103333702166712562912412455374055377823609059561406201023023953754548932692770545184532804691011451566435370564027281954154745605054059335653411252856827668944053539165109486066934327992280592181117681540307774187890207024741310917968286855939570200842820225004987124097822236170460663883584621535528990367892742304095769993663604897238806225864622231475585749564760058562967711410339349194662834205744716004477795118079883111354725159048862414580186148948173567663370928851334497919810423614292621945066549280167717720521050364932649266758956452162536825639219086396668750961940935703957656300852919419991965254045660967825180303374046162336317566884059120678910850226498009948160851632383720333508904913956745247482616068631268540358255880854866759476646002336609572536933340525303598355554521449451080152039954160522951063655835325404680939946676605489966289587929410275548597966757698440898319397266934527673695987832220346760

Message: 2699077189

Ciphertext: 681377152725050864187889498396285584066530595533108416812276566730393956984527765118715661493472361295191987186484833223806930407591255557303290975217436524165953995313143542640632193731686512007342253568013661854913604989704338096439627987512846172004801196181232822308088685943243400345975426883909096147339045273787325984512823244957423398055174828695879849942493712502158108081782163024167101957978799078692054106581518638832961868215165854308314877242675212188837955873975174727647247990393606326403876490500815929414980002030276198625866328310973421532151059071628038306926281825502507926893439321472535762059281448571817796845306517579637703981694846774635322811199431875663174162751403156127612351378340405058549628085390124779083962094777489309709766752481986327599769351343828713992159924065615106068412838813886565270542810886921193120790895042994364186375596015716254331672407180666306582326345460867181975252903332881845979999145141098787286539117576791635044116865176716309063466576180971588107298159543491564320012605624880056428982408094160945322909734375254637560779872935691072801422766106105016077161030768880774211449289008906840655161792691129673110760644874000744631018536155622276779954270891913007605935768473

Decrypted: 2699077189

Monday, February 27, 2017

Side-channel analysis

For many years, cryptographers and industry implemented crypto algorithms on microcontrollers and ASICs simply by translating the algorithms into assembly or RTL code. The only thing that mattered was the correct functionality of the design. However, in the mid-to-late nineties some interesting attacks on cryptographic implementations were shown that were able to completely break their security and reveal secret keys, while the underlying cryptographic algorithm remained mathematically unbroken. First, the timing attack was introduced: the attacker was able to discover information based on the time a server took to respond to a query, since this time depended on the secret key that was used. The simplest application of this targets the square-and-multiply algorithm used for RSA exponentiation. For every bit of the exponent, either a square, or a square and a multiply operation is performed. This results in different execution times for different exponents.

Kocher et al. introduced a new type of side-channel attack by revealing that information obtained from power measurements can also be used to extract secret values from devices such as smart cards. We start by assuming that the power signal consists of random noise plus a deterministic component dependent on the processed value $x$:

$P = L(x) + R$

The random noise $R$ is modeled as a Gaussian distribution with zero mean, and the deterministic, key-dependent value $L(x)$ is usually the Hamming weight or Hamming distance of the processed value. Next, we choose an intermediate value to target. A good target for symmetric designs is the S-box output, since a small Hamming-weight difference in the input of the S-box leads to a not-so-small Hamming-weight difference in the output. In the original work the target was simply a one-bit value that could be either one or zero. The simplest example thus uses one output bit of the S-box, with the assumption that the power consumption differs between a one and a zero value. For example, take one S-box operation of the first round of AES. It is calculated using the equation below

$SO_i = S(pt_i \oplus key_i)$

Since we do not know the correct key value, we construct key guesses and calculate the targeted value, e.g. the first bit of the S-box output. For every key guess, we separate the power traces into two groups: one containing the traces for which the key guess produces target value one, and one where the target value is zero. Lastly, we compute the average of each group, reducing the influence of the noise, and subtract the two averages. If our key guess is wrong, the difference of means will be negligible, since the power traces in the two groups do not correspond to the correct values and are thus uncorrelated. The correct key guess, however, will produce a noticeable spike at the point in time where the first bit of the S-box output is computed.
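
The whole attack can be simulated in a few lines. The sketch below builds the AES S-box from its textbook definition (field inverse plus affine map, rather than hard-coding the table), models each power sample as a noisy Hamming weight, and ranks key guesses by the difference of means; the noise level, trace count, and leakage model are simplifying assumptions:

```python
import random

def gf_mul(a, b):
    # Multiplication in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def gf_inv(a):
    # Brute-force field inverse (fine for a demo over 256 elements)
    return 0 if a == 0 else next(x for x in range(256) if gf_mul(a, x) == 1)

def affine(a):
    # The AES affine transformation applied to the field inverse
    res = 0
    for i in range(8):
        bit = ((a >> i) ^ (a >> ((i + 4) % 8)) ^ (a >> ((i + 5) % 8))
               ^ (a >> ((i + 6) % 8)) ^ (a >> ((i + 7) % 8))) & 1
        res |= (bit ^ ((0x63 >> i) & 1)) << i
    return res

SBOX = [affine(gf_inv(x)) for x in range(256)]  # e.g. SBOX[0x53] == 0xED

def hw(x):
    return bin(x).count('1')

random.seed(1)
secret_key = 0x2B
plaintexts = [random.randrange(256) for _ in range(3000)]
# One simulated power sample per trace: Hamming weight of the true
# S-box output plus Gaussian noise
traces = [hw(SBOX[pt ^ secret_key]) + random.gauss(0, 0.5) for pt in plaintexts]

def difference_of_means(guess):
    # Partition traces by the predicted first bit of the S-box output
    ones = [t for pt, t in zip(plaintexts, traces) if SBOX[pt ^ guess] & 1]
    zeros = [t for pt, t in zip(plaintexts, traces) if not SBOX[pt ^ guess] & 1]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)

recovered = max(range(256), key=difference_of_means)
print(hex(recovered))  # the correct key byte produces the largest spike
```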

This very simple yet very effective attack opened a new area of research. Soon, more elaborate attacks were devised, attacking bytes instead of bits using correlation. More advanced attacks correlate with the variance instead of the mean, or with even higher statistical moments; these are usually impractical for orders higher than two, since the noise makes it much harder for the attacker to retrieve the secret value. Profiling of the device is sometimes performed in order to derive a more accurate model of the deterministic leakage function; such attacks are known as template attacks. Designers also came up with various ways of trying to thwart these attacks, such as increasing the noise level by adding redundant circuitry or operations, and shuffling the order of operations. Other methods include secure logic design styles such as WDDL, where logic cells are designed to consume the same amount of power regardless of their output. Methods on the algorithmic level, e.g. Threshold Implementations, try to ensure that no information leaks during the execution of the algorithm, regardless of the underlying leakage model of the implementation technology.

Since classical DPA and correlation analysis involve constructing guesses for all possible key bytes or nibbles, they can be very time consuming. This is where leakage tests became prominent. They tell whether a set of traces acquired with randomly chosen inputs can be distinguished from a set of traces acquired with a fixed input value, and work in a very similar way to one-bit DPA. Two sets of traces are obtained, one with random inputs and one with fixed inputs, and Welch's t-test is used to measure whether the two sets can be distinguished from one another. If $\mu,\ s$, and $n$ are the mean, standard deviation, and cardinality of a set, the t-value is calculated as follows:

              $  t = \frac{\mu_0 - \mu_1}{\sqrt{\frac{s_0^2}{n_0} + \frac{s_1^2}{n_1}}}$

Approximating the statistical degrees of freedom with $n = n_0 + n_1$, we can claim with confidence 0.99999 that the two sets are distinguishable when $|t| > 4.5$. This type of leakage test allows designers to be more efficient, since it reduces the testing time of the design. The downside of the approach is that leakage tests do not reveal how the secret values can be extracted, only that an attacker may be able to do so.
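
The t-statistic above is straightforward to compute. Here is a minimal sketch on simulated fixed-vs-random trace sets; the leakage offset and noise model are illustrative assumptions:

```python
# A direct implementation of the Welch t-statistic from the formula
# above, applied to a simulated fixed-vs-random leakage test.
import random
from statistics import mean, variance

def welch_t(set0, set1):
    m0, m1 = mean(set0), mean(set1)
    v0, v1 = variance(set0), variance(set1)  # sample variances s^2
    return (m0 - m1) / ((v0 / len(set0) + v1 / len(set1)) ** 0.5)

random.seed(0)
# Fixed-input traces leak a small constant offset; random-input traces don't.
fixed = [0.3 + random.gauss(0, 1) for _ in range(5000)]
rand = [random.gauss(0, 1) for _ in range(5000)]
t = welch_t(fixed, rand)
print(abs(t) > 4.5)  # |t| above the threshold -> leakage detected
```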

In the CHES 2016 best paper, Differential Computation Analysis: Hiding Your White-Box Designs is Not Enough, side-channel analysis was successfully used to break many white-box implementations. This time the analysis was performed on software execution traces (recorded memory accesses) instead of power traces, but the principle remained the same. Considering that white-box cryptography is currently being widely considered for many software-based designs, together with the plethora of IoT devices that require protection, side-channel analysis will remain an important aspect of the design process of cryptographic implementations. Researchers will thus continue to improve leakage detection and key recovery methods, further reducing the cost and time to market of new devices.

Sunday, February 19, 2017

Differential privacy

These days one reads a lot about the right to privacy. But what is it, and how does it differ between the real world and its digital description? Briefly, it is a person's right to have and, more importantly, maintain control over information about oneself, which is fundamental to a human's freedom of self-determination. In the digital context one has seemingly lost this right in exchange for free services: handy tools that satisfy people's urge to communicate with each other and stay in touch and up-to-date at all times. Since more and more people realize that behind such apps and web pages there are, naturally, business models, society increasingly demands to reclaim control over collected personal data that has been provided, unsolicited or involuntarily, to online merchants or telephony service providers at the time of use, for example. On the other hand, the corporations offering these services see collecting user information in databases as crucial: it lets them build good products and tailored recommendations and, especially in the age of big data and machine learning, make precise predictions by evaluating various functions on the data.

Statistical database queries have been studied for quite some time now, and in fact it often turns out to be sufficient to allow query access only to a population's aggregate data, not individual records, to derive useful statistics and achieve the desired functionality! A common approach is to allow only aggregate queries (i.e. range, histogram, average, standard deviation, ...) and, rather than returning exact answers about sensitive data fields, to specify intervals or give imprecise, statistically noisy counts.

You might have asked yourself, don't cryptographic techniques that have been omnipresent in this blog such as FHE (Fully Homomorphic Encryption) or secure MPC (Multi-Party Computation) solve this problem? Wouldn't it be possible to encrypt the user data yet for the service provider to compute useful statistics on it?
Indeed, it could be realized with the general FHE/MPC toolkit, but it is currently too inefficient to operate at the scale where statistics over huge quantities of data yield useful statements about a given database. Hence specific, more slender tools have been designed to close this gap. Whereas FHE avoids a trusted third party to compute (arbitrary functions or sophisticated statistics) on users' sensitive data, here one typically explicitly allows a trusted third party to collect and aggregate data in a privacy-preserving fashion. Users might do so, e.g., when installing an app, arguing that it benefits them through good default settings or an overall performance gain; or sharing information might simply be a requirement for using the service for free in the first place.

Differential privacy (often abbreviated DP) is a framework for formalizing privacy in statistical databases. It can protect against so-called de-anonymization techniques that try to identify an individual record by linking two separately released databases that have been stripped of (quasi-)identifiers and look innocuous. In particular, a priori knowledge or known partial history can be leveraged to derive more information from a released "anonymized dataset" than the purpose it was originally intended to serve.

Let's look at a mathematical definition that captures and formalizes the notion of privacy and has been studied in cryptography over the past 10 years. Let $d, n$ be positive integers and $f: X^n \rightarrow \mathbb{R}^{d}$ some statistic on a database comprised of $n$ records.

An algorithm $\mathcal{A}: X^n \rightarrow \mathbb{R}^{d}$ that computes $f$ is said to be $(\epsilon, 0)$-differentially private ($\mathcal{A}$ is $\epsilon$-DP, for short) if for all neighboring databases $x_{1} \sim x_{2}$ (i.e. $x_{1} \neq x_{2}$ differing in just one record) and all subsets $S \subseteq \mathbb{R}^{d}:$
$$\mathbb{P}[\mathcal{A}(x_{1})\in S]\leq e^{\epsilon } \cdot \mathbb{P}[\mathcal{A}(x_{2})\in S]$$
holds. Looking more closely at this definition $\forall x_{1}, x_{2} \in X^n, x_{1} \sim x_{2}:$
$$\mathbb{P}[{\mathcal{A}}(x_{1})\in S]
\leq e^{\epsilon } \mathbb{P}[{\mathcal{A}}(x_{2})\in S] \Leftrightarrow \frac{\mathbb{P}[{\mathcal{A}}(x_{1})\in S]}{\mathbb{P}[\mathcal{A}(x_{2})\in S]} \leq e^{\epsilon }\\ \Leftrightarrow \log \left(\frac{\mathbb P[\mathcal{A}(x_{1})\in S]}{\mathbb{P}[{\mathcal{A}}(x_{2})\in S]}\right) \leq {\epsilon}$$
we can identify the so-called "privacy loss" of an algorithm (in this context often called a mechanism) $\mathcal{A}$. In this setting $\epsilon$ can be called the privacy budget. In less exact terms it captures the following: by specifying the privacy budget, it is possible to control the level of privacy and make an algorithm respect this additional constraint via the techniques introduced below.
For those familiar with the concept of max-divergence, the definition of privacy loss is in fact the definition of $$D_\infty( A(x_1) || A(x_2)) := \max_{S\subseteq {\textbf supp}(A(x_1))} \log \left(\frac{\mathbb P[\mathcal{A}(x_{1})\in S]}{\mathbb{P}[{\mathcal{A}}(x_{2})\in S]} \right).$$
Furthermore, the multiplicative factor $e^\epsilon$ can be -- using a common approximation for small $\epsilon<1$ -- viewed as $1+\epsilon$:
$$e^\epsilon = \exp(\epsilon) = 1 + \epsilon + \frac{\epsilon^2}{2!} + \dots \approx 1 + \epsilon.$$
In less formal terms this definition says that a given result is approximately the same whether it is computed using the first or the second, neighboring database.
A more general definition, which adds flexibility -- but also makes the proofs less elegant and more technical -- is $(\epsilon, \delta)$-differential privacy, which holds when
$$\mathbb P[{\mathcal {A}}(x_{1})\in S]\leq e^{\epsilon } \cdot \mathbb P[{\mathcal {A}}(x_{2})\in S] + \delta.$$
Interpreting the definition, the goal of DP is that the risk of violating one's privacy should not substantially increase as a result of either appearing in a statistical database or not. Thus an analyst should not be able to learn any information about a record (e.g. from participating in an online questionnaire) that could not have been learned if one had opted not to participate, or had answered the questions randomly by rolling a die or flipping a coin rather than answering truthfully.
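
The coin-flipping idea in the last sentence is in fact a differentially private mechanism in its own right, known as randomized response. A short sketch (the population and seed below are arbitrary illustrative choices):

```python
# Classic randomized response: each respondent flips a coin, answers
# truthfully on heads, and otherwise answers with a second fair coin
# flip.  A "yes" is then reported with probability 3/4 by true
# yes-respondents and 1/4 by true no-respondents, so the mechanism is
# ln(3)-DP -- yet the analyst can still unbias the aggregate count.
import random

def randomized_response(truth):
    if random.random() < 0.5:     # first coin: heads -> answer truthfully
        return truth
    return random.random() < 0.5  # tails -> answer with a second coin

def estimate_fraction(reports):
    # E[report] = 1/4 + truth_fraction/2, so invert that relation
    return 2 * (sum(reports) / len(reports)) - 0.5

random.seed(42)
population = [True] * 300 + [False] * 700   # 30% true "yes"
reports = [randomized_response(t) for t in population]
print(estimate_fraction(reports))  # an unbiased estimate, typically near 0.30
```

No individual report reveals much about its respondent, yet the group statistic survives -- exactly the utility/privacy trade-off DP formalizes.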

To overcome the fundamental challenge -- the trade-off between utility of the data (or accuracy of returned answers) and privacy of the records -- the goal is to learn as much as possible about a group's data while revealing as little as possible about any individual within the group. Transforming an algorithm into a DP-algorithm requires probabilistic tools. The sensitivity of a function $f$ is a good measure of how much statistical noise is needed to mask an answer: $$\Delta f=\max_{x_1 \sim x_2} ||f(x_{1})-f(x_{2})||_{1} = \max_{x_1 \sim x_2} \sum_{i=1}^d |f(x_{1})_i-f(x_{2})_i|.$$ Low sensitivity of a function, i.e. a small change of output given two neighboring inputs, allows adding statistical noise to achieve privacy without destroying utility. The Laplace mechanism adds noise from the Laplace distribution $\mathcal{L}(\lambda)$, i.e. noise $\eta(x)\propto \exp(-|x|/\lambda)$, which has zero mean and scale $\lambda$. Substituting the Laplace noise with other probability distributions, such as a zero-mean Gaussian, is possible but changes the proof details.
A typical construction now is, instead of computing $f(x)$ directly, to compute $\mathcal{A}(x) = f(x) + \eta(x)$ and obtain an $\epsilon = \frac{\Delta f}{\lambda}$-DP algorithm, since even in the worst case the privacy loss between two neighboring databases does not exceed this $\epsilon$. In terms of composability, sequential composition of differentially private steps leads to the sum of all occurring $\epsilon_i$, while parallel composition leads to the maximum of all occurring $\epsilon_i$. This makes it possible to turn algorithms into DP-algorithms efficiently.
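
As a small illustration of the Laplace mechanism, consider a counting query, whose sensitivity is $\Delta f = 1$; the records and $\epsilon$ below are arbitrary illustrative choices:

```python
# Sketch of the Laplace mechanism for a counting query.  A count has
# sensitivity Delta f = 1 (adding or removing one record changes it by
# at most one), so adding Laplace noise of scale 1/epsilon yields an
# epsilon-DP answer.
import random

def laplace_noise(scale):
    # Sample Laplace(scale) as the difference of two exponentials
    # (standard library only, no numpy needed)
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def dp_count(records, predicate, epsilon):
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)  # sensitivity 1

random.seed(7)
ages = [23, 35, 41, 29, 52, 60, 31, 47]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=0.5)
print(noisy)  # the true count 4, masked by Laplace noise of scale 2
```

Asking the same query repeatedly averages the noise away -- which is exactly why sequential composition charges the budget $\epsilon$ once per query.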

These basic constructions, including detailed proofs, and much more were covered during the 7th BIU Winter School on Cryptography, "Differential Privacy: From Theory to Practice", featuring speakers who defined the field and have contributed to it for roughly 10 years. The slides are already online and video recordings are about to appear on the webpage.
Furthermore, relationships of DP to various, related scientific fields ranging from statistics to machine learning and finally game theory were explored.

Concluding, in the wake of growing awareness of privacy issues in the digital domain, together with stricter interpretation of legislation and the possibility of satisfying most interests with anonymized data anyway, several big players strive to provide differentially private collection of data. Some companies market themselves as quasi-pioneers in privacy topics, and for good reasons: it pays off to be perceived as the first one; they would face various problems in the near future anyway if they did not respect these issues; and most importantly, they can continue their business model of creating value from users' data. The more information is queried from the database, the more statistical noise has to mask the correct answer in order to meet a predefined privacy budget bound. This total allowed, justifiable privacy leakage can be specified in the number of admissible queries or in the answer accuracy.

Provable cryptography avoids the situation of mere obfuscation that can be undone by a clever enough attacker or strategy -- given that the security assumption holds -- and provides bounds, and thus a guideline, on how to choose parameters to guarantee a desired level of privacy. Algorithms invest a given privacy budget at privacy-critical steps. With this in mind, differential privacy is an additional design paradigm to keep in mind for cryptographic algorithms and protocols.

I'd like to end this cryptologic point of view on achieving privacy goals on the internet as I started: with a fundamental sociological question. One thought that stands out is: shall we collect so much data in the first place? Is it really necessary for online merchants to predict individuals? Do we want this ubiquitous tracking? As with other advanced technologies whose long-term effects cannot be predicted, perhaps with aggregating big data and tracking, too, the only winning move is not to collect the data in the first place.

Friday, January 27, 2017

Symmetric Key Ciphers for FHE

In recent years, small devices with limited storage and computing power have acquired more and more popularity. This phenomenon, together with the appearance of cloud services that offer extensive storage and high computing power, has made computation outsourcing a key application to study. This setting, where a user with limited resources stores data on a remote server and also asks this server to make some computation on his behalf, comes together with new challenges and concerns related to privacy and security.

A simple scenario

The user Alice encrypts her data under a homomorphic encryption scheme and sends the ciphertexts to the server (e.g. the cloud). The server then computes some function (e.g. the mean value) on Alice's encrypted data and sends back an encryption of the result, which Alice can decrypt with the appropriate key. Despite its simplicity, this model still requires a lot of effort from the user: all the FHE schemes we know of have very large keys and a heavy encryption overhead. The most natural approach is then to free the user from the burden of complex homomorphic computations by asking her to do only the key generation, the symmetric encryption, and the final decryption, while leaving all the rest to the powerful server.
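
To make the flow concrete, here is a deliberately insecure toy sketch: the "scheme" below (message plus a random multiple of a secret modulus) happens to be additively homomorphic and only illustrates the message flow of the scenario; a real deployment would use an actual FHE scheme such as BGV or BFV.

```python
# Toy sketch of the outsourcing flow: Alice encrypts, the server computes
# on ciphertexts only, Alice decrypts the result.  The scheme is NOT
# secure -- it exists purely to show who does what.
import random

class ToyHE:
    def __init__(self, modulus=2**61 - 1):
        self.q = modulus                      # Alice's secret
    def encrypt(self, m):
        return m + random.randrange(1, 2**20) * self.q
    def decrypt(self, c):
        return c % self.q                     # valid while messages stay < q

# Alice encrypts her data and ships it to the server
alice = ToyHE()
data = [12, 7, 30, 51]
ciphertexts = [alice.encrypt(m) for m in data]

# The server computes on ciphertexts only: an encrypted sum
encrypted_sum = sum(ciphertexts)  # homomorphic addition = integer addition

# Alice decrypts the result and finishes the (non-homomorphic) division
avg = alice.decrypt(encrypted_sum) / len(data)
print(avg)  # -> 25.0
```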

A hybrid framework

The user Alice generates a secret key $sk^S$ for a symmetric encryption scheme, a pair $(pk^H, sk^H)$ for a homomorphic public key encryption scheme, and sends $pk^H$ to the server together with an encryption of $sk^S$ under $pk^H$. Then, Alice encrypts her data under the symmetric encryption scheme and sends the resulting ciphertexts to the server, which homomorphically evaluates the decryption circuit of the symmetric encryption scheme, thus obtaining ciphertexts under the homomorphic scheme. This last operation can be seen as a sort of bootstrapping (from one scheme to another one), but does not involve any circular security assumption. At this point, the server can homomorphically compute a certain function $f$ on the resulting ciphertexts and send back the result, encrypted under the homomorphic encryption scheme. Alice can finally recover the result thanks to her $sk^H$.

Despite numerous improvements in recent years, FHE remains highly impractical, and one of the main problems is its limited homomorphic capacity. All the FHE schemes we know of contain a noise term that grows with homomorphic operations; the homomorphic capacity corresponds to the amount of computation that can be performed before having to "refresh" the ciphertext via the expensive operation of bootstrapping. It is thus of vital importance to manage the noise growth during homomorphic operations.

State-of-the-art ciphers

AES is a natural benchmark for FHE implementations. The homomorphic evaluations of some lightweight block ciphers such as SIMON have been investigated by Lepoint and Naehrig as well. It turns out that there is much room for improvement in optimizing the multiplicative depth. In this direction, many dedicated symmetric key ciphers with low-noise ciphertexts have been proposed.

  • In 2015, Albrecht et al. presented a flexible block cipher, LowMC, which is based on an SPN structure and comes with $80$-, $128$-, and $256$-bit key variants. The design philosophy is to minimize the multiplicative complexity, which is most relevant in the multi-party computation context; it also has a very low multiplicative depth compared with existing block ciphers.

    Later, Dinur et al. mounted optimized interpolation attacks against some instances of LowMC. More precisely, some $80$-bit key instances could be broken $2^{23}$ times faster than exhaustive search, and all instances claimed to provide $128$-bit security could be broken about $1000$ times faster than exhaustive search. These attacks show that those instances do not provide the expected security level; the cryptanalysis also led to a revised version of LowMC.
  • Kreyvium, designed by Canteaut et al., is a stream cipher aiming at $128$-bit security. It is based on the lightweight stream cipher Trivium, which has $80$-bit security. Trivium and Kreyvium both achieve a multiplicative depth similar to that of the LowMC instances; they have a smaller latency than LowMC, but a slightly smaller throughput.
  • The FLIP family of stream ciphers proposed by Méaux et al. has the lowest depth of the ciphers above. Several versions are provided, including $80$-bit and $128$-bit security instantiations. The main design principle is to filter a constant key register with a time-varying public bit permutation, which allows for small and constant noise growth. However, FLIP requires a much higher number of ANDs per encrypted bit than the above ciphers.

To sum up, blending symmetric ciphers and FHE is definitely an interesting idea that could prove fundamental for bringing homomorphic encryption to the real world.

The other fellows also made contributions to this blog post.

Monday, January 16, 2017

WhatsApp backdoor? Should I be worried?

WhatsApp Backdoor!!! 

Let's start this blog entry with that title: it is very easy to get readers' attention (especially in the security community) with big bold letters claiming that there is a backdoor. I say this because that is exactly what happened on January 13th (Friday the 13th -- maybe it is correlated) when The Guardian released the news. Afterwards, security websites and the wider community started a big debate about it. In fact, some hours after the Guardian's release, other websites such as Gizmodo and the Open Whisper Systems blog started saying that it isn't a backdoor.

Well, we will draw some conclusions about the subject later. First, let's explore it in more detail.

What is a backdoor?

If we talk about computer systems or cryptography, a common definition is: "A backdoor is a method, often secret, of bypassing normal authentication in a product, computer system, cryptosystem, algorithm, etc. Backdoors are often used for securing unauthorized remote access to a computer, or obtaining access to plaintext in cryptographic systems." In other words, it is an easy way to obtain access.

Are backdoors common?

Unfortunately, yes -- they appear in the news and in computer systems more often than we would like. Moreover, after the Snowden revelations the term backdoor has become very common outside academia. However, that does not mean that people have become more worried.

What is the real problem?

Perfect, we have learned about backdoors. Now, let's explain the problem with WhatsApp. Consider a hypothetical case: imagine that for some reason (it fell into a lake, you were abducted by E.T. and forgot your phone in the spaceship) you no longer have access to your smartphone. Fortunately, you can recover your phone number, buy a new smartphone, and reinstall WhatsApp. When you reinstall WhatsApp, however, it generates a new key, and now there is the question of what the app should do with the messages you received while you were offline. The app could simply drop those messages and never deliver them, or it could ask your contacts to re-encrypt them with the new key. According to the Guardian, this second approach is the backdoor:
"WhatsApp’s implementation automatically resends an undelivered message with a new key without warning the user in advance or giving them the ability to prevent it."

I would personally say that it is a mistake that WhatsApp merely shows a notification that the key changed and resends automatically.

Is it exploitable?

I will say: yes, it is.
Let's see how it can be exploited.

In fact, it consists in exploiting another vulnerability, present in the SS7 protocol. Signalling System No. 7 (SS7) is a set of telephony signaling protocols developed in 1975, used to set up and tear down most of the world's public switched telephone network (PSTN) telephone calls. It also performs number translation, local number portability, prepaid billing, Short Message Service (SMS), and other mass-market services.

The SS7 vulnerabilities have been known since 2008; however, they only got media attention in 2014, when Tobias Engel showed at CCC how to exploit them. You can see exactly how in this link. In the video he shows that it is possible to take over another cellphone's number. In the same congress, right after Tobias, Karsten Nohl showed how to read other people's SMS; the talk can be followed in this link.

What is the SS7 vulnerability?

Let's look very quickly at the structure of a GSM network, shown in the next figure.

So, if you are inside the SS7 network, you can simply start listening to the traffic and, as we saw in the videos, it is possible to change things such as your number.

Since it is possible to change your number, and WhatsApp's verification can be done by a call or an SMS, what happens if I gain access to the SS7 network, change my number to any other number, reinstall WhatsApp, and pass the verification?

According to WhatsApp's policy, my smartphone will generate a new key and all new messages will be encrypted with this new key. I did not create a new attack here; it has been done before, as this YouTube video shows:

It is fair to say that this attack can be used against other messengers as well. However, other apps have a different policy; e.g. Signal drops the messages until you verify the new key.

If you are interested in how different messengers handle this key problem, there is a nice blog entry on Medium.


The news from the Guardian is not a real backdoor; it is a known problem of key handling, and the story was more like click bait. However, the good thing about this whole story is the discussion it sparked in different communities, and the alert to people to take care of their privacy.

All of this prompts some social and technical questions, such as:
Should we accept news from the media as the truth?
Why on earth do we continue to use a system from 1975?
Can we create a better way to handle keys?

In addition, if you are looking for more privacy in your life, you should check Privacy Tools, in particular the section about "Encrypted Instant Messengers".