Sunday, February 19, 2017

Differential privacy


These days one reads a lot about the right to privacy. But what is it, and how does it differ between the physical and the digital world? Briefly, it is a person's right to have and, more importantly, maintain control over information about oneself, which is fundamental to the human freedom of self-determination. In the digital context this right has seemingly been given up in favor of using free services: handy tools that satisfy people's urge to communicate with each other and stay in touch and up-to-date at all times. Since more and more people realize that behind such apps and web pages there are, naturally, business models, society increasingly demands to reclaim control over collected personal data, which has been provided, unsolicited or involuntarily, to online merchants or telephony service providers at the time of use, for example. On the other hand, collecting user information in databases is crucial from the point of view of the corporations offering these services: it lets them build good products, make tailored recommendations, and, especially in the age of big data and machine learning, derive precise predictions by evaluating various functions on the data.

Statistical database queries have been studied for quite some time now, and it turns out that it is often sufficient to allow query access only to a population's aggregate data, not individual records, to derive useful statistics and achieve the desired functionality! A common approach is to allow only aggregate queries (i.e. range, histogram, average, standard deviation, ...) and, rather than returning exact answers about sensitive data fields, to specify intervals or give imprecise, statistically noisy counts.

You might have asked yourself: don't cryptographic techniques that have been omnipresent in this blog, such as FHE (Fully Homomorphic Encryption) or secure MPC (Multi-Party Computation), solve this problem? Wouldn't it be possible to encrypt the user data and yet let the service provider compute useful statistics on it?
Indeed, it could be realized with the general FHE/MPC toolkit, but it is currently too inefficient to operate at the scale required in practice, where statistics over huge quantities of data are needed to infer statements about a given database. Hence specific, more slender tools have been designed to close this gap. Whereas FHE avoids a trusted third party that computes (arbitrary functions or sophisticated statistics) on users' sensitive data, here one typically allows a trusted third party to collect and aggregate data explicitly, in a privacy-preserving fashion. Users might agree to this, e.g. when installing an app, because they see an advantage for themselves, like good default settings or an overall performance gain; or it might be a requirement to share information in order to use the service for free in the first place.

Differential privacy (often abbreviated DP) is a framework for formalizing privacy in statistical databases. It can protect against so-called de-anonymization techniques that try to identify an individual record by linking two separately released databases that have been stripped of (quasi-)identifiers and look innocuous. In particular, a priori knowledge or a known partial history can be leveraged to derive more information from a released "anonymized dataset" than the purpose it was originally intended to serve.

Let's look at a mathematical definition that captures and formalizes the notion of privacy and which has been studied in cryptography over the past 10 years. Let $d,n$ be positive integers and $f: X^n \rightarrow \mathbb {R} ^{d}$ some statistic on a database comprised of $n$ records.

An algorithm $\mathcal {A}: X^n \rightarrow \mathbb {R} ^{d}$ that computes $f$ is said to be $(\epsilon, 0)$-differentially private (or $\epsilon$-DP, for short) if for all neighboring databases $x_{1} \sim x_{2}$ (i.e. $x_{1} \neq x_{2}$ differing in exactly one record) and all subsets $S \subseteq \mathbb {R} ^{d}:$
\begin{align}
\mathbb{P}[\mathcal{A}(x_{1})\in S]\leq e^{\epsilon } \cdot \mathbb{P}[\mathcal{A}(x_{2})\in S]
\end{align}
holds. Looking more closely at this definition, for all $x_{1}, x_{2} \in X^n$ with $x_{1} \sim x_{2}$,
$$\mathbb{P}[\mathcal{A}(x_{1})\in S]
\leq e^{\epsilon } \, \mathbb{P}[\mathcal{A}(x_{2})\in S] \Leftrightarrow \frac{\mathbb{P}[\mathcal{A}(x_{1})\in S]}{\mathbb{P}[\mathcal{A}(x_{2})\in S]} \leq e^{\epsilon } \Leftrightarrow \log \left(\frac{\mathbb P[\mathcal{A}(x_{1})\in S]}{\mathbb{P}[\mathcal{A}(x_{2})\in S]}\right) \leq \epsilon,$$
we can identify the so-called "privacy loss" of an algorithm (in this context often called a mechanism) $\mathcal {A}$. In this setting $\epsilon$ is called the privacy budget. In less exact terms it captures the following: by specifying the privacy budget, it is possible to control the level of privacy and make an algorithm respect this additional constraint using the techniques introduced below.
For those familiar with the concept of max-divergence, the definition of privacy loss is in fact the definition of $$D_\infty( \mathcal{A}(x_1) \,||\, \mathcal{A}(x_2)) := \max_{S\subseteq \operatorname{supp}(\mathcal{A}(x_1))} \log \left(\frac{\mathbb P[\mathcal{A}(x_{1})\in S]}{\mathbb{P}[\mathcal{A}(x_{2})\in S]} \right).$$
Furthermore, the multiplicative factor $e^\epsilon$ can be -- using a common approximation for small $\epsilon<1$ -- viewed as $1+\epsilon$:
$$e^\epsilon = \exp(\epsilon) = 1 + \epsilon + \frac{\epsilon^2}{2!} + \dots \approx 1 + \epsilon.$$
In less formal terms this definition says that the distribution of a given result is approximately the same whether it is computed from the first or from the second, neighboring database.
A more general definition, which adds flexibility -- but also makes the proofs less elegant and more technical -- is $(\epsilon, \delta)$-differential privacy, which requires
$$\mathbb P[{\mathcal {A}}(x_{1})\in S]\leq e^{\epsilon } \cdot \mathbb P[{\mathcal {A}}(x_{2})\in S] + \delta.$$
Interpreting the definition, the goal of DP is that the risk to one's privacy should not substantially increase as a result of appearing (or not appearing) in a statistical database. Thus an analyst should not be able to learn anything about a record (e.g. a participant of an online questionnaire) that could not have been learned if that person had opted not to participate, or had answered the questions randomly by rolling a die or flipping a coin rather than answering truthfully.
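As a concrete illustration of the coin-flipping idea, here is a minimal sketch (in Python, my own toy example rather than anything from the original post) of the classic randomized response mechanism; a short calculation shows it is $\ln(3)$-DP.

```python
import math
import random

def randomized_response(true_answer: bool) -> bool:
    """Classic coin-flip mechanism: answer truthfully on heads,
    otherwise answer with a second, independent coin flip."""
    if random.random() < 0.5:      # first flip: heads -> tell the truth
        return true_answer
    return random.random() < 0.5   # tails -> random answer

# P[report "yes" | truth "yes"] = 3/4 and P[report "yes" | truth "no"] = 1/4,
# so the worst-case probability ratio is 3 and the mechanism is ln(3)-DP.
epsilon = math.log(3)
print(f"randomized response is {epsilon:.2f}-DP")
```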

To overcome the fundamental challenge -- the trade-off between the utility of the data, or the accuracy of returned answers, and the privacy of records -- the goal is to learn as much as possible about a group's data while revealing as little as possible about any individual within the group. Transforming an algorithm into a DP-algorithm requires probabilistic tools. The sensitivity of a function $f$ is a good measure of how much statistical noise is needed to mask an answer: $$\Delta f=\max_{x_1 \sim x_2} ||f(x_{1})-f(x_{2})||_{1} = \max_{x_1 \sim x_2} \sum_{i=1}^d |f(x_{1})_i-f(x_{2})_i|.$$ Low sensitivity of a function, i.e. a small change of output given two neighboring inputs, allows adding statistical noise to achieve privacy without losing too much utility. The Laplace mechanism adds noise from the Laplace distribution $\mathcal L(\lambda)$, i.e. noise $\eta(x)\propto \exp(-|x|/\lambda)$ with zero mean and scale parameter $\lambda$. Substituting the Laplace noise with another probability distribution, such as a zero-mean Gaussian, would be possible, but it influences the proof details.
A typical construction now is, instead of computing $f(x)$ directly, to compute $\mathcal A(x) = f(x) + \eta(x)$ with $\eta$ drawn from $\mathcal L(\lambda)$, obtaining an $\epsilon = \frac{\Delta f}{\lambda}$-DP algorithm, since even in the worst case the output distributions on two neighboring databases differ by at most this $\epsilon$. In terms of composability, sequential composition of differentially private steps yields the sum of all occurring $\epsilon_i$, while parallel composition (on disjoint parts of the data) yields the maximum of all occurring $\epsilon_i$. This allows algorithms to be turned into DP-algorithms efficiently.
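To make the construction concrete, here is a minimal sketch of the Laplace mechanism applied to a simple counting query; the function names and parameters are illustrative assumptions of mine, not code from any particular library.

```python
import numpy as np

def laplace_mechanism(database, f, sensitivity, epsilon, rng=None):
    """Return f(database) perturbed with Laplace noise of scale
    lambda = sensitivity / epsilon, giving an epsilon-DP answer."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    true_answer = np.atleast_1d(np.asarray(f(database), dtype=float))
    noise = rng.laplace(loc=0.0, scale=scale, size=true_answer.shape)
    return true_answer + noise

# Example: a counting query ("how many records exceed 50?").
# Adding or removing one record changes the count by at most 1,
# so the L1-sensitivity is Delta_f = 1.
db = np.array([23, 67, 45, 89, 12, 55])
count_over_50 = lambda x: np.sum(x > 50)
noisy_count = laplace_mechanism(db, count_over_50, sensitivity=1.0, epsilon=0.5)
print(noisy_count)
```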

These basic constructions, including detailed proofs, and much more were covered during the 7th BIU Winter School on Cryptography, titled "Differential Privacy: From Theory to Practice", featuring speakers who defined the field and have contributed to it for roughly 10 years. Slides are already online and video recordings are about to appear on the webpage.
Furthermore, relationships of DP to various related scientific fields, ranging from statistics to machine learning and game theory, were explored.

Concluding, in the wake of the growing awareness of privacy issues in the digital domain, together with stricter interpretation of legislation and the fact that most interests can be satisfied with anonymized data anyway, several big players strive to provide differentially private collection of data. Some companies market themselves as quasi-pioneers in privacy topics for several reasons: it pays off to be perceived as the first one; they would face various problems in the near future anyway if they did not respect these issues; and most importantly, they can continue their business model of creating value from users' data. The more information is queried from the database, the more statistical noise has to mask the correct answer in order to stay within a predefined privacy budget. This total allowed, justifiable privacy leakage can be specified in terms of the number of admissible queries or the accuracy of the answers.
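A quick back-of-the-envelope sketch, assuming the sequential composition rule and the Laplace mechanism from above, shows how a fixed budget translates into noisier answers as the number of queries grows:

```python
# Sequential composition: the epsilons of the individual answers add up,
# so a fixed total budget must be split across all queries.
total_budget = 1.0                          # assumed overall epsilon
queries = 10
eps_per_query = total_budget / queries      # 0.1 per answer
sensitivity = 1.0                           # e.g. a counting query
noise_scale = sensitivity / eps_per_query   # Laplace scale lambda = 10
print(f"each answer gets epsilon {eps_per_query}, noise scale {noise_scale}")
```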

Provable cryptography avoids the situation of mere obfuscation that can be undone by a clever enough attacker or strategy -- as long as the security assumption holds -- and provides bounds, and thus a guideline, on how to choose parameters to guarantee a desired level of privacy. Algorithms invest a given privacy budget at privacy-critical steps. With this in mind, differential privacy is an additional design paradigm to keep in mind for cryptographic algorithms and protocols.

I'd like to end this cryptologic point of view on achieving privacy goals on the internet as I started: with a fundamental sociological question. One thought that stands out is: should we collect so much data in the first place? Is it really necessary for online merchants to predict individuals? Do we want this ubiquitous tracking? As with other advanced technologies whose long-term effects cannot be predicted, perhaps also in aggregating big data and tracking the only winning move is not to collect the data in the first place.

Friday, January 27, 2017

Symmetric Key Ciphers for FHE

In recent years, small devices with limited storage and computing power have become more and more popular. This phenomenon, together with the appearance of cloud services that offer extensive storage and high computing power, has made computation outsourcing a key application to study. This setting, where a user with limited resources stores data on a remote server and also asks this server to perform some computation on their behalf, comes together with new challenges and concerns related to privacy and security.


A simple scenario

The user Alice encrypts her data under a homomorphic encryption scheme and sends the ciphertexts to the server (e.g. the cloud). Now the server computes some function (e.g. the mean value) on Alice's encrypted data and sends back an encryption of the result, which can be decrypted by Alice with the appropriate key. Despite its simplicity, this model still requires a lot of effort from the user. In fact, all the FHE schemes that we know of have very large keys and a heavy encryption overhead. The most natural approach is then to free the user from the burden of complex homomorphic computations, by asking her to do only the key generation, the symmetric encryption and the final decryption, while leaving all the rest to the powerful server.


A hybrid framework

The user Alice generates a secret key $sk^S$ for a symmetric encryption scheme, a pair $(pk^H, sk^H)$ for a homomorphic public key encryption scheme, and sends $pk^H$ to the server together with an encryption of $sk^S$ under $pk^H$. Then, Alice encrypts her data under the symmetric encryption scheme and sends the resulting ciphertexts to the server, which homomorphically evaluates the decryption circuit of the symmetric encryption scheme, thus obtaining ciphertexts under the homomorphic scheme. This last operation can be seen as a sort of bootstrapping (from one scheme to another one), but does not involve any circular security assumption. At this point, the server can homomorphically compute a certain function $f$ on the resulting ciphertexts and send back the result, encrypted under the homomorphic encryption scheme. Alice can finally recover the result thanks to her $sk^H$.
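The following is a purely structural sketch of this hybrid flow. The mini "schemes" below are toy stand-ins of my own (no real cryptography, no real library API); they only make explicit who holds which keys and ciphertexts.

```python
# Toy, non-cryptographic stand-ins that only model the data flow.
def fhe_keygen():          return "pk_H", "sk_H"
def fhe_encrypt(pk, m):    return {"fhe": True, "plain": m}   # opaque to the server
def fhe_decrypt(sk, ct):   return ct["plain"]

def sym_keygen():          return 42                          # toy key
def sym_encrypt(k, m):     return m ^ k                       # toy "cipher" on ints
def sym_decrypt(k, c):     return c ^ k

# --- Alice ---
pk_H, sk_H = fhe_keygen()
sk_S = sym_keygen()
enc_skS = fhe_encrypt(pk_H, sk_S)                    # Enc_H(sk_S), sent once
sym_cts = [sym_encrypt(sk_S, x) for x in [3, 5, 7]]  # cheap symmetric uploads

# --- Server ---
# Step 1: homomorphically evaluate the symmetric decryption circuit, turning
# each symmetric ciphertext into an FHE ciphertext of the same plaintext
# (mocked here: a real server never sees the intermediate plaintexts).
fhe_data = [fhe_encrypt(pk_H, sym_decrypt(enc_skS["plain"], c)) for c in sym_cts]
# Step 2: evaluate the requested function f, e.g. a sum, under encryption.
fhe_result = fhe_encrypt(pk_H, sum(ct["plain"] for ct in fhe_data))

# --- Alice ---
assert fhe_decrypt(sk_H, fhe_result) == 3 + 5 + 7
```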

Despite numerous improvements in recent years, FHE remains highly impractical, and one of the main problems is its limited homomorphic capacity. In fact, all the FHE schemes we know of contain a noise term that grows with homomorphic operations. The homomorphic capacity corresponds to the amount of computation that can be performed before having to "refresh" the ciphertext via the expensive operation of bootstrapping. It is thus of vital importance to manage the noise growth during homomorphic operations.


State-of-the-art ciphers

AES is a natural benchmark for FHE implementations. The homomorphic evaluations of some lightweight block ciphers such as SIMON have been investigated by Lepoint and Naehrig as well. It turns out that there is much room for improvement in optimizing the multiplicative depth. In this direction, many dedicated symmetric key ciphers with low-noise ciphertexts have been proposed.

  • In 2015, Albrecht et al. presented a flexible block cipher, LowMC, which is based on an SPN structure and comes in $80$-, $128$- and $256$-bit key variants. The design philosophy is to minimize the multiplicative complexity, which is more relevant in the multi-party computation context. It also has a very low multiplicative depth compared with existing block ciphers.

    Later, Dinur et al. mounted optimized interpolation attacks against some instances of LowMC. More precisely, some $80$-bit key instances could be broken $2^{23}$ times faster than exhaustive search, and all instances claimed to provide $128$-bit security could be broken about $1000$ times faster than exhaustive search. These attacks show that those instances do not provide the expected security level. The cryptanalysis also led to a revised version of LowMC.
  • Kreyvium, designed by Canteaut et al., is a stream cipher aiming at $128$-bit security. It is based on the lightweight stream cipher Trivium, which has $80$-bit security. Trivium and Kreyvium both achieve a multiplicative depth similar to that of the LowMC instances. They have a smaller latency than LowMC, but a slightly smaller throughput.
  • The FLIP family of stream ciphers proposed by Méaux et al. has the lowest depth compared with the previous ciphers. Several versions are provided, including $80$-bit and $128$-bit security instantiations. The main design principle is to filter a constant key register with a time-varying public bit permutation, which allows for small and constant noise growth. However, FLIP has a much higher number of ANDs per encrypted bit than the above ciphers.


To sum up, blending symmetric ciphers and FHE is definitely an interesting idea that could prove fundamental for bringing homomorphic encryption in the real world.

The other fellows also made contributions to this blog post.

Monday, January 16, 2017

WhatsApp backdoor? Should I be worried?

WhatsApp Backdoor!!! 

Start a blog entry with that title and it is very easy to get the reader's attention (especially in the security community): big bold letters saying that there is a backdoor. I say this because it is exactly what happened on January 13th (Friday the 13th, maybe that is no coincidence) when The Guardian released the news. After that, the security community, websites and others started a big debate about it. In fact, some hours after the release from The Guardian, other websites started saying that it isn't a backdoor, such as Gizmodo, the Open Whisper Systems blog and others.

Well, we will give some "conclusions about the subject" later. First, let's explore the topic a bit more.

What is a backdoor?

If we talk about computer systems or cryptography, we are going to have something like this definition: "A backdoor is a method, often secret, of bypassing normal authentication in a product, computer system, cryptosystem or algorithm etc. Backdoors are often used for securing unauthorized remote access to a computer, or obtaining access to plaintext in cryptographic systems." In other words, it is an easy way to obtain access.

Are backdoors common?

Unfortunately, yes: they appear in the news and in computer systems more often than we would like. Also, after the Snowden revelations the term backdoor became very common outside academia. However, that does not mean that people became more worried.

What is the real problem?

Perfect, we learned about backdoors. Now, let's explain the problem with WhatsApp. Consider a hypothetical case: imagine that for some reason (it fell into a lake, you were abducted by E.T. and forgot your phone in the spaceship) you no longer have access to your smartphone. Fortunately, you can recover your phone number, buy a new smartphone and reinstall WhatsApp. However, when you reinstall WhatsApp, it generates a new key, and now there is the problem of what the app should do with the messages you received while you were offline. The app could just drop those messages and never deliver them, or it could have your contacts re-encrypt them with the new key. According to The Guardian, this second approach is the backdoor:
"WhatsApp’s implementation automatically resends an undelivered message with a new key without warning the user in advance or giving them the ability to prevent it."

I, personally, would say it is a mistake that WhatsApp automatically resends the messages and merely shows a notification that the key changed.

Is it exploitable?

I will say: yes, it is.
Let's see how it can be exploited.

In fact, it consists in exploiting another vulnerability, present in the SS7 protocol. Signalling System No. 7 (SS7) is a set of telephony signaling protocols developed in 1975, which is used to set up and tear down most of the world's public switched telephone network (PSTN) telephone calls. It also performs number translation, local number portability, prepaid billing, Short Message Service (SMS), and other mass-market services.

SS7 vulnerabilities have been known since 2008. However, they only got media attention in 2014, when Tobias Engel showed at CCC how SS7 vulnerabilities can be exploited. You can check exactly how in this link. In the video, he shows that it is possible to change the phone number between cellphones. At the same congress, right after Tobias, Karsten Nohl showed how to read other people's SMS; the talk can be followed in this link.

What is the SS7 vulnerability?

Let's take a very quick look at the structure of a GSM network, shown in the next figure.


So, if you are inside the SS7 network, you can just start listening to the traffic, and as we saw in the videos, it is possible to change some things, such as your number.

Since it is possible to change your number, and WhatsApp's verification can be done by call or SMS: what happens if I have access to the SS7 network, change my number to any other number, reinstall WhatsApp and pass the verification?

According to WhatsApp's policy, my smartphone will generate a new key and all the new messages will be encrypted with this new key. I didn't create a new attack here; it has been done before, as this YouTube video shows:


It is worth noting that this attack can be used against other messengers as well. However, other apps have a different policy; e.g. Signal drops the messages until you verify the new key.
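To make the difference in policies explicit, here is a hedged, much simplified sketch (not actual app code; all names are made up) of how a client could handle the undelivered messages when the recipient's key changes:

```python
class KeyChangeError(Exception):
    pass

def toy_encrypt(key, msg):
    return f"enc[{key}]({msg})"        # placeholder, no real cryptography

def resend_pending(pending, new_key, policy, key_verified=False):
    if policy == "whatsapp-like":
        # Re-encrypt under the new key and resend automatically; the sender
        # only gets a key-change notification (if enabled) after the fact.
        return [toy_encrypt(new_key, m) for m in pending]
    if policy == "signal-like":
        # Do not deliver anything until the sender has verified the new key.
        if not key_verified:
            raise KeyChangeError("verify the new key before resending")
        return [toy_encrypt(new_key, m) for m in pending]
    raise ValueError("unknown policy")

print(resend_pending(["hi"], "k2", "whatsapp-like"))
```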

If you are interested in seeing how messengers handle this key problem, there is a nice blog entry on Medium.

Conclusions? 

The news from The Guardian is not about a real backdoor; it is a known problem about handling keys, and the article was more like click bait. However, the good thing about this whole story is the discussions it sparked in different communities and that it alerted people to take care of their privacy.

This whole incident makes us ask some social and technical questions, such as:
Should we accept news from the media as truth?
Why the hell do we continue to use a system from 1975?
Can we create a better way to handle keys?


In addition, I should say that if you are looking for more privacy in your life, you should check Privacy Tools, in particular the section about "Encrypted Instant Messenger".
 

Wednesday, January 11, 2017

RWC 2017 - Post-quantum cryptography in the real-world

A new year takes off and, along with it, thousands of resolutions are formulated. Although I am not the right person to talk about them (my diet will begin next Monday), I wish to discuss a resolution that the cryptographic community as a whole has set for itself in 2017. Because that's what people do at Real World Crypto (RWC): they talk about new trends, topics that could be worth exploring during the new year, directions for their research and interests. This year, for the first time at RWC, post-quantum cryptography (PQC) was given an entire session, a clear sign that times are changing and the moment has come to bring the discussion to the real world. The message is clear: even if quantum computers are not popping up in tomorrow's newspapers, we can't postpone any longer.

A very simple reason for this was given by Rene Peralta, of the NIST PQC team, during the overture of the session: standardisation takes time, up to seven years if we start right now, and full transition takes even longer. I found Rene's presentation to be neat and direct: our public-key cryptography fails against quantum computers and our symmetric one needs some (non-drastic) modifications. The resolution is to "start thinking about it this year, possibly by November 30th, 2017". However, a question arises quite naturally: are we ready?

The other three talks of the session tried to answer in the affirmative. Among the several PQC proposals found in theoretical papers, two made their way into RWC: the well-established lattice-based cryptography and the new-born isogeny-based cryptography, which nevertheless carries the pride and sympathy of ECC.

Lattices and funny names: NewHope and Frodo and Crystals
Lattice-based cryptography has three representatives in the run for PQC schemes. Valeria Nikolaenko presented two of them: the first one is called NewHope and is a key agreement protocol based on the hardness of Ring-LWE. The latter is a problem very favourable to applications because it combines sound theoretical security (a worst-case to average-case reduction) with fast implementations, thanks to specific choices of parameters which allow for speed-ups in the computations: NewHope turns out to be even faster than ECC and RSA, but at the price of larger communication. However, there are some concerns about the security of LWE when the ring structure is added. Thus, Frodo ("take off the ring") is designed to achieve the same goal using only standard LWE. The drawback is a degradation in performance, since the tricks hinted at above cannot be used anymore and keys are generally bigger.
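To give a feeling for what an LWE-based construction looks like, here is a toy, Regev-style bit encryption in Python; the parameters are my own illustrative choices, far too small for any security, and this is not the actual NewHope or Frodo protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, q = 16, 64, 3329          # toy parameters, not secure

# Key generation: secret s and public LWE samples (A, b = A*s + e mod q).
s = rng.integers(0, q, n)
A = rng.integers(0, q, (m, n))
e = rng.integers(-2, 3, m)      # small error term
b = (A @ s + e) % q

def encrypt(bit):
    """Encrypt a single bit by summing a random subset of the LWE samples."""
    r = rng.integers(0, 2, m)               # random 0/1 selection vector
    u = (r @ A) % q
    v = (r @ b + bit * (q // 2)) % q
    return u, v

def decrypt(u, v):
    """Decrypt by checking whether v - <u, s> is closer to q/2 than to 0."""
    d = (v - u @ s) % q
    return int(q // 4 < d < 3 * q // 4)

assert decrypt(*encrypt(0)) == 0 and decrypt(*encrypt(1)) == 1
```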

The third lattice-based scheme was presented by Tancrede Lepoint and is a suite called Crystals. This is based on yet another kind of lattice: module lattices, for which a worst-case to average-case reduction is also known. These are less structured lattices (hence possibly calming down the detractors of the ring structure) in which similar implementation speed-ups are possible: the timing is indeed comparable to NewHope's, while the communication is improved.

"Make elliptic curves great again"
Michael Naehrig presented a new proposal for PQC: do you remember curves with plenty of small subgroups, where the discrete logarithm problem is easy to solve? Now they come in handy again: the curves are considered to be nodes of a graph, whose edges are isogenies of degree 2 and 3 (isogenies being maps between curves that are also group homomorphisms). In this new context, given two curves in the graph, it is difficult to come up with the isogeny linking them. However, such a new approach doesn't really stand out against other solutions: keys are small but performance is not a pro (so to speak).

[The same blog post appeared here.]

Thursday, December 29, 2016

[33C3 TLDR] PUFs, protection, privacy, PRNGs

Pol van Aubel gave an excellent introduction to Physical(ly) Unclonable Functions from both a modern and historical perspective. If you want to know how to exploit intrinsic random properties of all kinds of things, give this talk a shot and don't be afraid of all the footnotes.

Takeaway: PUFs are kinda neat and if you're interested check out this talk.

[33c3 TLDR] Decoding the LoRa PHY


Matt reverse engineered the PHY layer of the proprietary and patented LoRa technology and even wrote a GNU Radio plug-in for it.

He also gave an excellent introduction to the basics of the LoRa technology.
Actually, it was somewhat an introduction to the basics of radio technology in general, and I highly recommend watching his talk.

Video

[33c3 TLDR] Making Technology Inclusive Through Papercraft and Sound

A talk by bunnie introducing their Love to Code program, which teaches children (and adults too) to code using cheap papercraft-inspired devices and tools. The argument is that the diversity problem that exists in coding can maybe be fixed by starting at the earliest stages, i.e., starting with children.

From a technical perspective they had to solve some interesting problems to make the technology fast, cheap and usable. In particular, they have a pretty awesome technique using sound to upload code to the microcontrollers they use.

Takeaway: Pretty cool technology for a societally relevant goal. Check it out.