You may remember Ralph's post on git basics from a little over a year ago. In this post, I'll share three things I've learned are possible (and practically painless) to do with git that go beyond the basic add, merge, push, and pull. Everything in this post is done from the command line.
Scenario 1
You're working on a project with others. These hardworking colleagues of yours just pushed a bunch of commits and you want to see exactly what they changed since your last commit.
What to do
Use git diff. Besides using it to show unstaged changes, you can also use git diff to show changes between any two commits. Suppose that after your last commit, your colleagues made 4 commits. To see the changes, use git diff HEAD~4 HEAD or git diff HEAD^^^^ HEAD. If you don't want to count how many commits there were, you can look up the abbreviated commit IDs (SHA-1 hashes in hex) using git log --oneline. Then, you could use something like git diff 54180a8 129ec44.
There's a lot more to git diff: you can use it to compare specific files, you can show differences at the word level rather than at the line level, etc. If you want to learn more about it, see its page on the git website. If you're working on LaTeX files, git-latexdiff is convenient: given any two commits, it will produce a PDF file showing the changes, with removed text struck through in red and added text underlined in blue.
Scenario 2
You were in the zone, hacking away at multiple issues and successfully completing all of them. Well done, you! Wait—each issue should have its own commit...
What to do
Use interactive staging: run git add -p (or equivalently git add --patch). Git will look at each file you want to add and split the changes into hunks. For each hunk, it will show you the diff and ask you what to do with it:
Stage this hunk [y,n,q,a,d,/,s,e,?]?
Here is the full list of options. They're pretty easy to remember, especially if you're a vi user who is used to navigating with HJKL.
y - stage this hunk
n - do not stage this hunk
q - do not stage this hunk or any of the remaining ones
a - stage this hunk and all later hunks in the file
d - do not stage this hunk or any of the later hunks in the file
g - select a hunk to go to
/ - search for a hunk matching the given regex
j - leave this hunk undecided, see next undecided hunk
J - leave this hunk undecided, see next hunk
k - leave this hunk undecided, see previous undecided hunk
K - leave this hunk undecided, see previous hunk
s - split the current hunk into smaller hunks
e - manually edit the current hunk
? - print help
git add -p is a powerful command that helps you keep your commits reasonably sized. It does require some care, though, since each individual commit should be consistent.
Scenario 3
You have to stop working, but you haven't finished fixing the issue you're working on.
What to do
Don't leave a dirty working directory behind: either record your changes in a commit or shelve them with a stash.
Option 1: commit now, then use git commit --amend (introduced in Ralph's post) once you've finished what you were working on. git commit --amend is useful for a bunch of other things, like adding files you forgot to stage to a commit and fixing typos in a commit message.
Commits are local, so what should you do if you're worried about hardware failure? If you're working on a personal project, it may be acceptable to push this commit and later push the amended commit. You'll need the -f (--force) flag for the latter push. If you're working on a project with others, however, it would be bad practice to amend a commit that you've already pushed. Instead, you could create a temporary personal branch, commit your changes to this branch, and push it to the remote repository. Then, you could push the amended commit to this branch without worrying about rewriting anyone else's history and merge it with the main branch when you've finished fixing the issue.
Option 2: shelve your changes with git stash, which will sweep away both staged and unstaged changes, leaving you with a clean working directory. You can stash changes more than once. To see all stashes, use git stash list. To re-apply the most recent stash you made, use git stash pop. By default, git stash excludes new files (that haven't yet been staged) and ignored files. git stash is also useful when you want to see upstream changes made by someone else, but aren't ready to commit your work.
There's much more you can do with stashes: apply one stash to multiple branches, delete stashes you no longer need, stash parts of a file, etc. Stashes are always local; they can never be pushed to a remote repository. Atlassian has a good, detailed tutorial on git stash.
In this post, I'll summarize three fantastic talks from what was one of my favourite sessions of the ACM CCS Conference last month (session 11B: "Attacks using a little leakage").
The setting common to the three papers is a client-server set-up, where the client outsources the storage of its documents or data to a server that's not entirely trusted.
Instead of using the server just for storage, the client wants to outsource some computations on this data too—keyword searches or database queries, for example.
The problem is to find the right cryptographic algorithms that allow efficiently making these searches and queries while minimizing the information leaked from communication between the client and server, or from the server's computations.
Generic Attacks on Secure Outsourced Databases (paper, talk)
Georgios Kellaris (Harvard University), George Kollios (Boston University), Kobbi Nissim (Ben-Gurion University) and Adam O'Neill (Georgetown University)
The Shadow Nemesis: Inference Attacks on Efficiently Deployable, Efficiently Searchable Encryption (paper, talk)
David Pouliot and Charles V. Wright (Portland State University)
Breaking Web Applications Built On Top of Encrypted Data (paper, talk)
Paul Grubbs (Cornell University), Richard McPherson (University of Texas, Austin), Muhammed Naveed (University of Southern California), Thomas Ristenpart and Vitaly Shmatikov (Cornell Tech)
1. Generic Attacks on Secure Outsourced Databases
This paper presents two attacks, one exploiting communication volume, and one exploiting the access pattern. "Generic" means that they apply to any kind of encryption, not necessarily deterministic, or order-preserving, or even property-preserving.
Setting
outsourced relational database (collection of records, where each record has some number of attributes)
Adversary's goal
reconstruction of the attribute values for all records
Model
database system is static (read-only; queries don't modify records), atomic (each record is encrypted separately), non-storage-inflating (no dummy records), and has records of a fixed length
Assumptions about data
attribute values are ordered (say, numerically or alphabetically)
Assumptions about queries
uniform range/interval queries (for an attribute with $N$ possible values, there are $\binom{N}{2}+N$ possible queries)
Adversary's knowledge
set of possible attribute values
Adversary's capabilities
passive, observe queries and either access pattern (which encrypted records are returned) or communication volume (how many encrypted records are returned)
The big (possibly unreasonable) assumption is that the range queries must be uniform.
However, as the authors point out, the attack model is otherwise weak and the security of an outsourced database shouldn't depend on the query distribution.
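As a quick sanity check of the query count, the closed form $\binom{N}{2}+N$ can be verified by direct enumeration. This is a tiny Python sketch of my own, not from the paper:

```python
from math import comb

def num_range_queries(N):
    """Enumerate interval queries [a, b] with 1 <= a <= b <= N."""
    return sum(1 for a in range(1, N + 1) for b in range(a, N + 1))

# C(N, 2) counts queries with a < b; the extra N counts point queries a == b.
for N in (1, 5, 10):
    assert num_range_queries(N) == comb(N, 2) + N
```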
Attack using access pattern leakage
The adversary observes at least $N^2\cdot \log(N)$ queries, so with high probability, all of the $\binom{N}{2} + N$ queries have occurred (proof: ask a coupon collector).
For each query, it sees which encrypted records were returned.
Suppose the database has $n$ records, and assign a binary indicator vector $\vec{v} \in \{0,1\}^n$ to each query.
A 1 in position $i$ means that the $i$th encrypted record was returned as part of the query results.
The Hamming weight of this vector is the number of matching records.
The attack works as follows.
Find one of the endpoints. Pick a query that returned all but one of the records (i.e., a query whose associated vector has Hamming weight $n-1$). Let $i_1$ be the index of the 0 in its characteristic vector.
For $j=2$ to $n$, find a query whose characteristic vector has Hamming weight $j$, with $j-1$ of the 1's in positions $i_1,\ldots,i_{j-1}$. Let $i_j$ be the index of the other 1 in the vector.
This algorithm puts the encrypted records in order, up to reflection, and is enough for the adversary to reconstruct the plaintext!
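The ordering step is easy to prototype. Here is an illustrative toy of my own (function names and data layout are mine, not the paper's): given the indicator sets of all range queries, it recovers the record order up to reflection.

```python
def all_range_query_sets(positions, N):
    """Indicator sets (which records match) for every range query [a, b]
    over a database whose hidden attribute values are `positions`."""
    return [frozenset(i for i, r in enumerate(positions) if a <= r <= b)
            for a in range(1, N + 1) for b in range(a, N + 1)]

def order_records(vectors, n):
    """Recover the sorted order of the n records, up to reflection."""
    sets = {frozenset(v) for v in vectors}
    # Step 1: a query matching all but one record pins down an endpoint.
    all_but_one = next(s for s in sets if len(s) == n - 1)
    order = [(set(range(n)) - all_but_one).pop()]
    # Step 2: extend one record at a time with a query of weight j that
    # contains exactly the j-1 records placed so far plus one new one.
    for j in range(2, n + 1):
        s = next(t for t in sets if len(t) == j and set(order) < t)
        order.append(next(iter(s - set(order))))
    return order

# Toy database: 4 records with hidden values 2, 5, 6, 9 in the range 1..10.
recovered = order_records(all_range_query_sets([2, 5, 6, 9], 10), 4)
assert recovered in ([0, 1, 2, 3], [3, 2, 1, 0])
```

On the toy database, the sketch returns either the identity order or its reverse, matching the "up to reflection" caveat.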
The paper also describes a reconstruction attack for the case that not all values of the domain occur in the database.
It requires seeing more queries, about $N^4\cdot \log(N)$.
Attack using only communication volume
The main idea of this attack is to determine the distance between "adjacent" values.
The adversary observes $q \geq N^4\cdot \log(N)$ queries.
For each query, it sees how many encrypted records were returned.
In the case that not all of the $N$ possible values occur in the database, the attack works as follows.
(It is much simpler when they do.)
Let $r_i$ be the hidden value of record $i$ (i.e., its position in the range 1 to $N$).
Determine the approximate number of distinct queries that returned a certain number of records.
Let $c_j$ be a count of the number of queries that returned $j$ records, for $0 \leq j \leq n$, so $\sum_{j=0}^n c_j = q$.
Scale all of the $c_j$s by $\frac{N(N+1)}{2}\cdot \frac{1}{q}$.
Let $d_i = r_{i+1} - r_i$ be the difference in position of the $(i+1)$st record and the $i$th record, for $i=1$ to $n-1$, when the $n$ records are sorted.
To keep the notation simple, define $d_0 = r_1$ and $d_n = N + 1 - r_n$.
Note that $c_j = \sum_{i=1}^{n+1-j} d_{i-1}\cdot d_{j + i-1}$ for $j=1$ to $n$, and $c_0 = \frac{1}{2} ( \sum_{i=0}^n {d_i}^2 - (N+1) )$.
Factor a cleverly-constructed polynomial to recover the $d_i$s.
Replace $c_0$ by $2\cdot c_0 + N + 1$.
Let $F(x) = \sum_{i=0}^{n} c_{n-i}\cdot x^{i} + \sum_{i=1}^{n} c_{i}\cdot x^{n+i}$, so that the coefficient of $x^n$ is the (replaced) $c_0$, counted once.
Then $F(x)$ factors as $d(x) \cdot d^R(x)$, where $d(x) = \sum_{i=0}^n d_{i}\cdot x^i$ and $d^R(x) = \sum_{i=0}^n d_{n-i}\cdot x^i$.
Compute the attribute values from the $d_i$s: $r_1 = d_0$ and $r_i = r_{i-1} + d_{i-1}$ for $i=2$ to $n$.
The success of this algorithm depends on $F(x)$ having only 1 factorization into two irreducible polynomials.
Also, since factorization can be slow when there are many records in the database, the authors tested a simple brute-force algorithm for recovering the $d_i$s, and it performed better than factoring in their experiments.
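To see the polynomial identity in action, here is a small self-contained check (my own toy example with made-up values, not the authors' code): it computes the exact counts $c_j$ from a hypothetical set of gaps $d_i$ (i.e., after the scaling step) and verifies that they are exactly the coefficients of $d(x)\cdot d^R(x)$.

```python
def convolve(p, q):
    """Multiply two polynomials given as coefficient lists (low degree first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

# Hypothetical example: n = 4 records with values r = (2, 5, 6, 10) in 1..N.
r, N = (2, 5, 6, 10), 10
n = len(r)
d = [r[0]] + [r[i + 1] - r[i] for i in range(n - 1)] + [N + 1 - r[-1]]

# Counts c_j of range queries returning exactly j records, from the formulas.
c = [0] * (n + 1)
c[0] = (sum(x * x for x in d) - (N + 1)) // 2
for j in range(1, n + 1):
    c[j] = sum(d[i - 1] * d[j + i - 1] for i in range(1, n + 2 - j))

# Build F(x) from the counts, with c_0 replaced by 2*c_0 + N + 1.
c0 = 2 * c[0] + N + 1
F = [c[n - i] for i in range(n)] + [c0] + [c[i] for i in range(1, n + 1)]

# F(x) factors as d(x) * d^R(x).
assert F == convolve(d, d[::-1])
```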
2. The Shadow Nemesis: Inference Attacks on Efficiently Deployable, Efficiently Searchable Encryption
This paper presents attacks on two efficiently deployable, efficiently searchable encryption (EDESE) schemes that support full-text search.
The first scheme they attack is ShadowCrypt, a browser extension that transparently encrypts text fields in web applications without relying on something like client-side JavaScript code.
The second is Mimesis Aegis, a privacy-preserving system for mobile platforms that fits between the application layer and the user layer.
These two EDESE schemes work by appending a list of tags (also called "opaque identifiers") to each message or document, corresponding to the keywords it contains.
You can think of each tag as the output of a PRF with key $k$ applied to the keyword $w$, so $t=PRF_k(w)$.
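For concreteness, here is one way such a tag could be instantiated, with HMAC-SHA256 standing in for the PRF (an illustrative sketch of my own; the actual EDESE constructions differ in their details):

```python
import hashlib
import hmac

def tag(key: bytes, keyword: str) -> str:
    """Deterministic tag t = PRF_k(w), with HMAC-SHA256 playing the PRF."""
    return hmac.new(key, keyword.lower().encode(), hashlib.sha256).hexdigest()

k = b"client-side secret key"  # hypothetical key
# The same keyword always maps to the same tag, across all documents.
# This determinism is exactly what the co-occurrence attacks exploit.
assert tag(k, "budget") == tag(k, "budget")
assert tag(k, "budget") != tag(k, "deadline")
```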
Setting
(web) applications that store sets of documents/messages in the cloud and allow keyword search on these documents
Adversary's goal
determine which keywords are in a document
Model
each document/message has a set of tags corresponding to the keywords it contains
Assumptions about tags
the same keyword occurring in multiple documents yields the same tag
Adversary's knowledge
auxiliary dataset providing frequency of keywords and keyword co-occurrence statistics
Adversary's capabilities
passive, sees encrypted documents/messages and lists of tags
What I found most interesting about this work was that the problem of determining which keywords are associated with which documents was reduced to problems on graphs!
The weighted graph matching problem is the following.
Given two graphs $G=(V_G,E_G)$ and $H=(V_H,E_H)$ on $n$ vertices, each with an edge-weight function $w: E \rightarrow \mathbb{R}^{\geq 0}$, determine the mapping $\sigma: V_G \rightarrow V_H$ that makes the graphs most closely resemble each other.
(This type of "matching" is about matching a vertex in one graph to a vertex in another graph; it has nothing to do with maximal independent sets of edges.)
There are a few different possibilities for what it means for the graphs to "most closely resemble each other"—the one used in the paper is minimizing the Euclidean distance of the adjacency matrix of $G$ and the permuted adjacency matrix of $H$.
The labelled graph matching problem is just an extension of the weighted graph matching problem where each vertex also has a weight.
The two graphs that will be matched to learn which keywords are in which documents are $G$, whose vertices correspond to the $n$ most frequent keywords in the auxiliary data, and $H$, whose vertices correspond to the $n$ most frequent tags in the target data.
The weight of an edge between 2 vertices is the probability that those two tags (or keywords) occur in the same encrypted document (or document in the auxiliary data set).
To form an instance of the labelled graph matching problem, the vertices are assigned weights that correspond to their frequencies in the target data set or their frequencies in the auxiliary data set.
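For intuition, the objective can be written out and brute-forced for a tiny instance. This sketch is mine (the paper uses scalable solvers rather than brute force, which is only feasible for toy sizes):

```python
from itertools import permutations

def best_matching(A_G, A_H):
    """Brute-force the vertex mapping sigma minimizing the Euclidean
    distance between A_G and the permuted adjacency matrix of H."""
    n = len(A_G)
    best, best_cost = None, float("inf")
    for sigma in permutations(range(n)):
        cost = sum((A_G[i][j] - A_H[sigma[i]][sigma[j]]) ** 2
                   for i in range(n) for j in range(n))
        if cost < best_cost:
            best, best_cost = sigma, cost
    return best

# Toy co-occurrence weights (made up): H is G with its vertices relabelled.
A_G = [[0.0, 0.5, 0.1],
       [0.5, 0.0, 0.3],
       [0.1, 0.3, 0.0]]
perm = (2, 0, 1)
A_H = [[0.0] * 3 for _ in range(3)]
for i in range(3):
    for j in range(3):
        A_H[perm[i]][perm[j]] = A_G[i][j]

assert best_matching(A_G, A_H) == perm  # the relabelling is recovered
```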
The authors implemented their weighted graph matching and labelled graph matching attacks on two data sets, based on the 2000-2002 Enron email corpus and the Ubuntu IRC chat logs from 2004-2012.
Their attacks accurately recovered hundreds of the most frequent keywords—see the paper for more details about the results.
And while you're checking it out, read the authors' observation about how critical it is to properly choose Bloom filter parameters when using them to replace the usual inverted index structure in a searchable encryption scheme.
3. Breaking Web Applications Built On Top of Encrypted Data
This paper is cheekily titled to reflect the particular system that it attacks—it's from the paper "Building Web Applications on Top of Encrypted Data Using Mylar".
Mylar is an extension to Meteor, a JavaScript web application platform.
The result is a complete client-server system that claims to protect users' data on the way to and at the server.
Mylar's back-end is an instance of MongoDB, a non-relational database where the data is a collection of documents, and each document has a number of key-value pairs.
The main ingredient in Mylar is multi-key searchable encryption (MKSE), which allows users to share data.
The MKSE scheme used in Mylar was built to satisfy two properties: data hiding and token hiding.
One of the results of this paper is proving by counterexample that a scheme that's both data-hiding and token-hiding does not necessarily provide indistinguishable keyword encryptions and keyword tokens.
One of the things I like about this paper is the taxonomy it introduces for real-world adversarial models.
A snapshot passive attack is a one-time, well, snapshot of the data stored on a server.
A persistent passive attack involves observing all data stored on a server and all operations the server performs during a certain time period.
An active attack is one where anything goes—the server can misbehave or even collude with users.
The main part of the paper evaluates the security of a few Mylar apps—one that was already available (kChat, a chat app), and three open-source Meteor apps that were ported to Mylar.
The three apps are MDaisy, a medical appointment app, OpenDNA, an app that analyzes genomic data to identify risk groups, and MeteorShop, an e-commerce app.
Before summarizing some of the paper's results, it's important to understand principals, which in Mylar are units of access control and have a name and a key pair.
Every document and every user has a principal, and a principal can also apply to multiple documents.
The paper's main findings are grouped into three categories: exploiting metadata, exploiting access patterns, and active attacks.
First, here are some examples of exploiting metadata in Mylar:
The names of principals, which are unencrypted to facilitate verifying keys, can leak sensitive information.
For example, in kChat, the names of user principals and chat room principals are simply usernames or email addresses and the chat room name.
Mylar's access graph, which records the relationships between users, access control levels, and encrypted items, can leak sensitive information. For example, in MDaisy, this access graph could reveal that a particular user (a doctor or other health care professional) regularly creates appointments and shares them with the same other user (a patient). A passive snapshot attacker could combine this leakage with knowledge of the doctor's speciality to infer that a patient is being treated for a certain condition.
The size of a MongoDB document associated to a user principal can leak sensitive information.
In MDaisy, each user, whether staff or patient, has its own document.
However, staff have only their names stored, while patients have additional information stored, such as date of birth.
Exploiting access patterns of searchable encryption is not new, and Mylar didn't claim to hide them, so I won't say anything further about this.
The active attacks, however, are interesting, because Mylar claimed to protect data against an actively malicious server as long as none of the users who can access it use a compromised machine.
This claim is false, and the paper describes attacks that arise from properties such as the server being able to forcibly add users to a "tainted" principal.
After a user is added to a principal, the user's client automatically computes and sends to the server a "delta" value that adjusts search tokens so documents encrypted with different keys can be searched.
Once the malicious server receives a user's delta value for a tainted principal (whose keys it knows), it can then search for any keyword in any of the user's documents!
These three talks are evidence that we still have a long way to go to get secure, usable encryption that preserves some functionality, whether for web applications, outsourced databases, or outsourced documents. It is hard to get things right. Even then, as the authors of the Shadow Nemesis paper point out, careful tuning of the Bloom filter parameters thwarts their attacks on Mimesis Aegis, but there's no reason to believe that it's enough to defend against any other attack. I hope that these recent papers will inspire more secure constructions, not only more attacks.
P.S. There was some good discussion about attack vs. defence papers in the CCS panel discussion on the impact of academic security research (video).
Within the ECRYPT-NET program, we have the chance to learn how to communicate scientific results effectively. Although these methods are often dismissed as "soft skills", implying lesser value than the classical "hard skills", they are versatile: once acquired, they apply to both written and oral presentations.
In particular, at the School on Design for a Secure Internet of Things, our oral presentation training started with practical sessions. Classical topics such as dealing with stage fright and non-verbal communication (body language, for example) were demonstrated, and different ways to know and connect with the audience were addressed.
Among other things, we spoke in detail about posture, use of voice, and articulation on the one hand, and about style, the amount of information per presentation slide, and how to present in a lively way while still delivering a memorable take-home message on the other.
More recently, during the Cloud Summer School in Leuven, we had the chance to actively participate in a practical training session on academic writing: how to convey future research findings concisely yet accurately. Our trainer taught us how to chop up long, unwieldy sentences into more easily digestible ones by reorganizing competing foci and conveying the essence effectively. Worked examples from our area, and even some excerpts from our own texts, definitely helped in understanding the concepts.
There, all of us ECRYPT-NET early-stage researchers gave our (possibly first big) presentations and received feedback from the board and our individual supervisors. The combined feedback, and especially the individual feedback on good choices and less fortunate formulations, was valuable.
As our careers will bring more and more tasks to work on as a group, developing an efficient workflow for joint documents is time and effort well spent. Soon we'll have the chance to acquire more soft skills that can prove useful in such a dynamic setting.
I want to take the opportunity to organize a workshop for us ESRs as a follow-up event to the upcoming Passwords2016 conference. The plan is a day of sessions and meetings in Bochum on Thursday, 8 December:
It'll be co-located with the 11th International Conference on Passwords, where you can listen to the Long Story of Password (Re-)use, learn Password Hardening techniques, debate whether new Password Management strategies based on Episodic Memories or Through Use of Mnemonics are revolutionary, and find out about the Hopes and Fears of Mobile Device Security.
About 100 international participants from academia, companies, ... will come and follow the conference program, which is a mix of classical talks and more practical "hacker sessions" showing attacks in real-world settings.
Unfortunately, no detailed program is available yet, but one is coming soon.
Registration is open now until 2016-11-30.
If you're interested in joining this event, save some € with the early bird offer (available only until 2016-10-31).
All other fellows are warmly invited to come on Wednesday, 7 December, after the Passwords2016 talks, for a first ECRYPT-NET get-together in Bochum and our program tailored to the upcoming writing project.
Thursday, 8 December, is the main day, which we'll start off with a workshop, followed by project-oriented brainstorming and group discussions.
We'll have expert training on collaborative creative writing and group communication (in English), working in a nice atmosphere close to Bochum's botanical gardens.
In short, the focus of this workshop is on how to get things done, collaboratively. The technical side will be covered by a git repository containing the project's skeleton, which I set up on a RUB server; you can get a git account there even as a RUB external.
If you need a refresher, reread Ralph's blog post about git basics.
Let's take a break from pure cryptography for a
moment. Lift your head from the security proof you are trying to come up with,
or from the poly-time factoring algorithm you are coding. Now look around: each
object surrounding you has passed through a lot of stages before becoming the commercial
product you stare at. There is also a good chance that many of them were once ideas in published research papers, probably similar to those we are all planning to write at some point: papers carrying not only academic work but content that, through effort, development, and vision, ended up as objects in our everyday life.
"The Enterprising Researcher: Innovation,
Entrepreneurship and University IP" is a workshop I attended organised by
Bristol Basecamp. It shed light on the process an idea must undergo to become a product, and showed that the word "entrepreneurship" carries a stronger meaning than just "commercialising an idea" (valuable as that is by itself). It
suggests a sense of innovation and creation: like an artist who uses brushes to
outline figures (and more importantly emotions), the entrepreneur combines
resources and ideas to shape new markets and products (from which, most importantly, people's desires arise). This poetic point of view was particularly evident in the
first talk.
Harry Destecroix and Laurent Chabanne are respectively
CEO and CSO of Ziylo, a spin-off of the University of Bristol with expertise in
continuous carbohydrate sensing technology. They shared their experience
in bringing an academic idea to the market: successes, problems, obstacles and
satisfactions are all part of the start-upper's life! Above all, I was impressed by how much passion and dedication they put into their work, so I will often quote their exact words to briefly describe their story, which is really worth listening to.
Their journey began with a scientific discovery (sugar-sensing
platforms using synthetic receptors to isolate glucose-like structures) which
was the outcome of research carried out by Professor Anthony Davis' research group at the University of Bristol. We are in the academic world for now,
hence usual laws governing scientific publications still rule. Entering the
market is different: first of all, it requires money. Since the idea was born
in the university, the first source of help came from the inside. Research and Enterprise Development (RED) provides support and assists the growth of
entrepreneurs from within the university. This is not just the step where you
need money, though; it is also the phase during which awareness of and information about business are vital. As they said, you need to "find complementary skill sets to yours", for instance in someone who is an expert in marketing, because "it costs a lot of money and time to write a business plan" (and it's not something they teach you in school, I may add). Eventually, you need to "stop talking about tech and do marketing". I also enjoyed the human factor around this point, as they suggested starting these kinds of adventures with people you trust, with friends! It reminds us that start-ups, as well as huge corporations, are first of all human communities.
Another crucial point is that the trip is not easy. Now
they are successfully working with SETsquared and Innovate UK, but raising
funding was an issue to face and there might be moments in which the project
seems to be failing: "you need to get used to the fact that you can lose your
job at any time. If you're not ok with that, don't bother". The important thing is to always keep a positive attitude and to have faith in what you're doing; after all, "without downs you don't know when you're up".
Funding is just one obstacle (although quite a steep one to climb) that you might encounter on your way to the market. I would like to mention another one they faced, as there is a very important entrepreneurship lesson to learn from it. Being a biochemistry company, they needed labs: expensive, dedicated facilities, not the kind listed on Rightmove! Instead of merely coping with their specific problem, they identified a potential market and (very recently) founded NS Science, a commercial real estate company focused on providing labs and science facilities to businesses. I would like to stress the moral about "being an entrepreneur" lying behind this example: they didn't just commercialise a service, they shaped the needs of a community. Now that their company is around, other people with similar issues may feel the desire to produce something new thanks to NS Science's facilities. Even better: NS Science may inspire people who had put aside their ideas because of a lack of space, for instance. Let me give a final (rather famous) example of the "needs/desires creation" aspect of entrepreneurship: did you need iPads before they were invented? What changed, apart from the invention of the iPad itself? I believe that, after scratching the surface of possible answers, the outcome could be incredibly surprising.
The second talk took a step back: how to publish research ideas and related legal aspects. Kathryn Smith, Research Engagement
Librarian, focused on the importance of easily available research papers through
green open access, that is to say repositories of manuscripts that differ from
the published version (even just in the formatting, sometimes) so that they can
be freely released. This is motivated by the fact that some entities may have
restricted access to research outcomes (industry, public sector, charity foundations…) and that "you
don't know who's out there: investors, new collaborators, people who could
possibly develop your work in ways you hadn't even thought of". In our specific
field, ePrint is an example of green open access. The University of Bristol also has its own green
open access repository called Pure, whose team also checks for possible legal incompatibilities with the journals papers are published in. In this regard, she pointed
to a really useful search engine for manuscript publication policies of
journals: SHERPA/RoMEO. For example, the picture shows the result of querying "LNCS",
where being a "RoMEO green journal" means that you "can archive pre-print and post-print or
publisher's version/PDF".
The third talk was also focused on legal aspects: Sue
Sunstrom, Head of Commercialisation and Impact Development, answered several
questions about Intellectual Property (IP) and how they relate to the
university. First of all, there exist different kinds of IP (trademark,
copyright, patent, trade secret, design...). They offer different features and
are applicable to different (either concrete or abstract) objects. Since it can be quite expensive, choosing the right form of IP (hence the right
form of protection) is crucial. But defence against competitors is not the only reason why IP matters: investors like it because it is proof that you own something different and possibly innovative. IP makes the object it protects valuable in some situations (think of industrial secrets, for instance) and, like every form of value, it can be used as a currency in commercial trades. The interest of universities in IP on the outcomes of research stems from various motivations: developing practical applications (and hence having an impact on society), attracting funding (for new research, hence possibly new IP, entering a virtuous circle), and strengthening the link with industry are just a few of them.
The last talk was given by Prof Alan Palmer on translational life science research, exemplified by the process of commercialising drugs. Such a topic is closely related to the biomedical field and concerns academic and industrial efforts to develop new solutions for prevention, diagnosis, and therapy. I am intentionally brief here; just have a look at Prof Palmer's LinkedIn profile to get a feeling for who an entrepreneur is!
Back to our beloved crypto: I think this event added a brick to my personal awareness (and, I hope, to yours after this reading) of what is going on out there, following what AvonCrypt had started. It was the perfect occasion to see that these realities exist in our field too (and I have the feeling the field is particularly fertile). Thanks to this event, I understood what probably happened behind the scenes at those companies, and I also found out about another example (my bad I didn't know it before!): there is a start-up in Bristol called KETS, which has recently won a prize and related funding for their use of "quantum cryptography to improve
data encryption, ensuring information is safe in all situations, from bank
transactions to critical infrastructure, and to individuals shopping online
from the comfort of their own home".
As Harry and Laurent said during their talk, "start-upping involves learning", so "take a book and read about business and economics": a suggestion I will certainly follow.
Marco
(Thanks to Matthias, Simon and Marie-Sarah for comments and corrections)
Last Friday, I went to the National Museum of Computing (TNMOC) in Bletchley Park, home of the British Government Code and Cypher School during World War II.
It was hosting a special event to celebrate two new acquisitions: a Lorenz teleprinter recently found on eBay and a Lorenz SZ42 cipher machine on long-term loan from the Norwegian Armed Forces Museum.
TNMOC now has all of the key parts (either original or rebuilt) used in the process of encryption, interception, and decryption of Lorenz messages sent by the German High Command during WWII.
Five women of the WRNS (Women's Royal Naval Service, whose members are often called "wrens") who operated Colossus, and the relatives of others who contributed to breaking Lorenz, attended this special event.
John Whetter, one of the leaders of the team at TNMOC that rebuilt the British Tunny (in the background), holds a Spruchtafel, next to the Lorenz machine on loan from Norway.
The Lorenz cipher
The story of Lorenz, Bill Tutte, and Tommy Flowers is perhaps less well known than the story of Enigma and Alan Turing.
The Enigma machine encrypted messages sent among units of the German army, navy, and air force.
It had 3 or 4 rotors and operated directly on an alphabet of 26 letters, which were then transmitted in Morse code.
The Lorenz SZ42, on the other hand, was custom-built for the German High Command to send the most important strategic messages to its Field Marshals.
It was more complex and less portable than an Enigma machine.
The Lorenz cipher could handle letters, punctuation, and spacing: each character was encoded as 5 bits according to the Baudot code.
The machine had 12 wheels, each with a number of cams (pins) on it.
The numbers of cams on the wheels were pairwise co-prime.
The "key" was in two parts: the starting position of each wheel ("wheel setting"), and the pattern of raised or lowered cams on each wheel ("wheel pattern").
The wheel settings were supposed to be changed for each message, while the wheel patterns were changed infrequently—for the first few years.
When the wheel patterns did begin to change more frequently, however, Colossus II was operational and could find them.
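The two-part key can be illustrated with a toy sketch. The model below is a drastic simplification and NOT the real SZ42 logic (which combined chi-, psi-, and motor wheels): here one hypothetical cam wheel drives each of the 5 Baudot bit channels, and the wheel sizes are made up. It only shows how wheel patterns plus wheel settings yield an additive cipher over 5-bit characters.

```python
from math import gcd

# Toy model of the two-part Lorenz key -- NOT the real SZ42 logic, which
# combined chi-, psi-, and motor wheels. Here one hypothetical cam wheel
# drives each of the 5 Baudot bit channels.
WHEEL_SIZES = [41, 31, 29, 26, 23]   # pairwise co-prime numbers of cams

def make_wheels(patterns, settings):
    """patterns: one raised/lowered cam pattern per wheel (the 'wheel pattern');
       settings: the starting position of each wheel (the 'wheel setting')."""
    return list(zip(patterns, settings))

def keystream(wheels, length):
    """Step every wheel once per character; combine one bit per wheel."""
    out = []
    for t in range(length):
        char = 0
        for bit, (pattern, start) in enumerate(wheels):
            char |= (pattern[(start + t) % len(pattern)] & 1) << bit
        out.append(char)
    return out

def crypt(message, wheels):
    """Encrypt or decrypt a list of 5-bit characters (XOR is an involution)."""
    return [m ^ k for m, k in zip(message, keystream(wheels, len(message)))]
```

Because the wheel sizes are pairwise co-prime, the keystream's period is their product: over 22 million characters even in this toy.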
The entire process, from sending a message to decrypting an intercepted copy, went roughly as follows.
1. Setting up to send the message
The sending operator in Berlin picks six pairs of letters at random from a prepared sheet.
He or she types them into the teleprinter (without the Lorenz machine attached). The output is a paper tape with punched holes corresponding to the Baudot encoding of the letters.
Next, the operator uses a board of wheel settings (a Spruchtafel) to determine the starting position of the Lorenz SZ42's rotors. Each of the letters corresponds to a number.
German Lorenz operators consulted a Spruchtafel to determine which wheel settings (starting positions) to use based on a given 12-letter indicator. (source)
2. Encrypting and sending the message
Now, the teleprinter operator in Berlin hooks up the Lorenz encryption machine to the teleprinter and types the plaintext message.
The encrypted message is output on the same perforated paper tape, again encoded with the Baudot code.
The paper tape corresponding to the 12-letter indicator and the ciphertext is fed to a radio transmitter, which broadcasts it.
3. Intercepting the message
Radio receivers at an intercept station at Knockholt, Kent (south-east of London) pick up the encrypted message.
The faint signals are fed to an undulator, which uses an ink pen to record a continuous trace of the signal on a strip of paper tape, the "slip".
Slip readers (people, not machines) translate the highs and lows on the slip to characters according to the Baudot code. To minimize errors, two or more slip-readers read each transmission.
The characters are typed into a perforator that produces another strip of paper on which the characters are encoded in Baudot code.
The intercepted message is sent to Bletchley Park (100 km away) in two ways: by secure landline and by motorcycle courier.
4. Decrypting the message
The perforated tape is fed to Colossus, which outputs the most likely wheel settings (Colossus I) and wheel patterns (Colossus II onwards).
The input to the Colossus machine is perforated paper tape with characters in 5-bit Baudot code.
WWII-era cryptography vs. modern cryptography
I went to TNMOC with Thyla van der Merwe, another PhD student at Royal Holloway, to speak to the guests for a few minutes about cryptography today and how it works now compared to how it worked in the WWII era.
Thyla and Marie-Sarah next to TNMOC's rebuilt Colossus.
Thyla explained the benefits of using a stream cipher like the Lorenz cipher: stream ciphers are fast, they don't propagate ciphertext errors, and they require only small buffers. These properties made them appropriate for encrypting radio transmissions. She pointed out how ordinary citizens of the WWII era probably didn't use encryption, while today it is ubiquitous: everyone who's been online or had a cell phone has used it.
I talked about what makes "modern" cryptography different.
At the time of WWII, public-key cryptography had not yet been discovered, so sharing keys for any kind of symmetric protocol was still hard.
Cryptography in that era also didn't have the precise definitions, clear assumptions, and rigorous security reductions we have today. (Katz and Lindell's textbook does a wonderful job of explaining these three features of modern cryptography.)
Although these more formal aspects of modern cryptography are powerful, their strength in the real world is limited in two ways.
First, they may not capture all of the information or capabilities an attacker may have (e.g., side-channel attacks).
Second, and maybe even more importantly, they come with the assumption that protocols are implemented and used exactly as they should be.
For example, cryptographers know how important it is that a stream cipher (like the Lorenz cipher) never re-uses the same keystream for different messages, because the XOR of two ciphertexts would equal the XOR of the two plaintexts.
If the two messages are similar, then keystream re-use is particularly dangerous.
This mistake is exactly what led cryptanalysts at Bletchley Park to decrypt two long messages and obtain 4000 characters of keystream: in August 1941, a long message was retransmitted with a few minor changes, but with the same key settings.
Within a few months, cryptanalyst John Tiltman had recovered the keystream.
By January 1942, Bill Tutte had fully reverse-engineered the Lorenz machine... without ever having seen it!
The operator or implementer of a cryptographic protocol that uses a stream cipher may not understand how important it is that the keystream never be re-used, or may simply make a mistake.
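A few lines of Python (with made-up plaintexts) show why keystream re-use is so dangerous: XORing the two ciphertexts cancels the keystream entirely, leaving the XOR of the plaintexts, with zero bytes wherever the messages agree.

```python
import os

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

keystream = os.urandom(28)            # one keystream...
p1 = b"ATTACK THE BRIDGE AT DAWN xx"
p2 = b"ATTACK THE BRIDGE AT DUSK yy"  # a retransmission with minor changes

c1 = xor(p1, keystream)
c2 = xor(p2, keystream)               # ...used twice: the mistake

# The keystream cancels out: an attacker learns p1 XOR p2 without
# knowing the keystream at all.
assert xor(c1, c2) == xor(p1, p2)
```

When the two plaintexts are as similar as these, most of the XOR is zero bytes, which is exactly the foothold Tiltman exploited on the August 1941 retransmission.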
This type of mistake isn't limited to WWII.
In the late 1990s, the IEEE 802.11 standard specified the WEP (Wired Equivalent Privacy) protocol for Wi-Fi networks.
WEP uses the stream cipher RC4 to encrypt traffic from access points (wireless routers) to mobile stations (all devices wirelessly connected to the network).
Partly due to the WEP protocol's design, and partly due to how the access points' keys tended to be managed in practice, the same RC4 keystream was frequently re-used in implementations.
(The key supplied to RC4 was a concatenation of a 24-bit IV, sent in plaintext along with each encrypted message, and a 40-bit shared secret key, which was rarely changed.)
Read more about WEP's shortcomings in Borisov, Goldberg, and Wagner's CCS 2001 paper.
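The 24-bit IV alone makes keystream re-use almost inevitable. As a rough back-of-the-envelope estimate (my own, not a figure from the paper), a birthday-bound calculation shows that even uniformly random IVs would start colliding after only a few thousand frames:

```python
import math

IV_SPACE = 2 ** 24  # 24-bit IV sent in the clear with each WEP frame

def collision_probability(frames: int, space: int = IV_SPACE) -> float:
    """Birthday bound: probability that at least two frames share an IV,
    assuming IVs are chosen independently and uniformly at random."""
    # P(no collision) ~= exp(-frames*(frames-1) / (2*space))
    return 1 - math.exp(-frames * (frames - 1) / (2 * space))

# A busy access point can send thousands of frames per second.
for frames in (1_000, 5_000, 12_000):
    print(frames, round(collision_probability(frames), 3))
```

In practice it was often worse than this estimate: many devices reset the IV to 0 on power-up and incremented it, guaranteeing collisions between devices.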
Modern cryptography may offer many new tools, definitions, and rigorous proofs, but some things will never change: designing protocols that are secure in the real world is still really hard, and breaking cryptographic schemes still requires a great deal of creativity, analysis, and dedication.
More about the Lorenz story
Determining how the Lorenz machine worked was only the first step.
Tommy Flowers, an engineer at the Post Office Research Station, designed and built an emulator, "Tunny," of the Lorenz machine.
An entire section at Bletchley Park (the "Testery," named after the section head, Ralph Tester) was devoted to decrypting the messages—which they did by hand for the first 12 months.
Max Newman and Tommy Flowers designed and built machines to speed up the decryption process: the "Heath Robinson" and "Colossus".
Colossus was the first electronic digital machine that was programmable (with plugs and switches).
Heath Robinson and Colossus were operated (and named, actually) by members of the WRNS.
FISH and I: transcript of a lecture given by Bill Tutte at the University of Waterloo in 1998. (Bill Tutte was one of the first members of the Department of Combinatorics and Optimization at the University of Waterloo, the department in which I studied.)
Why we care about post-quantum cryptography should be clear by now (if it’s not, you’re in the right place anyway: you just need to take a step back and read this splendidly packaged blog post). A mathematical notion that is having much success in this direction is the Learning With Errors (LWE) problem (guess what, we've also talked about that here). Thanks to some mathematical evidence, we are quite confident of its hardness, and many cryptographic primitives whose security is based on it have arisen. That said, let’s take a step further and imagine a future (not that far off) in which all those primitives are standardised and deployed at many levels, including hardware security. Our experience with classical algorithms teaches us that we shouldn't infer the security of an implementation from the security of the scheme on paper, and post-quantum cryptography is no exception. In fact, side-channel attacks are known to be a great threat to cryptographic implementations. The question this blog post wishes to partially answer is: how can we secure our (future) post-quantum implementations?
Playing cops and robbers is one way, of course: we can apply common techniques (e.g. masking, randomisation, …) against known side-channel attacks by adapting the design of specific schemes. However, this approach has at least a couple of drawbacks:
a protection is not guaranteed to thwart future attacks (sometimes new attacks are specifically designed to circumvent a countermeasure). More generally, a countermeasure doesn't protect against anything outside what it was designed to protect against;
usual countermeasures introduce overhead (in terms of whatever metric you prefer, e.g. area, time, ...). This makes a protected scheme less efficient than its unprotected version.
Leakage-resilient cryptography is a possible alternative: a cryptographic scheme is provably secure by design, even in the presence of some anticipated amount of leakage. Usually such proofs rely on a bound on the amount of leakage that the scheme can tolerate. The main idea can be summed up in the following statement.
We build a cryptographic scheme which retains security even if a bounded amount of arbitrary information is leaked.
Full stop. There aren't many other assumptions, and this gives leakage-resilient cryptography enough generality to capture many, if not all, side-channel attacks. In essence, we are assuming that the scheme remains secure if leakage occurs, no matter how the leakage is exploited by an attacker. This seems to address the first drawback of ad-hoc countermeasures, provided everything is implemented properly. Is there anything we can do about the second drawback?
If we keep thinking of securing a scheme, then answering that question is really difficult. Indeed a countermeasure, even the most general and provable one, introduces changes to the original design which necessarily result in an overhead, for the simple reason that it takes into account a more powerful adversary, namely one who is also able to measure and exploit leakage. However, think of the case in which none of the anticipated leakage actually occurs. Then we end up with a less efficient implementation, the protected one, which doesn't provide any additional security because, without leakage, the original scheme was secure anyway. So, stop thinking of securing a scheme and start thinking of securing an assumption.
In their paper, Goldwasser, Kalai, Peikert and Vaikuntanathan show that LWE has the nice feature of being hard even in the presence of leakage. Does that sound like what we've just said? Essentially yes, but with a great difference: they're not building a leakage-resilient scheme (and hence introducing some sort of overhead), but proving the leakage-resilience of an assumption. This makes a big difference, because the former implies that efficiency degrades as leakage-resilience grows, while the latter implies that security, not efficiency, degrades. In particular, their proof establishes a relation between the security level and the amount of leakage, in such a way that we don't need to anticipate a maximum tolerable amount of it. The straightforward consequence is that a scheme based on LWE is secure independently of whether the leakage occurs or not, with the advantage that in the latter case we have not introduced any protection-related overhead into the design. As a result, they propose a symmetric-key encryption scheme that is secure even in the presence of leakage (with some caveats): the novelty lies in the fact that you won't find any parameter describing an amount of leakage (usually called $\lambda$ in the literature on leakage-resilient cryptography).
We omit any technicalities here, as this is meant to be a generic and accessible read. For further details on how the authors achieved this result, we refer to their paper. Instead, if you prefer to get just a mathematical flavour without going into too many details, you may like this post from the Bristol cryptography group's blog. Stay tuned for more insights on post-quantum cryptography!
P.S.: the word "leakage" appears 19 times in this text. If you're tired of reading it, watch this.
Marco
(Thanks to Simon and Matthias for some corrections!)
When I wrote my first blog entry here ("Ola from Gustavo"), I gave a short description of what will happen when someone presents the first quantum computer: most of today's cryptography (ECC, RSA) will be dead; see [1] for more details. One common remark about quantum computers and cryptography is: "Well, we don't have a quantum computer now, and the expectation for one is around 15 years away. Why should I be worried now?"
I will give one good reason. Imagine that you use some encryption (RSA, ECC or whatever) in your messages and someone starts storing these messages. Right now, this agency/person/computer is not able to read them. However, in around 15 years, when a real quantum computer appears, it will be possible to break the encryption and read everything about your past. For this reason, we need to prepare against this threat now; we shouldn't wait for the quantum computer before starting to protect ourselves. For this protection, we have good guardians of privacy who, with the union of their powers, started the research field of post-quantum cryptography (PQC).
In the figure (see above), we can see that PQC divides into four big areas: hash-based, code-based, multivariate-quadratic-equations-based, and lattice-based. There are other areas, but these four have become the most important in the last 10 years. In the following, I will give a very brief explanation and examples of schemes from each area of PQC. There is also a website about PQCrypto with initial recommendations and a list of publications in the area; to access it, click here.
HASH-BASED Cryptography
Hash-based cryptography includes important signature schemes such as the Lamport-Diffie OTS, Merkle's hash tree, XMSS, and SPHINCS, along with a lot of other relevant work. I decided to talk about one of the newest works in the area: SPHINCS.
SPHINCS was presented in 2015 at EUROCRYPT under the title "SPHINCS: practical stateless hash-based signatures". To better understand the idea behind this work, it is necessary to have prior knowledge of hash-based signatures, because the authors present two new ideas: replacing the leaf OTS (one-time signature) with a hash-based FTS (few-time signature), and using Goldreich's construction as a hyper-tree construction with h layers of trees of height 1. For more details about the implementation and construction of SPHINCS, see here.
Also, if you are interested in hash-based cryptography, more material can be found via pqcrypto hash-based and on Andreas Hülsing's webpage.
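To give a flavour of how hash-based signatures work, here is a minimal sketch of the Lamport-Diffie OTS mentioned above. It is my own toy implementation (using SHA-256 as the hash, with made-up parameter choices), not production code:

```python
import hashlib, os

H = lambda data: hashlib.sha256(data).digest()
N = 256  # number of message-digest bits to sign

def keygen():
    # Secret key: two random preimages per message bit;
    # public key: their hashes.
    sk = [(os.urandom(32), os.urandom(32)) for _ in range(N)]
    pk = [(H(a), H(b)) for a, b in sk]
    return sk, pk

def msg_bits(msg):
    digest = int.from_bytes(H(msg), "big")
    return [(digest >> i) & 1 for i in range(N)]

def sign(msg, sk):
    # Reveal one preimage per digest bit. Strictly one-time: signing a
    # second message would leak enough preimages for forgeries.
    return [pair[b] for pair, b in zip(sk, msg_bits(msg))]

def verify(msg, sig, pk):
    return all(H(s) == pair[b] for s, pair, b in zip(sig, pk, msg_bits(msg)))
```

The one-time restriction is exactly what Merkle trees, XMSS, and SPHINCS work around, by combining many such key pairs under a single public key.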
CODE-BASED Cryptography
The classic name that appears in code-based cryptography is McEliece’s hidden-Goppa-code public-key encryption system from 1978. However, there is newer work in the area, such as that of Misoczki, Tillich, Sendrier, and Barreto using quasi-cyclic moderate-density parity-check (QC-MDPC) codes for the McEliece encryption system. The principal contribution of this work to the area is its small key sizes.
Other relevant work, this time on fast implementation, came from Bernstein, Chou, and Schwabe under the name "McBits: fast constant-time code-based cryptography". Its main contribution is a constant-time implementation of code-based cryptography, which avoids timing attacks.
The literature on code-based cryptography is very large and can be explored via this link.
MQ-BASED Cryptography
Multivariate cryptography schemes are based on the difficulty of solving non-linear systems of equations over finite fields. Finding a solution to such systems is, in general, an NP-hard problem. One of many interesting examples is Patarin’s Hidden Field Equations, generalizing a proposal by Matsumoto and Imai (1988).
The original work from Patarin can be found at: Hidden Field Equations (HFE) and Isomorphisms of Polynomials (IP): Two New Families of Asymmetric Algorithms.
If we want to see what has happened recently in the MQ area, slides from the PQCrypto winter school are available online at PQCrypto. The slides from Jintai Ding give a very nice overview of the state of the art in Multivariate Public Key Cryptography (MPKC): he introduces MQ with a small example to show how it works, and then presents the principal cryptosystems in MPKC.
The slides are available at this link, and more material about MQ-based cryptography can be found at the PQCrypto project.
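To illustrate what an MQ instance looks like, the following toy sketch (parameters chosen by me purely for illustration) builds a random quadratic system over GF(2) and solves it by exhaustive search. The 2^n search is feasible for n = 8 but hopeless at the variable counts real schemes use:

```python
import itertools, random

random.seed(42)
n, m = 8, 8   # 8 Boolean variables, 8 quadratic equations (toy sizes)

# All candidate monomials over GF(2): quadratic x_i*x_j and linear x_i.
monomials = [(i, j) for i in range(n) for j in range(i, n)] + [(i,) for i in range(n)]

def random_equation():
    """Pick each monomial with probability 1/2, plus a random constant term."""
    return [t for t in monomials if random.random() < 0.5], random.randint(0, 1)

def evaluate(terms, const, x):
    """Evaluate one equation at assignment x; all arithmetic is mod 2."""
    value = const
    for t in terms:
        prod = 1
        for i in t:
            prod &= x[i]
        value ^= prod
    return value

system = [random_equation() for _ in range(m)]

# Exhaustive search over all 2^n assignments: feasible for n = 8,
# hopeless for the dozens-to-hundreds of variables in real MQ schemes.
solutions = [x for x in itertools.product((0, 1), repeat=n)
             if all(evaluate(terms, const, x) == 0 for terms, const in system)]
```

Note that over GF(2) we have x_i^2 = x_i, so the "quadratic" monomial (i, i) is effectively linear; real MQ constructions account for this too.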
LATTICE-BASED Cryptography
The lattice area has received a lot of attention in the past years: new cryptosystems, new attacks, and new implementations. Everything is based on hard problems in lattices, such as the shortest-vector problem (SVP) and the closest-vector problem (CVP), sometimes over structured (ideal) lattices. There is a lattice challenge to find the shortest vector in a lattice (or an ideal lattice) of a given dimension; more details can be found at the Lattice Challenge.
In the area of public-key cryptography, there are several schemes, mostly based on SVP and ideal lattices. That is the case for the schemes described by Peikert and by Stehlé et al.
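As a toy illustration of the kind of problem involved, the following sketch (my own, with an arbitrary example basis) brute-forces the shortest non-zero vector of a small integer lattice. This works only in tiny dimensions, which is precisely why SVP is believed to be hard:

```python
import itertools

def shortest_vector(basis, bound=20):
    """Brute-force the shortest non-zero lattice vector, trying every
    integer coefficient vector in [-bound, bound]^dim. The search space
    grows exponentially with the dimension -- this is only a toy."""
    best, best_norm = None, float("inf")
    for coeffs in itertools.product(range(-bound, bound + 1), repeat=len(basis)):
        if not any(coeffs):
            continue  # skip the zero vector
        v = [sum(c * b[k] for c, b in zip(coeffs, basis))
             for k in range(len(basis[0]))]
        norm = sum(x * x for x in v)  # squared Euclidean length
        if norm < best_norm:
            best, best_norm = v, norm
    return best, best_norm
```

For example, [[1, 2], [3, 5]] is a skewed basis of the plain integer lattice Z^2: neither basis vector is short, yet the lattice contains vectors of squared norm 1.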
Post-quantum Cryptography and ECRYPT-NET
One of the interests of the ECRYPT-NET project is PQC, and we have four fellows working in this area. In lattice-based cryptography, we have Marco Martinoli, Michele Minelli and Henry de Valence. As for quantum algorithms, that is the area in which I have started to work.
I started to read about all of the areas described above for two reasons. Firstly, I needed to understand the basics of each area to decide where I wanted to work. Secondly, after all this reading, I decided to work on quantum algorithms and how they can be used to attack each area. In the following months, I will start writing about quantum algorithms and their impact on cryptography.
[1] Bernstein, Daniel J., Johannes Buchmann, and Erik Dahmen, eds. Post-quantum cryptography. Springer Science & Business Media, 2009.
This is part one of a two-part blog post collaboratively written by Matthias and Ralph.
The Spring School on Symmetric Cryptography takes place at Beckmanns Hof just one minute south of Ruhr University Bochum.
The Spring School, which can be seen as a supplementary event to FSE 2016 next week, lies literally on the edge of the botanical garden of Bochum!
The Spring School's program covers lectures on the theory of "Boolean Functions" and "Statistical Models" in symmetric cryptography. Additionally, there are exercises, compiled by Anne Canteaut and Celine Blondeau, to help gain a firm understanding of the material.
The theoretical part is supplemented by the more practically oriented talks of the invited researchers Joan Daemen and Ventzi Nikov.
For the lecture on "Boolean Functions", Anne Canteaut preferred a sympathetic, old-fashioned presentation with chalk on blackboard in the lecture hall over an overhead presentation in the modern seminar room.
The topic ranged from Boolean functions and their algebraic normal form to the interpretation as Reed–Muller codes. Then the Walsh transform was introduced as a tool to analyze and approximate Boolean functions by linear ones.
In the lecture on "Statistical Models", Celine Blondeau (lecture notes here) reminded us of basic probability distributions well known to statisticians, and we explored some of their uses in cryptography. The discussion led us from the binomial, Poisson, and normal distributions over the chi-square and hypergeometric distributions to the Kullback-Leibler divergence and its applications to, e.g., linear attacks, differential attacks, and impossible-differential cryptanalysis. She also gave directions for ongoing work in that field.
The advantages of "Permutation Based Cryptography" were presented by Keccak (SHA-3) and Rijndael (AES) (co-)designer Joan Daemen.
The talk on "Threshold Implementations" by Ventzi Nikov covered techniques to decompose (non-linear) functions in order to separate the inputs from intermediate results, and masking.
The key values are masked by splitting function inputs into separate shares, such that knowledge of fewer than all shares does not allow recovery of the value.
Matthias and Ralph will stay tuned for some S-Box theory and the statistical wrap-up on Saturday morning and the slides and recorded videos of this Spring School will soon be available at the official website.
In the past few months, I've learned about many different areas of securely outsourcing storage and computation. One particularly interesting idea I've come across is oblivious random-access machines (ORAMs). In this post, I explain the basics of ORAMs and present the simplest version of path ORAM (eprint, ACM DL).
First, what is a random-access machine?
It's a model of computation.
Essentially, it's a simple computer that has two parts—a control unit and memory—and can run a program stored in its memory by fetching instructions one at a time.
The control unit has a small amount of storage, the registers, and it can read from or write to specific addresses in the memory.
Already, you might see how it could model cloud storage: the data owner (or client) is like the control unit, and the remote storage is like the memory.
In this post, instead of referring to a control unit and memory, I'll refer to a "client" or "data owner" and a "server."
Oblivious RAMs (ORAMs) hide access patterns to the memory or server.
They were proposed by Ostrovsky and Goldreich about 25 years ago (Ostrovsky's PhD thesis).
Their motivation was software protection: ORAMs can prevent reverse engineering of programs—useful for software companies with proprietary implementations (and for malware authors).
But ORAM is useful in settings other than software protection, like searchable encryption and cloud storage.
If you're already familiar with oblivious transfer or PIR, you may wonder how they relate to ORAM. The following table gives an overview of the main differences.
Scheme | Participants | Remote data | Requirements
oblivious RAM | client, server | read and write, private | server doesn't know access pattern
private information retrieval (PIR) | client(s), server | read-only, public | requested item unknown to server
symmetrically private information retrieval (SPIR) | client, server | read-only, public | requested item unknown to server, non-requested items unknown to client
oblivious transfer (OT) | chooser, sender | read-only | requested item(s) unknown to sender, non-requested items unknown to chooser
What does it mean to hide access patterns?
It means that the server (or anyone who sees all client-server communication and what's stored on the server at any time) doesn't learn anything about which data is actually being accessed, whether it is being updated or only read, how frequently it's accessed, whether the same data is accessed twice, etc.
The server should be oblivious to what the owner is doing with its data.
ORAM, more generally, is a cryptographic protocol with three algorithms, Init, Read, and Write, between a client and a server:
Init(B, N): the client allocates some memory on the server to store N data blocks of size B and maybe allocates some local memory.
Read(i): the server obliviously returns data block i to the client.
Write(i, d): the server obliviously replaces block i's data with d.
An ORAM is pattern-hiding if an adversary who chooses two access sequences of the same length can't distinguish which one is being executed, given a complete transcript of communication between the client and server and the contents of the server's memory at any time.
For example, maybe an adversary chooses the two access sequences ( (read, 2), (read, 2), (read, 2) )
and ( (read, 5), (write, 7, 0x30e84a13), (read, 1) ).
It sees the resulting communication transcript, which could look like ( (read, 10), (read, 12), (read, 8),
(write, 10, 0x53d35dd9), (write, 12, 0x16701de9), (write, 8, 0x0eae293c),
(read, 6), (read, 8), (read, 1),
(write, 6, 0x30e84a13), (write, 8, 0xffe8d5a4), (write, 1, 0xc113bc8b),
(read, 7), (read, 2), (read, 6),
(write, 7, 0xbb7da98a), (write, 2, 0x08bbfc25), (write, 6, 0xcaf6d2e2) ).
Given this transcript and what's stored on the server at any time, it shouldn't be able to determine which access sequence the ORAM is actually executing with probability greater than 1/2.
ORAM schemes (like path ORAM) tend to ignore exactly how data blocks are encrypted and instead focus on hiding the access pattern. However, all data should be encrypted, otherwise the server could easily learn about the access pattern. A scheme with IND-CPA security would be sufficient.
Here's a trivial ORAM scheme to show that it is indeed possible to hide access patterns.
Init(B, N): the server allocates BN bits of memory.
Read(i): the client requests all N data blocks from the server, decrypts them, identifies block i among them (e.g., by always prepending a block's identifier i to its data before encrypting it), then re-encrypts all N blocks and sends them back to the server.
Write(i, d): the client requests all N data blocks from the server, decrypts them, replaces the data of block i with d, then re-encrypts all N blocks, and sends them back to the server. This scheme is inefficient, but it hides the access pattern!
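The trivial scheme can be sketched in a few lines of Python. The "encryption" below is only a stand-in, a toy XOR with a prepended nonce that is NOT secure, where a real client would use a proper IND-CPA scheme such as AES-CTR with fresh nonces:

```python
import os

def enc(key: bytes, plaintext: bytes) -> bytes:
    # Stand-in for IND-CPA encryption. The random nonce only *simulates*
    # ciphertexts looking fresh on each re-encryption. Toy, NOT secure.
    nonce = os.urandom(12)
    return nonce + bytes(p ^ k for p, k in zip(plaintext, key))

def dec(key: bytes, ciphertext: bytes) -> bytes:
    return bytes(c ^ k for c, k in zip(ciphertext[12:], key))

class TrivialORAM:
    """Every access touches *all* N blocks, so the server sees the same
    thing whether the client reads or writes, and whichever block it is."""

    def __init__(self, num_blocks: int, block_size: int, key: bytes):
        self.key = key
        self.server = [enc(key, bytes(block_size)) for _ in range(num_blocks)]

    def _access(self, index, new_data=None):
        blocks = [dec(self.key, c) for c in self.server]   # download everything
        result = blocks[index]
        if new_data is not None:
            blocks[index] = new_data
        self.server = [enc(self.key, b) for b in blocks]   # re-encrypt everything
        return result

    def read(self, i):
        return self._access(i)

    def write(self, i, data):
        self._access(i, new_data=data)
```

Every operation costs O(N) bandwidth, which is what path ORAM improves to O(log N) blocks per access.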
Path ORAM
I'll explain the main ideas of an elegant tree-based ORAM protocol, path ORAM (eprint, ACM DL), developed by Emil Stefanov and others.
Init(B, N)
Suppose the client wants to store N data blocks of size B.
In the simplest variant of path ORAM, the storage on the server is treated as a binary tree of height L = ⌈log N⌉ with 2^L leaves.
(A binary tree of height L has L + 1 node levels.)
Each node in the tree is a bucket that contains Z blocks of size B, where Z is a small constant (e.g., 3–5) independent of N.
Buckets are padded with dummy blocks when necessary so they all appear full.
An example of a path ORAM tree with height L = 3 and Z = 3 blocks per bucket, which can store up to N = 8 data blocks. These data blocks are indistinguishable from the dummy blocks that fill up the rest of the tree.
The client stores two data structures: a position map and a stash.
The stash stores data blocks during block accesses and data blocks that have overflowed from the tree. It's initially empty.
The position map associates each of the N data blocks to one of the 2^L leaves, chosen independently and uniformly at random.
What makes path ORAM work is the following rule: at any time, every data block is stored either in the stash or in a node along the path from the root to its assigned leaf.
An example of a position map for N = 8 items, each assigned one of the tree's 2^L = 8 leaves.
Read(i)
To read block i, the client first looks up its associated leaf, PosMap[i]. Then, it sends a read request to the server for the contents of all the buckets along the path from the root to the leaf PosMap[i].
The client decrypts all of these (L+1)Z blocks, discards any dummy blocks, and stores the rest in its stash. It identifies block i among the blocks in the stash.
According to the "law" of path ORAM, the block was either already in the stash or was in the path that is now stored in the stash.
The client updates block i's assigned leaf in the position map, picking another one of the 2^L leaves uniformly at random.
Next, it tries to empty the stash, encrypting data blocks with fresh randomness and writing them back to the same path—the path from the root to block i's former assigned leaf.
Data blocks are put as close to the leaves as they can go while still respecting their assigned leaves: a block can be stored in a bucket only if that bucket is also on the path from its assigned leaf to the root. (Any two paths will intersect at the root bucket, at least.)
If there aren't enough eligible data blocks in the stash to fill a bucket, then dummy blocks are added.
If a bucket is full of data blocks and there are still eligible blocks in the stash, they will be eligible for the next bucket in the path, one level up.
If the root bucket is full of data blocks, then any remaining data blocks will stay in the stash.
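The read-and-evict procedure above can be sketched as follows. Encryption is omitted, since it is orthogonal to block placement, and the default bucket size is my own choice for the sketch:

```python
import random

class PathORAM:
    """Sketch of the simplest path ORAM variant: block placement only.
    Re-encryption of buckets on every access is omitted here."""

    def __init__(self, num_blocks, bucket_size=4):
        self.Z = bucket_size
        self.L = max(1, (num_blocks - 1).bit_length())          # 2^L >= N leaves
        self.tree = [[] for _ in range(2 ** (self.L + 1) - 1)]  # heap layout
        self.stash = {}                                         # block id -> data
        self.pos = {i: random.randrange(2 ** self.L) for i in range(num_blocks)}

    def _path(self, leaf):
        """Bucket indices from the root down to the given leaf."""
        node, path = 2 ** self.L - 1 + leaf, []
        while True:
            path.append(node)
            if node == 0:
                return path[::-1]
            node = (node - 1) // 2

    def access(self, i, new_data=None):
        leaf = self.pos[i]
        # 1. Read every bucket on block i's path into the stash.
        for node in self._path(leaf):
            for blk_id, data in self.tree[node]:
                self.stash[blk_id] = data
            self.tree[node] = []
        result = self.stash.get(i)
        if new_data is not None:
            self.stash[i] = new_data          # a write just changes the data
        # 2. Remap block i to a fresh, uniformly random leaf.
        self.pos[i] = random.randrange(2 ** self.L)
        # 3. Evict: refill the old path from the leaf upward, placing each
        #    stash block as deep as its (possibly new) assigned leaf allows.
        for node in reversed(self._path(leaf)):
            eligible = [b for b in self.stash if node in self._path(self.pos[b])]
            for b in eligible[: self.Z - len(self.tree[node])]:
                self.tree[node].append((b, self.stash.pop(b)))
        return result
```

Blocks that don't fit anywhere on the path (for example, when the root bucket is full) simply stay in the stash until a later access evicts them.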
If you've been following along closely, you might have noticed that after a block is read, it ends up in the root node with probability 1/2. However, even if this block is accessed twice in a row, the access will be indistinguishable from an access to any other block.
Write(i, d)
To write data d to block i, the client proceeds exactly as if it were reading block i, but simply changes the contents of block i to d in the stash before re-encrypting it and re-writing it to the server.
And that's the basic version of path ORAM. Simple, right?
Have a look at my path ORAM simulation on GitHub.
Security
Why is path ORAM "secure" in the sense of hiding the client's access pattern?
Its security is mainly due to the property that each data block is assigned a leaf that was chosen independently and uniformly at random, and is updated after each access. Therefore, each data access consists of reading and writing the path from the root to a random leaf.
Since all values are re-encrypted with fresh randomness every time they are accessed, only the client can distinguish reads and writes.
Efficiency
The client must store the position map and the stash. The size of the position map is N·L bits, or O(N log N) bits.
During an access, the stash must store an entire path from the root to a leaf node (transient storage).
The transient storage required is (L+1)ZB bits, which is O(log N) data blocks.
Between accesses, the stash must store any blocks that overflowed (persistent storage).
Stefanov et al. show that for a bucket size of Z = 5, the probability that the stash exceeds a capacity of R blocks is at most 14·(0.6002)^R, i.e., it decreases exponentially with the stash size.
(If you look at the path ORAM paper, you'll notice that the security analysis is half a page, while proving a bound on the persistent stash usage requires 8 pages!)
The server must store the tree, whose total size is (2^(L+1) − 1)ZB bits, while it is storing NB bits of actual data (i.e., non-dummy data), which corresponds to an overhead of about 2Z.
For each block access, the server must send an entire path from the root to a leaf node, and the client must also send it back. So, the bandwidth cost is 2(L+1)ZB bits.
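Plugging concrete numbers into these formulas gives a feel for the costs. The example parameters below (a million 4 KiB blocks, Z = 4) are hypothetical, chosen by me for illustration:

```python
from math import ceil, log2

def path_oram_costs(N: int, B: int, Z: int = 4) -> dict:
    """Concrete values for the storage and bandwidth formulas above.
    N = number of blocks, B = block size in bits, Z = bucket size."""
    L = ceil(log2(N))
    return {
        "position map (bits)": N * L,                    # N entries of L bits
        "transient stash (bits)": (L + 1) * Z * B,       # one full path
        "server storage (bits)": (2 ** (L + 1) - 1) * Z * B,
        "bandwidth per access (bits)": 2 * (L + 1) * Z * B,
    }

# Hypothetical example: a million blocks of 4 KiB each.
costs = path_oram_costs(N=2 ** 20, B=4096 * 8)
for name, bits in costs.items():
    print(f"{name}: {bits / 8 / 2**20:.1f} MiB")
```

With these parameters the position map is a few MiB on the client, each access moves under a MiB, and the server stores roughly 2Z = 8 times the actual data.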
There's a lot more to path ORAM than what I presented in this post.
If you're interested in learning more about it, its variants, or ORAMs in general, here are a few recent papers: