Friday at Royal Holloway, there was a one-day workshop on private information retrieval (PIR), distributed storage, and network coding. Four speakers gave talks on topics including coding for distributed storage, coding vs. replication, and multi-server PIR schemes.
Coding for distributed storage
The first speaker, P. Vijay Kumar, gave an overview of coding for distributed storage. When a storage node fails (due to hardware failure or corrupted data, for example), there are two main issues to consider: repair bandwidth (how much data it must download to recover) and repair degree (how many other nodes must communicate with it to help it recover). We learned about which types of coding schemes are used in practice. For some applications, triple replication—without any coding—is the standard solution. The RAID 6 storage architecture uses any maximum distance separable (MDS) code that tolerates 2 failures. Facebook uses HDFS-RAID with one of two erasure codes (blog post by Facebook engineers). HDFS-RAID is a modified instance of the Hadoop distributed file system (HDFS, which replicates data three times) that uses erasure codes to reduce the effective replication of data to about 2. Windows Azure Storage uses Local Reconstruction Codes (LRC) (Microsoft blog post, paper).
PIR and coding
The second speaker, Alex Vardy, presented techniques for PIR that use coding instead of replication.
Private information retrieval (PIR) is relatively new; it was proposed in a 1995 paper by Chor, Goldreich, Kushilevitz, and Sudan. The setting for PIR is as follows. A server has a public, static database of n items (bits, for example) and a user wants to retrieve a certain one without the server knowing exactly which. Ideally, a computationally-unbounded server wouldn't be able to get any information about the user's request: the distribution of queries wouldn't depend on the index of the requested data item. But the only PIR scheme for a single database that satisfies this notion of "perfect privacy" requires the server to send the entire database in response to every query. Instead, there are computational notions of privacy for PIR (where the server is computationally bounded) and for settings with more than one copy of the database, there are other information-theoretic notions.
One interesting aspect of the multi-server PIR setting that Alex presented was that the servers are assumed to be non-colluding. There was some discussion about whether this assumption makes sense, and whether any communication at all should be allowed between the servers. For example, if all but one of the servers receives a query, will they know?
The third speaker, Salim El Rouayheb, presented some coding-based PIR schemes with a different security model: some number b of the nodes are passive adversaries who can collude. We learned about Freenet, a privacy-friendly P2P distributed storage system for sharing files (paper).
The last talk, by Tuvi Etzion, was about network coding and its connections to distributed storage and PIR. A network is represented by an acyclic directed graph that can have parallel arcs (edges) and whose arcs each have an associated capacity. We considered two types of multicast networks. In the first, there is one source node that must distribute h messages (field elements) to n receiver nodes. In the second, there are h source nodes with one message each and n receiver nodes that need to receive all of the messages. In the scalar linear setting, each node "transforms" the messages it receives according to its local coding vector. (In the vector linear setting, nodes have local coding matrices.)
An example of a simple linear multicast network. The top two nodes are the source nodes and A and B are the messages they want to send to both of the receiver nodes (the bottom two nodes).
Tuvi shared his conjecture that for a multicast network of the first type with two messages, there is no vector solution that outperforms the optimal scalar solution. (See one of Tuvi's recent papers for related results.) It was neat that properties of multicast networks can be expressed succinctly in terms of graph-theoretic properties, like the size of a minimum cut between source nodes and receiver nodes.
I enjoyed this educational yet relaxed one-day workshop. What I found most interesting was how cloud storage providers and companies like Facebook encode their data. It's clear why they care about even small reductions in effective data replication: they want to maximize availability and tolerate hardware failures while reducing storage and communication costs. As a skeptical cryptographer, I can't help but wonder how they compose these coding schemes with encryption and other cryptographic tools to protect users' privacy.