ECRYPT-EU: PIR

Tuesday, September 6, 2016

Requirements and security goals for the Cloud

This is a two-part blog post giving an overview on the requirements and security goals for the Internet of Things (IoT) and the Cloud.

But first ... I'd like to mention that soon the ECRYPT-NET Summer School on cryptography for the cloud in Leuven is coming up.
This blog post can be seen as an intro to cryptology in the cloud; so there are some topics we will for sure hear about at the event in depth. I'm looking forward to a good discourse during the scheduled talks and to the discussions afterwards.

Part I: Requirements and security goals for the Cloud

The cloud here is loosly defined as computing and storage resources out-sourced to servers located off-site that are on-demand accessible via Internet.

Overview of cloud computing services to be secured
(Image by Wikimedia user Sam Johnston)

The usage of cloud services should thus be viewed as a potential threat, because inherently critical data is released from personal control and uploaded and so immediately raising several security as well privacy issues.

Requirements and the according cryptologic counter-measures are defined for various use-cases (such as depicted in the image) and briefly explained. It is important to realize that this needs to be done without deliberately weakening (or "back-dooring") solutions, as is sometimes suggested. To ensure democratic use of the developed technologies it is vital to see that the difference of a "front door" and a "back door" is merely ones viewpoint. Intentionally implementing two entrances makes an attacker happy not the legitimate user.

Transparent services, meaning the use of the cloud as if it was on-site, should ideally be built upon a back-end with verifiable, clear, possibly standardized, cryptologic concepts and open code for public scrutiny.

To capture the real-world cloud settings one has to view it from different perspectives: First the one of a private entity (user), then that of an organization (such as a company) and lastly the global perspectives of, say, a government.

An interesting starting-point for threats we have to consider, is the Cloud Security Alliance's 2016 report for cloud security naming twelve critical issues ranked by severity of real-world impact.

Data breaches is followed directly by weak identity, credential and access management and insecure Application Programming Interfaces (APIs) in this report --- issues that can be addressed by assessing requirements and tailoring cryptographic solutions from the beginning and deploying state-of-the-art implementations instead of sticking to legacy code.

Roughly four categories of requirements for the usages of the cloud can be distinguished:

Computations in the Cloud
Sharing Data in the Cloud
Information Retrieval from the Cloud
Privacy Preservation in the Cloud

The approach to secure these areas are manifold. General speaking, while data minimization is a fundamental paradigm for efficient interaction with the cloud security-by-design --- considering best-practices from the beginning --- is enabled by encapsulation of scopes of methods.
The following, briefly explained, concepts tackle concrete use-cases for end-users, companies and e-Government tasks alike:

Order-Preserving Encryption (OPE) allows efficient range queries but does it diminish security too much?
Format-Preserving Encryption (FPE) offers in-place encryptions in legacy databases.
Data De-Duplication (DDD) is enhancing servers' back-end performance.
Secret Sharing Schemes (SSS) is solving back-up and availability issues by distributing encrypted parts of the whole.
Malleable Signature Schemes (MSS) is providing flexible authentication for documents and derived data.
Private Information Retrieval (PIR) privately accessing elements in a database.
Provable Data Possession (PDP) ensures that the cloud is storing the (uncorrupted) files.
Multi-Party Computation (MPC) allows secure cooperation across the Internet.
Verifiable Computation (VC) builds trust in results of a delegated computation.

We can observe that the cryptology community is striving for efficient cryptographic primitives that serve as building-blocks for implementations. Naturally researchers base their constructions on well-established security assumptions and derive clear security proofs to offer i.e. long-term confidentiality. This is not at least important because of the emergence of future quantum computers with powerful quantum algorithms posing a threat and needs to be addressed today before the transition of current solutions to the cloud, where the constructions would be exposed and would possibly be vulnerable to a quantum attacker.

Let's hope this trend continues and ultimately leads to reliable, standardized primitives for real-world applications.

Stay tuned for Part II: Requirements and security goals for the IoT...

Sunday, July 10, 2016

Workshop on PIR, distributed storage, and network coding at RHUL

Friday at Royal Holloway, there was a one-day workshop on private information retrieval (PIR), distributed storage, and network coding. Four speakers gave talks on topics including coding for distributed storage, coding vs. replication, and multi-server PIR schemes.

Coding for distributed storage

The first speaker, P. Vijay Kumar, gave an overview of coding for distributed storage. When a storage node fails (due to hardware failure or corrupted data, for example), there are two main issues to consider: repair bandwidth (how much data it must download to recover) and repair degree (how many other nodes must communicate with it to help it recover). We learned about which types of coding schemes are used in practice. For some applications, triple replication—without any coding—is the standard solution. The RAID 6 storage architecture uses any maximum distance separable (MDS) code that tolerates 2 failures. Facebook uses HDFS-RAID with one of two erasure codes (blog post by Facebook engineers). HDFS-RAID is a modified instance of the Hadoop distributed file system (HDFS, which replicates data three times) that uses erasure codes to reduce the effective replication of data to about 2. Windows Azure Storage uses Local Reconstruction Codes (LRC) (Microsoft blog post, paper).

PIR and coding

The second speaker, Alex Vardy, presented techniques for PIR that use coding instead of replication.

Private information retrieval (PIR) is relatively new; it was proposed in a 1995 paper by Chor, Goldreich, Kushilevitz, and Sudan. The setting for PIR is as follows. A server has a public, static database of n items (bits, for example) and a user wants to retrieve a certain one without the server knowing exactly which. Ideally, a computationally-unbounded server wouldn't be able to get any information about the user's request: the distribution of queries wouldn't depend on the index of the requested data item. But the only PIR scheme for a single database that satisfies this notion of "perfect privacy" requires the server to send the entire database in response to every query. Instead, there are computational notions of privacy for PIR (where the server is computationally bounded) and for settings with more than one copy of the database, there are other information-theoretic notions.

One interesting aspect of the multi-server PIR setting that Alex presented was that the servers are assumed to be non-colluding. There was some discussion about whether this assumption makes sense, and whether any communication at all should be allowed between the servers. For example, if all but one of the servers receives a query, will they know?

The third speaker, Salim El Rouayheb, presented some coding-based PIR schemes with a different security model: some number b of the nodes are passive adversaries who can collude. We learned about Freenet, a privacy-friendly P2P distributed storage system for sharing files (paper).

Network coding

The last talk, by Tuvi Etzion, was about network coding and its connections to distributed storage and PIR. A network is represented by an acyclic directed graph that can have parallel arcs (edges) and whose arcs each have an associated capacity. We considered two types of multicast networks. In the first, there is one source node that must distribute h messages (field elements) to n receiver nodes. In the second, there are h source nodes with one message each and n receiver nodes that need to receive all of the messages. In the scalar linear setting, each node "transforms" the messages it receives according to its local coding vector. (In the vector linear setting, nodes have local coding matrices.)

An example of a simple linear multicast network. The top two nodes are the source nodes and A and B are the messages they want to send to both of the receiver nodes (the bottom two nodes).

Tuvi shared his conjecture that for a multicast network of the first type with two messages, there is no vector solution that outperforms the optimal scalar solution. (See one of Tuvi's recent papers for related results.) It was neat that properties of multicast networks can be expressed succinctly in terms of graph-theoretic properties, like the size of a minimum cut between source nodes and receiver nodes.

I enjoyed this educational yet relaxed one-day workshop. What I found most interesting was how cloud storage providers and companies like Facebook encode their data. It's clear why they care about even small reductions in effective data replication: they want to maximize availability and tolerate hardware failures while reducing storage and communication costs. As a skeptical cryptographer, I can't help but wonder how they compose these coding schemes with encryption and other cryptographic tools to protect users' privacy.