Skip to content

Top Best Ask Show New Jobs

ZeroDB white paper: An end-to-end encrypted database (opens in new tab)

(arxiv.org)

171 pointsmwilkison10y ago38 comments

38 comments

31 comments · 11 top-level

teraflop10y ago· 7 in thread

Interesting concept. I disagree with this statement from the "Data confidentiality" section:

> Unlike CipherBase [6], our approach doesn’t leak information about the order of the elements. By observing access patterns, the server can only figure out which buckets are leaf nodes and which are branch nodes, but not whether the child node is located to the left or right of the parent node.

The paper correctly notes that when a client does a range query, it leaks information about the total number of blocks returned. However, if I've understood correctly, it also leaks the identities of those blocks. The closer two tree nodes are in order, the more likely they are to be touched by the same request. A passive adversary could use this to extract information about the relative ordering of encrypted nodes.

Also, active adversaries cause much worse problems. Let's say I have the ability to cause someone to do queries on the encrypted database, without observing the results. By simply doing separate index lookups on two different fields, I can extract a lot of information by looking at which nodes were touched by both queries.

Finally, I noticed that the paper doesn't describe how the B-trees are modified. If the client is the only party that knows how to decrypt nodes, then it's responsible for splitting and merging nodes to keep the tree balanced. Doesn't this require some kind of distributed coordination, so that multiple clients don't trample on each other's changes?

swordswinger1210y ago

Another thing from that same section that worried me was their discussion of inference attacks on first and last names. Their results assume names will be queried based on a particular distribution instead of an arbitrary (maybe even adversarially-chosen) one. This is a very, very weak model of security that categorically eliminates realistic attacks.

michwill10y ago

The thing is that anybody who is able to initiate queries has access to that data. No need to do inference attacks over there :-) So, an adversary cannot query data based on particular distribution b/c it cannot initiate meaningful queries.

If an active server-side adversary can trick the client into doing certain queries is another question though.

michwill10y ago

You are right about the identifiers. And that is exactly what makes us concerned about access patterns attacks.

They are not as dangerous when queries are rare. But we do explore possibilities to mitigate this attack vector completely using oblivious RAMs which become more and more performance.

You are also right about clients updating the b-trees. Currently we use object-level locks to keep data consistent. We may add storing deltas for buckets to keep them on the server when a conflict appears plus use additively homomorphic encryption for counters (BTree.Length object).

michwill10y ago

Also you cannot do queries on the encrypted db without being able to observe the results. E.g. if you are able to initiate a query on a fraction of the data, you already have access to that fraction of the data. And you cannot initiate a meaningful query if you are a server-side observer.

Of course, active adversaries could give wrong results (wrong pieces of the index) back, we didn't quite consider yet which problems that could cause.

teraflop10y ago

> And you cannot initiate a meaningful query if you are a server-side observer.

But that's only true if you consider the database in isolation. It's a very bad assumption to make when using it as a component of a larger system!

Let's say you build a social networking service which uses ZeroDB to store its data; I'm an attacker who has access to the database server. There are all kinds of ways I can gain indirect control over the requests that users make to the service. What if I convince two different users to open a web page containing something like "<img src='https://supersecuresocialnetwork.com/friend-list.jsp'>"? I've just triggered an operation that will cause the app servers to request tree nodes corresponding to those users' accounts, and those of their friends, at predictable times. I can correlate the requests and determine with some level of confidence whether the users are mutual friends, all without being able to see a single byte of plaintext. If the service is publicly available and I can register my own accounts, that opens up another huge category of attacks.

My point is that you're making an awful lot of assumptions about what strategy an attacker will use. If you don't have provable security against an entire class of attacks (e.g. cryptographic properties like IND-CPA) then the burden is on you to be more imaginative than whoever is trying to compromise your system.

roymurdock10y ago

Is there a reason that you did not cite Enigma in your list of notable works? They solve the adversary problem by splitting tasks among a network of nodes that post their intermediate work to a blockchain where it can be verified for accuracy:

Very recently, Baum et al. developed a publicly auditable secure MPC system that ensures correctness, even when all computing nodes are covertly malicious, or all but a single node are actively malicious [18]. Their state-of-the-art results are based on a variation of SPDZ (pronounced speedz) [19] and depend on a public append-only bulletin board, which stores the trail of each computation. This allows any auditing party to check the output is correct by comparing it to the public ledger’s trail of proofs. Our system uses the blockchain as the bulletin board, thus our overall security is reduced to that of the hosting blockchain.

I'm not a db or crypto expert, but this seems like a promising solution to the active adversary problem if you have an honest, healthy blockchain that is incentivized correctly.

geoelectric10y ago

Worth noting one of the previous posts, https://news.ycombinator.com/item?id=9262846

I raised some questions there related to your last concerns. I'm really interested to go through the current state and see how things have evolved to mitigate them.

saganus10y ago· 3 in thread

This looks very interesting.

I guess that this would be the next-best thing after FHE?.

The catch is the performance, right? I mean, the paper says:

"In order to make the database end-to-end encrypted yet still capable of performing queries, the client encrypts the buckets (at the time of creation or modification). The server, which stores the buckets, never knows the encryption key used. The objects referenced by the leaf nodes of the B-Tree indexes are also encrypted client-side. As a result, the server doesn’t know how individual objects are organized within the B-Tree or whether they belong to an index at all. Since ZeroDB is encryption agnostic, probabilistic encryption can ensure that the server cannot even compare objects for equality. When a client performs a query, it asks the server to return buckets of the tree as it traverses the index remotely and incrementally, as in Fig. 2b. The client fetches and decrypts buckets in order to figure out which buckets in the next level of the tree to fetch next. Frequently accessed buckets can be cached client-side so that subsequent queries do not make unnecessary network calls."

Which would suggest that it might be a bit slow to do many round-trips for each query.

I'm guessing there could be some specific use-cases where this is not so relevant?

I would love to have some more knowledgeable people comment on this as it would be really neat to be able to start using a DB with these features.

newscracker10y ago

> When a client performs a query, it asks the server to return buckets of the tree as it traverses the index remotely and incrementally, as in Fig. 2b. The client fetches and decrypts buckets in order to figure out which buckets in the next level of the tree to fetch next. Frequently accessed buckets can be cached client-side so that subsequent queries do not make unnecessary network calls."

>Which would suggest that it might be a bit slow to do many round-trips for each query.

I was curious and found some information through a search (key point italicized) [1]: "On the performance aspect, with a real world use case of 1GB index, just 150KB of data must be transferred on average over three requests to fetch the results back. In full text search terms, 250MB of data can be queried in around 500msec which even though slow, may not be prohibitively slow for some use cases. Insert queries also may take around the same time. The number of requests needed to fetch the query results grows logarithmically with the data size."

[1]: http://www.infoq.com/news/2015/04/Encrypt-Database-CryptDB-Z...

>I'm guessing there could be some specific use-cases where this is not so relevant?

They claim the performance degradation is not much, but that would become worse as the number of clients grows. The clients also would need to have a relatively decent amount of RAM and CPU power to handle this.

Related: CryptDB from MIT - http://css.csail.mit.edu/cryptdb/

michwill10y ago

Yep, relevant pieces, thanks for bringing this!

Our performance become a little better since then. And also we'll update our fulltext search algorithm using Lucene's practical scoring function: that makes it more scalable and will also help with performance.

As number of clients grows, things actually become better for read queries (as long as we support MVCC). However, multiple writing clients can create congestion if they write data to the same place which can affect the performance indeed.

saganus10y ago

Nice find. Thanks!

augustl10y ago· 2 in thread

Seems that this initial release only supports python. The client is obviously heavy, since it has to run all the queries, etc. Thankfully the protocol is well defined, so it's just a question of effort to support more languages.

https://docs.zerodb.io/

ex3ndr10y ago

They are our friends and i continuously asking them to make open standard instead of implementing everything on python and python database and make it work with psql or cassandra. I hope they will release something for this as we are wish to include zerodb to our encrypted messaging (actor.im).

michwill10y ago

In fact, the approach you used for actor (using GWT) is no less applicable to us (using pyjs or rapydscript - see https://medium.com/@ZeroDB_/zerodb-js-zerodb-in-the-browser-...)

swordswinger1210y ago· 2 in thread

I read the whitepaper, and while I have many questions one detail in particular jumped out at me:

"The client authenticates with an X.509 certificate or a self-signed certificate where the private key is derived from a passphrase. When a passphrase is used, the scrypt algorithm is used to derive a private key from the passphrase. Information about the salt is stored server-side and is given before establishing an authenticated encrypted connection. ... The symmetric key is derived from either the passphrase, the private key in an X.509 certificate, or provided by a key management service (KMS)."

You're deriving a private key from a user-defined passphrase, then storing the salt used to derive the password on the server, which is defined to be untrusted. I understand you use scrypt to derive the private key, but this still seems like a massive, massive footgun. Database systems are usually kept up and running for a long time, like maybe even years. If the server gets the salt, isn't it only a matter of time before even a very slow brute-force attack can re-derive the authentication credential and make whatever queries it wants?

michwill10y ago

Deriving from passphrase is optional. Mainly, for development and testing, and also for users who are ok with passphrases.

Having a passphrase + server-saved salt makes it better than a passphrase with no salt (for brute-force by dictionary).

If you think about it, security of passphrase approach is tested all the time with Bitcoin brain wallets. Over there, private keys are derived using SHA256 (which is fast to compute!) with no salt. It's hard for humans to generate good passphrases, so these wallets do that for them (usually 12 words).

But yeah, in most production usecases it would be some certificate, or KMS

swordswinger1210y ago

Yeah, this approach has been tested and has failed miserably: http://eprint.iacr.org/2016/103.pdf

0xFFFF000010y ago· 2 in thread

SQL Server has Always Encrypted - https://msdn.microsoft.com/en-us/library/mt163865.aspx

sigmar10y ago

That doesn't appear to use homomorphic encryption though

michwill10y ago

Yeah, that is actually deterministic encryption.

Microsoft had a very interesting research paper called CipherBase (we cited it), but I think they didn't end up commercializing that research

mankash66610y ago· 2 in thread

OK - So why should I use ZeroDb when it seems to be using Crypt-DB for encrypted query processing [which is the USP of ZeroDb] when cryptDb is fully open source, free and verifiable?

mwilkisonOP10y ago

ZeroDB isn't using CryptDB. And ZeroDB is open source: https://github.com/zero-db/zerodb

mankash66610y ago

Thank you for the clarification. How are you computing over encrypted data? From the "acknowledgements" section of the paper, it seems to be the same technique as cryptDB. If so, where is this source code? [I'm assuming this to be in C/C++] in the "crypto" folder as in cryptDb

derefr10y ago· 1 in thread

I can't help but compare this system to Datomic. Effectively, what ZeroDB is calling the "database" is what Datomic calls a "storage service"; and what ZeroDB is calling the "client", Datomic calls the "transactor."

In both architectures, "blocks" of data get pulled from the storage service (database) to the transactor (client), which decodes blocks into rows of tables or indexes, does a query which possibly mutates those tables/indexes, generates new blocks from the mutated tables/indexes, and uploads them.

The difference is that ZeroDB's blocks are encrypted by the client, obviously—but I'm not sure that this is a semantic difference in the architectures of the two databases. I'm wondering if an equivalent change could be made to Datomic's transactor logic, without materially affecting the way Datomic works. (The question is basically how much of Datomic's required block metadata, necessary for finding the blocks on the storage service, leaks information which ZeroDB has managed to sequester inside the encrypted blocks instead.)

michwill10y ago

Huh, interesting comparison. Thanks for pointing that out! We'll read more about Datomic: maybe we can learn something useful from them

crystaln10y ago· 1 in thread

Also the name of a popular silent disco company in San Francisco:

https://www.facebook.com/ZEROdBsf/

Clearly not overlapping business though.

michwill10y ago

Yeah, right. I guess, as long as we don't redirect encrypted content of a database to a sound device, we should be all right :-)

mwilkisonOP10y ago

BTW, you can hop in our Slack [http://slack.zerodb.io/] channel if you have any questions.

corbinpage10y ago

Being able to query/search/sort on encrypted data sounds really powerful and super relevant for this day and age.

I look forward to playing around with it.

Very cool MacLane!

StabbyCutyou10y ago

Well...

Guess I need to rename https://github.com/StabbyCutyou/0db

j / k navigate · click thread line to collapse