https://en.wikipedia.org/wiki/Dropbox_%28service%29#Privacy_...
Tomorrow. Hmm, we have your data, lots of data. We wanted to know what interests our users, so we decided to analyse it and find people with common interests.
Day after tomorrow. Miss Rice challenged us "can you find terrorist users using all of the documents you have indexed and analysed?"
Future. Hey user, your first name is strange, your documents contain some strange characters, and you are uploading data from a country our political leaders have problems with. Are you a terrorist?
They're adding a feature that they think the majority of their users will find useful, and betting that most of their users realize this doesn't increase Dropbox's access to their content: they could have written this indexing years ago and never disclosed it, using it solely for whatever tinfoil-hat conspiracy theory style acts they'd like.
If they lose the bet, they either pull the feature or go out of business. I'm betting on their side, though, given that I expect most Dropbox users either don't care (which is their right) or care and made a willful decision to store their data on somebody else's systems (which is also their right).
Did ES just not scale when you tried it? Is your solution better/faster? If so, by how much and on what workloads?
Contrast this with something like RocksDB. They just show you the numbers - http://rocksdb.org/
That said, I'm somewhat surprised they didn't just try doing custom index/word counters against a larger Cassandra cluster, which would scale well while still being somewhat out of the box as a software approach. I didn't thoroughly read through the article, but I'm not sure about their use of stemming/mapping for word bases either.
If the alternative is maintaining a single index, won't updating it take at least as long as updating per-user indexes? The former naively sounds like updating a single, gigantic binary search tree; the latter seems like updating a hashmap of UserId/BST pairs.
"Secondly, this approach requires the system to maintain as many indices as there are users with each stored in a separate file. With over 300 million users, keeping track of so many indices in production would be an operational nightmare."
..Why?
Anyway, the stuff about shared documents is probably enough to make per-user indexing a bad idea, but I don't understand the reasons they provided above.
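To make the comparison concrete, here is a rough sketch of the two layouts I have in mind. It is a toy, not how Firefly works: the class names, the `readers` list, and the `posting_writes` counter are all made up for illustration, and plain sets stand in for the BSTs. It counts how many posting entries get written when a document shared with several users is indexed.

```python
from collections import defaultdict


def tokenize(text):
    # Naive whitespace tokenizer; a real system would also stem and normalize.
    return set(text.lower().split())


class SingleIndex:
    """One shared inverted index (token -> doc ids) plus a doc -> readers map."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.readers = {}
        self.posting_writes = 0  # hypothetical counter, for comparison only

    def add(self, doc_id, text, readers):
        self.readers[doc_id] = set(readers)
        for token in tokenize(text):
            self.postings[token].add(doc_id)
            self.posting_writes += 1

    def search(self, user_id, token):
        # Look up once, then filter by who is allowed to read each document.
        hits = self.postings.get(token.lower(), set())
        return sorted(d for d in hits if user_id in self.readers[d])


class PerUserIndex:
    """A hashmap of user id -> that user's own inverted index."""

    def __init__(self):
        self.indexes = defaultdict(lambda: defaultdict(set))
        self.posting_writes = 0  # hypothetical counter, for comparison only

    def add(self, doc_id, text, readers):
        tokens = tokenize(text)
        # A shared document is written once per reader.
        for user_id in readers:
            for token in tokens:
                self.indexes[user_id][token].add(doc_id)
                self.posting_writes += 1

    def search(self, user_id, token):
        user_index = self.indexes.get(user_id, {})
        return sorted(user_index.get(token.lower(), set()))


def demo():
    single, per_user = SingleIndex(), PerUserIndex()
    docs = [
        ("d1", "quarterly budget draft", ["alice"]),
        ("d2", "team budget plan", ["alice", "bob", "carol"]),
    ]
    for doc_id, text, readers in docs:
        single.add(doc_id, text, readers)
        per_user.add(doc_id, text, readers)
    return single, per_user


if __name__ == "__main__":
    single, per_user = demo()
    print(single.search("bob", "budget"), per_user.search("bob", "budget"))
    print(single.posting_writes, per_user.posting_writes)
```

In this toy, a private document costs the same in both layouts, but the document shared with three users costs three times as many writes per-user. That matches the shared-documents objection; it says nothing about why tracking many index files would be an operational problem.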
Any details on the tech Firefly was coded in? Go, C++, Java?