Firefly – Full Text Search Engine for Dropbox

Most likely the average HN reader is well aware that Dropbox must have access to read your files. But an average Joe who doesn't read the ToS and doesn't think everything through might assume that private storage is private to Joe only, and that the advertised encryption extends beyond transport to storage.

rakoo 11y ago

Because you can use Dropbox as a raw storage system by encrypting stuff on the client?

tomglindmeier 11y ago

I avoided Dropbox. The news about Firefly just reminded me why.

tuyguntn 11y ago · 2 in thread

Today. We have indexed all of your documents, so you can search inside them easily, even though you have already created a good directory structure and named your files accordingly.

Tomorrow. Hmm, we have your data, lots of data. We wanted to know what interests our users, so we decided to analyse it and find people with common interests.

Day after tomorrow. Miss Rice challenged us: "Can you find terrorist users using all of the documents you have indexed and analysed?"

Future. Hey user, your first name is strange, your documents contain some strange characters, and you are uploading data from a country our political leaders have problems with. Are you a terrorist?

akerl_ 11y ago

You are of course welcome to not put your content on Dropbox, a private service run by a for-profit company.

They're adding a feature that they think the majority of their users will find useful, and betting that most of their users realize this doesn't increase Dropbox's access to their content: they could have written this indexing years ago and never disclosed it, using it solely for whatever tinfoil-hat conspiracy theory style acts they'd like.

If they lose the bet, they either pull the feature or go out of business. I'm betting on their side, though, given that I expect most Dropbox users either don't care (which is their right) or care and made a willful decision to store their data on somebody else's systems (which is also their right).

em3rgent0rdr 11y ago

Part of the concern is that your average Joe, who doesn't read the ToS or understand computers, might assume that their private Dropbox.com folder is an exact computer analogue of a physical private folder: you drop in the files from your desk, and some computer magic syncs them between your work and home desks.

eiopa 11y ago · 1 in thread

I would've loved to hear more details about why you built your own. For example, you mention that Elasticsearch wasn't deployed at your scale and there was some talk about machine footprint, but it doesn't explain how your solution compares to something like ES.

Did ES just not scale when you tried it? Is your solution better/faster? If so, by how much and on what workloads?

Contrast this with something like RocksDB. They just show you the numbers - http://rocksdb.org/

tracker1 11y ago

I think it depends partly on implementation and partly on their use case... The way ES works, it probably doesn't do well past a couple hundred nodes max, and even then it likely has some real issues. Though they could have several independent clusters and shard out their users. What it looks like they implemented will scale better (to thousands of servers), and probably work better with their design.

That said, I'm somewhat surprised they didn't just try doing custom index/word counters against a larger Cassandra cluster, which would scale well while still being somewhat out of the box as a software approach. I didn't read through the article thoroughly, and I'm not sure about their use of stemming/mapping for word bases either.
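One way to picture the "several independent clusters" idea above is a stable hash that pins each user to one cluster. This is only a sketch: the function name, the user id format, and the cluster count are made up for illustration, and nothing here reflects how Dropbox or Elasticsearch actually routes requests.

```python
import hashlib


def shard_for_user(user_id: str, num_shards: int) -> int:
    """Map a user id to a shard number in [0, num_shards).

    Hypothetical helper: uses SHA-256 so the mapping is the same on
    every machine and every run (unlike Python's built-in hash()).
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce.
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    # Each user always lands on the same (assumed) cluster.
    for user in ["alice", "bob", "carol"]:
        print(user, "-> cluster", shard_for_user(user, 4))
```

A plain modulo like this moves most users to a different cluster whenever the cluster count changes; consistent hashing is the usual way to limit that reshuffling.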

vskr 11y ago · 1 in thread

Is the LevelDB linked in this article related to the LevelDB developed at Google (https://github.com/google/leveldb)?

sandyarmstrong 11y ago

The "C++ Library" link on http://leveldb.org/ points at Google's LevelDB, so yes.

dapz 11y ago

"Firstly, we expect some users to have a large number of documents in their Dropbox, making it non-trivial to update their corresponding index “instantly”."

If the alternative is maintaining a single index, won't updating it take at least as long as updating per-user indexes? The former naively sounds like updating a single, gigantic binary search tree; the latter seems like updating a hashmap of UserId/BST pairs.

"Secondly, this approach requires the system to maintain as many indices as there are users with each stored in a separate file. With over 300 million users, keeping track of so many indices in production would be an operational nightmare."

...Why?

Anyway, the stuff about shared documents is probably enough to make per-user indexing a bad idea, but I don't understand the reasons they provided above.
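To make the comparison in this comment concrete, here is a toy sketch of the two layouts: a map of per-user inverted indexes versus one shared inverted index filtered by who can read each document. The class names, the whitespace tokenizer, and the access-set model are all assumptions for illustration; the article does not describe Firefly's index in these terms.

```python
from collections import defaultdict


def tokenize(text):
    """Naive tokenizer (assumption): lowercase, split on whitespace."""
    return text.lower().split()


class PerUserIndex:
    """One inverted index per user: user -> token -> doc ids."""

    def __init__(self):
        self.indexes = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, text, users):
        # A shared document is indexed once for every user who can read it.
        for user in users:
            for token in tokenize(text):
                self.indexes[user][token].add(doc_id)

    def search(self, user, token):
        return set(self.indexes.get(user, {}).get(token.lower(), set()))

    def posting_count(self):
        return sum(len(docs) for idx in self.indexes.values()
                   for docs in idx.values())


class SharedIndex:
    """One inverted index for everyone, filtered by access at query time."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> doc ids
        self.readers = {}                 # doc id -> users with access

    def add(self, doc_id, text, users):
        # A shared document is indexed once, however many users can read it.
        self.readers[doc_id] = set(users)
        for token in tokenize(text):
            self.postings[token].add(doc_id)

    def search(self, user, token):
        hits = self.postings.get(token.lower(), set())
        return {d for d in hits if user in self.readers[d]}

    def posting_count(self):
        return sum(len(docs) for docs in self.postings.values())
```

In this toy model, a two-word document shared with two users costs four postings in the per-user layout and two in the shared one, which is the shared-documents cost the comment points at. The shared layout pays instead at query time, where every hit has to be checked against the access set.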

cdnsteve 11y ago

RabbitMQ, interesting. I was just reading up on NSQ and it seems like a good alternative.

Any details on the tech Firefly was coded in? Go, C++, Java?

georgehm 11y ago

Can it do substring search? Unfortunately, Firefly is only available for business customers.
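For context on why substring search is a separate question: an index keyed on whole words only matches whole words, so substring matching usually needs a different structure. A common one is a character trigram index, sketched below. Whether Firefly does anything like this is not stated in the thread; the class and method names are hypothetical.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of 3-character windows in the lowercased text."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


class TrigramIndex:
    """Toy substring search: trigram -> doc ids, then verify candidates."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(doc_id)

    def search(self, query):
        grams = trigrams(query)
        if grams:
            # Only documents containing every trigram of the query can match.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams))
        else:
            # Queries shorter than 3 characters fall back to a full scan.
            candidates = set(self.docs)
        needle = query.lower()
        # The trigram filter can give false positives, so confirm each one.
        return {d for d in candidates if needle in self.docs[d].lower()}
```

The index only narrows the candidate set; the final `in` check is what guarantees the query actually appears as a contiguous substring.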

escaped_hn 11y ago

If Dropbox is now letting you search and index your files, then they've been doing it for months.

21 comments · 9 top-level

majke 11y ago · 4 in thread

Aren't documents on dropbox supposed to be encrypted?

nmat 11y ago

It's server-side encryption; they can still access your files. That's how they generate thumbnails, for example.

mixedmath 11y ago

Only if you're the one encrypting them.

RussianCow 11y ago

They could index it before encrypting it. Still counts, right? :)

kolme 11y ago

No, not really. They lied.

https://en.wikipedia.org/wiki/Dropbox_%28service%29#Privacy_...

tomglindmeier 11y ago · 4 in thread

I can't help it, but I just don't want Dropbox or anybody else to read all my files.

jamesondh 11y ago

If you really didn't want Dropbox having access to your files, why would you upload them in the first place?
