Ask HN: How to run analytics on data without access to the data?

42 pointsmichealr5y ago46 comments

I have little service for personal use, and I was considering opening it up to a general audience. Right now its processing some of my personal data for fun little personal report, in particular chat data. Since its information I have access to already, I don't mind running the program locally. What I would like to be able to do is run an analysis for anyone and return the little report that I get for myself. Without having access to their data or storing it in the first place. I know with for example oauth scopes, you can grant access, which sort of fits the criteria. But I'm thinking more exported data from an application, that doesn't have delegated access functionality

How I envisioned a solution would be some trusted third party takes my analysis script, returns the report and that is it. I never see the underlying data and recieve only one time token to access it.

I know it will never be hundred percent leak proof, and there is still a level of user trust, I realise that, but just thinking conceptually, is there any existing service out there, that does such a thing or attempts to offer something similar? Or what would an alternative approach look like?

46 comments

BelenusMordred5y ago

> I know it will never be hundred percent leak proof

A slow leaking ship will still sink. Attempts so far to anonymise public datasets have been terrible and turned into a garbage fire by attackers every time with minimal effort. Don't hand out false promises.

Guess you are looking for fully homomorphic encryption. A long-outstanding problem with lots of smart people working on it, some are doing ok at getting there.

https://github.com/ibm/fhe-toolkit-linux

dataewan5y ago

Differential privacy is an area that makes some guarantees about not letting personal information about individuals escape. Might be a useful technique as well.

https://en.wikipedia.org/wiki/Differential_privacy

Agree that strong guarantees about privacy aren't achievable.

rapnie5y ago

Yes, I found this the other day: https://smartnoise.org/

Need to read more about the concept. Anyone with more good resources?

michealrOP5y ago

Thats true, a poorly chosen description on my part.

Very cool, had read about homomorphic systems. For fully homomorphic systems has there been successful SAAS like offering allowing use of such a systems? Or do you think its still in the research oriented phase?

maest5y ago

Not really SaaS, and debatable whether it's successful, but I hear numerai uses homomorphic encryption when providing data for quants to backtest.

https://numer.ai/ https://en.wikipedia.org/wiki/Numerai

EDIT: added qualifier, since i do not know for sure if numerai is using homomorphic encryption.

kasperni5y ago

Definitely still in the research phase for what you are looking for. Performance is anywhere from 5-15 magnitudes slower than ordinary computations. Yes, magnitudes...

cipherboy5y ago

I'm a little rusty but I swear I saw a partial homomorphic encryption scheme for aggregates and analytics. I want to say Enigma conference, '16 or '17? Maybe by Boston University.

The benefit being that while you can run any computatio with a FHE, PHEs are generally faster.

IIRC Microsoft was also doing research on PHEs.

michealrOP5y ago

Interesting, did Microsoft bring the research any further?

1 more reply

meowface5y ago

Your best bet is probably to just do all the processing locally in the browser. The issue is 1) from most end users' perspectives, they have no idea if it's actually running locally or talking to a server, or how to verify it, or probably what that difference even means in the first place, so a skeptical user won't necessarily gain that much additional peace of mind, and 2) hypothetically a compromise could still result in the local data being siphoned off by an attacker. The latter's still a risk for regular desktop applications, but a bit less so (since you can get a signed binary).

The homomorphic encryption approach probably isn't worth the effort. There's always going to be a trade-off between doing something useful and sufficiently/securely obfuscating/anonymizing the data. So I'd recommend the local approach, with a prominent explanation of how you don't and can't see any of the data.

hunter2_5y ago

This could work if the analytics engine is free and (ported to) JavaScript, but not if it's closed source. In the latter case, a trusted third party (escrow, one might call it) as OP described does seem like the way to go.

The problem is, why would end users trust the third party more than the analytics developer? Are there companies that specialize in being this third party and have amassed mutual trust of the general public (akin to a notary public) for handling data and code without leaking either?

michealrOP5y ago

Analytics wise, I'm ok with being restricted, other commenters have mentioned looking at WASM as a possible workaround. So local does seem to make the most sense, practicality wise

A thought, the possible scope of services in the data notary or data escrow side of things does seem like an underexplored product category.

1 more reply

nindalf5y ago

It doesn't need to be in JavaScript. Any language that can compile to WebAssembly would work too. But I agree with the broader point - the code needs to execute on the client, not the server.

satyrnein5y ago

Maybe a browser extension with limited permissions? Say the tool looks at Slack and counts how often you use the ROFL emoji. The extension could be granted access to *.slack.com but no other domains.

meowface5y ago

That could work, but then you have the additional barrier of having to convince people to install your browser extension, and for people who are already worried about privacy, that comes with its own can of worms. Especially if they don't necessarily understand or trust the permission model.

franky475y ago

I asked myself a similar question for web analytics a year ago [1]: how to provide a service without having access to the underlying data. It requires shifting the processing onto the client side, so it limits what you can do, but it's best for privacy, and security (since the data never leaves the native app or browser).

[1] https://chiffre.io

dumbfounder5y ago

Client side is the first answer, but is there a second? Is there a way to peer review a piece of code that can run in a 3rd party container (peer review and cryptographically signed), such that the actual container running the code is encrypted itself and can run anywhere?

I am imagining you download the "container", put the data in, encrypt the container with the data inside, and have that run anywhere.

But I have no idea if that is possible.

michealrOP5y ago

I wonder myself the same thing.

Thinking through issues, the external script could still repeatable run on the hidden data, slowly building an idea of the information. There are techniques like homomorphic encryption that go in the direction of allowing analysis on encrypted data.

Musing on possible other solutions, I wonder if simply ratching up the cost and repeated access and limiting data output would discourage this profile building.

Another possibility is it possible to concieve of the service, that takes in a script, runs it, and then tests the returned data for the level of information entropy. Blocking anything above a certain threshold. FYI not sure if that is complete nonsense, but conceptually, with much hand waving, maybe it works.

Going local though does help too

noisenotsignal5y ago

It's not really "run anywhere", but you can write apps for a trusted execution environment like Intel SGX enclaves; not even the OS can look at what's running. Enclave code is cryptographically signed so that you can both validate the identity of the signer as well as the code contents. In the latter, you'd have to compare the MRENCLAVE value to a published value, which you could reproduce by building from source if it's open.

Microsoft calls this "confidential computing" and has some related Azure products, including providing VMs standalone and in Kubernetes.

franky475y ago

That would be feasible with homomorphic encryption, however current implementations are very far from practical applications (extreme resource consumption, terrible performance).

1 more reply

jhoechtl5y ago

Doable if you think client side of Ethereum or IPFS

stelfer5y ago

Take a look at Google Private Join and Compute[1]. But be aware that the problem you frame is an unsolved research problem with an active global community. The topics you are looking for are applications of secure multiparty computation and homomorphic encryption. Also, be ready for something as simple as a column join to take 24 hours per query.

[1] https://github.com/Google/private-join-and-compute

rjmunro5y ago

This reminds me of https://opensafely.org/, which analyses NHS medical data for research purposes by asking Doctors to run queries on their patient databases and send back only summaries, e.g. "How many of your patients with HIV also had Covid19" https://github.com/opensafely/hiv-research

gopty5y ago

https://www.darpa.mil/program/programming-computation-on-enc...

Syzygies5y ago

https://mathscinet.ams.org/mathscinet/help/about.html "MathSciNet® is an electronic publication offering access to a carefully maintained and easily searchable database of reviews, abstracts and bibliographic information for much of the mathematical sciences literature. Over 125,000 new items are added each year..."

The stakes are lower when money, not privacy, is at risk. I have attempted to argue for years that the MathSciNet catalog of the mathematical literature should be open to all forms of machine learning and mind mapping software experiments. It remains a cash cow for the American Mathematical Society, and they're fiercely proud of its human curation by 19th century methods. Meanwhile, mathematicians continue to believe that math remains separated into tribes, with number theorists lobbying to hire their own at departmental meetings. The true connections between ideas defy these ancient categories. I see a generation of potential advances squandered by not letting third-party tools in to study MathSciNet.

The right ideas could help here. One isn't protecting individual privacy, just a cash cow. The bar is lower.

syats5y ago

I'll tell you about International Data Spaces Assocation, just for the sake of completeness, and because others have mentioned some sort of certification of apps, etc. Finding a general solution to the problem posed by OP is quite difficult, as it requires a lot of extra infrastructure, technical and non-technical.

One idea would be:

1. distribute to the data owners a base system (something that can "run" stuff on their premises). People here have mentioned browsers, but for a more intensive processing this might not be enough.. so think of a docker daemon, keys for some docker registries, etc.

2. have a trusted "app store" (e.g. a docker registry where images are built in a reproducible manner from code which is inspected and certified, and then are cryptographically signed)

3. make a well described interface to the apps to consume the data (thinking of the general use case here.. if you just want to analyze fb info then you can make an adhoc parser...)

4. Have the data owner download, check the signature of, configure and run the app on their premises.

Things get even more interesting when the analytics need data from different non-trusting partners, so that Homeomorphic Encryption becomes necessary.

There is at least one specification that aims at supporting all of this: https://www.internationaldataspaces.org/wp-content/uploads/2... although implementation is, so far, lagging behind.

alfl5y ago

We [0] are getting quite far decomposing algorithms symbolically and then doing some fancy footwork with private set intersection. It ends up being better/faster/cheaper than homomorphic in a lot of use cases.

Shoot us a note -- would love to hear more details.

[0]: https://proofzero.io

amai5y ago

It sounds like Federated Learning might be of interest for you:

https://federated.withgoogle.com/ https://en.wikipedia.org/wiki/Federated_learning https://github.com/poga/awesome-federated-learning

cedricd5y ago

There's another approach you can do -- make the analysis portable instead.

Assuming data is in a standard format then you can share your script for people to run themselves. Obviously this is fairly difficult in practice unless you can bundle everything into a client-side script on a website.

For reference Narrator [1] does this -- it puts data into a standard format so that analyses written for one company can be run for another. I'm not suggesting you build your stuff on that platform, but it's an interesting approach that does exist.

[1] https://www.narrator.ai

jedimastert5y ago

Either the first party (i.e. the client) runs the data on their own turf or they hand the data to someone else (you or whatever third-party you use) and trust that the other end is going to treat your data right.

I'm sure there's some sort of homomorphic encryption[0] magic scheme that might let you process the data on other servers or something, but I could not even begin to tell you how. Really, it's just trust.

brian_spiering5y ago

Differential privacy is the field of study for sharing sensitive data in a way that allows analysis while retaining some guarantees of privacy.

lmkg5y ago

Agreed, Differential Privacy is the name for this problem.

Quick summary of important results: You will always leak a small amount of information. But it is possible to bound this leak to whatever level you consider "acceptable." The trade-off is statistical validity of the results (the usual approach adds "noise" to the data and/or analysis).

JosephRedfern5y ago

How is the service written? I'd look to compile it down to WASM or otherwise run it in the browser, if possible.

michealrOP5y ago

Client side does make sense. I guess the user could upload there chat data zip file to the client side app. Which then locally would do the processing. The report itself could be saved, but not the data.

gostsamo5y ago

Adding the third party only complicates the issue because the user will have to trust you and the proxy, and the proxy will have to trust your code. Best case, let the user download your code as a mobile or desktop app and run the analysis themselves.

tjanez5y ago

You might want to check out Oasis' Parcel SDK: https://www.oasislabs.com/parcelsdk.

jhoechtl5y ago

What about Fully Homomorphic encryption? Would a FHE scheme enable to discover patterns without seeing the data?

michealrOP5y ago

I theory I assume it would, my bottleneck would be just knowledge. Just don't know enough about FHE to comfortably work with it. FHE as a service would be my little mini dream.

jhoechtl5y ago

There used to be an MIT CISAL research project I find no more traces which had as a project goal to establish FHE as a service.

Apparently it failed.

sgt1015y ago

Could you send your code to their execution environment for a one time run (unlocked with a code?)

j / k navigate · click thread line to collapse

46 comments

BelenusMordred5y ago

> I know it will never be hundred percent leak proof

Guess you are looking for fully homomorphic encryption. A long-outstanding problem with lots of smart people working on it, some are doing ok at getting there.

https://github.com/ibm/fhe-toolkit-linux

dataewan5y ago

Differential privacy is an area that makes some guarantees about not letting personal information about individuals escape. Might be a useful technique as well.

https://en.wikipedia.org/wiki/Differential_privacy

Agree that strong guarantees about privacy aren't achievable.

rapnie5y ago

Yes, I found this the other day: https://smartnoise.org/

Need to read more about the concept. Anyone with more good resources?

michealrOP5y ago

Thats true, a poorly chosen description on my part.

maest5y ago

Not really SaaS, and debatable whether it's successful, but I hear numerai uses homomorphic encryption when providing data for quants to backtest.

https://numer.ai/ https://en.wikipedia.org/wiki/Numerai

EDIT: added qualifier, since i do not know for sure if numerai is using homomorphic encryption.

kasperni5y ago

Definitely still in the research phase for what you are looking for. Performance is anywhere from 5-15 magnitudes slower than ordinary computations. Yes, magnitudes...

cipherboy5y ago

I'm a little rusty but I swear I saw a partial homomorphic encryption scheme for aggregates and analytics. I want to say Enigma conference, '16 or '17? Maybe by Boston University.

The benefit being that while you can run any computatio with a FHE, PHEs are generally faster.

IIRC Microsoft was also doing research on PHEs.

michealrOP5y ago

Interesting, did Microsoft bring the research any further?

1 more reply

meowface5y ago

hunter2_5y ago

michealrOP5y ago

Analytics wise, I'm ok with being restricted, other commenters have mentioned looking at WASM as a possible workaround. So local does seem to make the most sense, practicality wise

A thought, the possible scope of services in the data notary or data escrow side of things does seem like an underexplored product category.

1 more reply

nindalf5y ago

It doesn't need to be in JavaScript. Any language that can compile to WebAssembly would work too. But I agree with the broader point - the code needs to execute on the client, not the server.

satyrnein5y ago

Maybe a browser extension with limited permissions? Say the tool looks at Slack and counts how often you use the ROFL emoji. The extension could be granted access to *.slack.com but no other domains.

meowface5y ago

franky475y ago

[1] https://chiffre.io

dumbfounder5y ago

I am imagining you download the "container", put the data in, encrypt the container with the data inside, and have that run anywhere.

But I have no idea if that is possible.

michealrOP5y ago

I wonder myself the same thing.

Musing on possible other solutions, I wonder if simply ratching up the cost and repeated access and limiting data output would discourage this profile building.

Going local though does help too

noisenotsignal5y ago

Microsoft calls this "confidential computing" and has some related Azure products, including providing VMs standalone and in Kubernetes.

franky475y ago

That would be feasible with homomorphic encryption, however current implementations are very far from practical applications (extreme resource consumption, terrible performance).

1 more reply

jhoechtl5y ago

Doable if you think client side of Ethereum or IPFS

stelfer5y ago

[1] https://github.com/Google/private-join-and-compute

rjmunro5y ago

gopty5y ago

https://www.darpa.mil/program/programming-computation-on-enc...

Syzygies5y ago

The right ideas could help here. One isn't protecting individual privacy, just a cash cow. The bar is lower.

syats5y ago

One idea would be:

2. have a trusted "app store" (e.g. a docker registry where images are built in a reproducible manner from code which is inspected and certified, and then are cryptographically signed)

3. make a well described interface to the apps to consume the data (thinking of the general use case here.. if you just want to analyze fb info then you can make an adhoc parser...)

4. Have the data owner download, check the signature of, configure and run the app on their premises.

Things get even more interesting when the analytics need data from different non-trusting partners, so that Homeomorphic Encryption becomes necessary.

There is at least one specification that aims at supporting all of this: https://www.internationaldataspaces.org/wp-content/uploads/2... although implementation is, so far, lagging behind.

alfl5y ago

Shoot us a note -- would love to hear more details.

[0]: https://proofzero.io

amai5y ago

It sounds like Federated Learning might be of interest for you:

https://federated.withgoogle.com/ https://en.wikipedia.org/wiki/Federated_learning https://github.com/poga/awesome-federated-learning

cedricd5y ago

There's another approach you can do -- make the analysis portable instead.

[1] https://www.narrator.ai

jedimastert5y ago

brian_spiering5y ago

Differential privacy is the field of study for sharing sensitive data in a way that allows analysis while retaining some guarantees of privacy.

lmkg5y ago

Agreed, Differential Privacy is the name for this problem.

JosephRedfern5y ago

How is the service written? I'd look to compile it down to WASM or otherwise run it in the browser, if possible.

michealrOP5y ago

gostsamo5y ago

tjanez5y ago

You might want to check out Oasis' Parcel SDK: https://www.oasislabs.com/parcelsdk.

jhoechtl5y ago

What about Fully Homomorphic encryption? Would a FHE scheme enable to discover patterns without seeing the data?

michealrOP5y ago

I theory I assume it would, my bottleneck would be just knowledge. Just don't know enough about FHE to comfortably work with it. FHE as a service would be my little mini dream.

jhoechtl5y ago

There used to be an MIT CISAL research project I find no more traces which had as a project goal to establish FHE as a service.

Apparently it failed.

sgt1015y ago

Could you send your code to their execution environment for a one time run (unlocked with a code?)

j / k navigate · click thread line to collapse