Show HN: I made a neural net that analyzes privacy policies

(useguard.com)

502 points · rameerez · 6y ago · 120 comments

rameerezOP6y ago· 20 in thread

Hi guys!

So as the basis for my thesis on AI and NLP I've been working on an RNN-based text classifier that basically reads and analyzes privacy policies. It understands that "we don't share your data with third parties" is privacy friendly while "we may share your data with anyone" is a potential threat.

I've then created this website with a bunch of analyzed services to showcase the most relevant info about each service along with other interesting stuff like recent data breaches or instructions to delete your account in said service.

Happy to answer Qs about the tech behind, it'd also be great to hear your feedback on what the site lacks and possible improvements!

thecleaner6y ago

Oh my god thank you so much for doing this!

I think it's better (for me as a user) if you don't boil things down to a score, as different people expect different things when talking privacy. It would help if you could simply highlight the potentially problematic clauses in different privacy statements along with some reason why each might be problematic.

nickodell6y ago

I don't agree. I just looked in my password manager, and I have roughly 220 accounts across the web. If I want to go through that list and see which websites rank well and which rank poorly, and I want to do that in under two hours, that gives about 30 seconds per service.

In other words, giving a single score plus a two-sentence highlight is probably about the right amount of information.

4 more replies

blondin6y ago

well, just checked and i already see the majority of the web gathering around "C". so really, we can argue both ways about this score thing...

1 more reply

epoch_1006y ago

For a site trying to fight for privacy, don't you think it would be better to not use Google Analytics to track the people who visit your site?

rameerezOP6y ago

Yes, definitely. I mention it in Guard's own privacy policy. I don't like using it either, but the reasons are: (a) it's the simplest and, as far as I know, one of the few free analytics tools available, (b) not having a measure of the website activity would leave me effectively blind and unable to make decisions, (c) I don't send any personally identifiable events (and, for that matter, I don't send any events apart from page loaded events). I'm also open to suggestions to replace GA.

3 more replies

brianberns6y ago

How did you create a data set large and accurate enough to be useful in training a model?

rameerezOP6y ago

Some friends run an AI bootcamp and helped me find the initial set of users to help me with labelling. The initial labelled data was generated mostly through them, both by manual labelling and with the approach described in https://useguard.com/experiment. Also, the model I'm using relies heavily on transfer learning and achieves very reasonable results with few labelled items (the paper in which the technique is described actually maintains that with only 100 labelled examples they reached comparable results to using 10x that data in models that use older approaches).

1 more reply

joshspankit6y ago

Important work, and this seems to be doing a decent job already. Cheers. One thing about the teaching: some sentences don’t have anything to do with privacy, so there might be a button to train the AI on that.

jsingleton6y ago

Looks cool. It's a small point, but not pluralising words when the number is 1 shows attention to detail (e.g. 1 scandals for instagram).

rameerezOP6y ago

Thanks for the heads up!

hanoz6y ago

> It understands that "we don't share your data with third parties" is privacy friendly while "we may share your data with anyone" is a potential threat.

Does it understand "we don't share your data with just any old third parties", or "we're not like our competitors who may share your data with anyone"?
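
To make the concern above concrete, here is a deliberately naive keyword baseline. It is not the OP's model (which is described in the thread as RNN-based); the cue lists and the `naive_label` helper are hypothetical and exist only to show why sentences like these two are hard for anything that matches surface phrases.

```python
# Hypothetical cue lists for a surface-level baseline (NOT the OP's classifier).
FRIENDLY_CUES = ("don't share", "do not share", "never share")
RISKY_CUES = ("may share", "share your data with anyone")


def naive_label(sentence: str) -> str:
    """Label a sentence by the first matching cue, ignoring all context."""
    text = sentence.lower()
    if any(cue in text for cue in FRIENDLY_CUES):
        return "friendly"
    if any(cue in text for cue in RISKY_CUES):
        return "threat"
    return "unknown"


examples = [
    "we don't share your data with third parties",
    "we may share your data with anyone",
    "we don't share your data with just any old third parties",
    "we're not like our competitors who may share your data with anyone",
]

for sentence in examples:
    print(f"{naive_label(sentence):8} | {sentence}")
```

The baseline handles the OP's two examples, but it calls the "just any old third parties" sentence friendly (even though it implies sharing with some third parties) and calls the "not like our competitors" sentence a threat (even though it is a friendly claim). Whether a trained model does better is exactly the open question here.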

davidkuhta6y ago

Not OP, but curious, are those quotes actually from privacy policies or just hypotheticals?

1 more reply

domnomnom6y ago

How do we know it isn't just people doing the analysis if we can't actually use the AI ourselves?

rameerezOP6y ago

I'm most probably publishing a paper later this year detailing the process. Also, the 80 pages of my thesis would have loved it if this hadn't involved AI, to make the whole thing simpler :)

Pete-Codes6y ago

Very nice! The fact you mentioned Tinder's T+C on Twitter got my attention.

goldemerald6y ago

Does your work integrate pretrained LMs like BERT or GPT2?

rameerezOP6y ago

Yes, but not Transformer-based like these two, rather LSTM-based like ULMFiT

2 more replies

godelmachine6y ago

>>”along with other interesting stuff like recent data breaches”

Do you have an integration with HaveIBeenPwned?

crusty5116y ago

> Happy to answer Qs about the tech behind [...]

What tech is involved to get something like this on the web?

burtonator6y ago

Hey... can you reach out to me... I'm kevin ... at datastreamer.io (trying to hide that from spam but I think you can avoid that).

I'm working on something similar at the moment for a client.

Right now I'm just starting out but I've built a privacy policy classifier that is an RNN classifier based on TensorFlow that is just an 'is this a privacy policy' classifier.

I have about 650MB of privacy policies at the moment which I fetched via a crawler. I'm just about to classify the rest of them.

I'm trying to automate the whole thing so that we have a full workflow.

Anyway... ping me to discuss.

the_pwner2246y ago· 4 in thread

I went to the homepage and ctrl-clicked a company card (Mozilla) to open the details in a new tab. Instead, the site hijacked my ctrl-click and tried to navigate to the link in the same tab. Middle-clicking does not work at all.

It then went to an error page instead of loading the details for Mozilla, but while it's an interesting idea I'm not sure how useful it is. I don't usually create an account on a website unless I have to do so, and the privacy policy is nonnegotiable. So why would I want to check what their privacy policy is?

I only give personal data to websites when I have to (e.g. to services that work or school use) or if I already trust the company to not do anything shady with it (Mozilla has done some sketchy stuff but I believe they won't leak my passwords).

And for websites where participation is more optional, like HN or Reddit, you don't usually need to give much personal data anyway.

Edit: the website is fully working now. Mozilla has had one security breach where emails and hashed passwords were leaked, in 2014. At the bottom the sentence breakdown is 2.5/12/22% concerning/mild/friendly. Meanwhile Reddit has no breaches, but keeps your messages forever and shares data with ad companies. Their sentence breakdown at the bottom is 3.3/23/9%. Overall, the AI rates Mozilla at 33% and Reddit at 41%. That doesn't really make sense to me.

I would really like to see more details about the privacy policy sentences on the website. If 2.5% of Mozilla's privacy policy is very concerning and 12% is mildly bad, I would like to see the actual sentences to know the risks. There is a button to view the full annotated policy, but clicking it says to send an email to you. Edit: this seems like a bug, it shows a few sentences in a WebKit-based browser [Falkon] but in Firefox it just shows the chain link icons.

Finally, I took the A/B test from Guard, and quite a few A/B choices seemed to not really have anything to do with privacy. If the dataset is kept the same, then I think a different test format would be to rate each A and B snippet as:

- Not about privacy

- Good for privacy # only if the other one is not about privacy

- Bad for privacy # only if the other one is not about privacy

- Better than [A/B] # only if neither is not about privacy
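
One way to read the four options above is as a small label set whose available choices depend on the other snippet. The sketch below encodes that reading; the `SnippetLabel` and `allowed_labels` names are made up for illustration, and it assumes the annotator has already decided whether the other snippet is about privacy.

```python
from enum import Enum


class SnippetLabel(Enum):
    NOT_ABOUT_PRIVACY = "Not about privacy"
    GOOD_FOR_PRIVACY = "Good for privacy"
    BAD_FOR_PRIVACY = "Bad for privacy"
    BETTER_THAN_OTHER = "Better than the other snippet"


def allowed_labels(other_is_about_privacy: bool) -> set[SnippetLabel]:
    """Return the labels that may be offered for one snippet of an A/B pair."""
    # "Not about privacy" is always available.
    labels = {SnippetLabel.NOT_ABOUT_PRIVACY}
    if other_is_about_privacy:
        # Both snippets are about privacy, so only a comparison makes sense.
        labels.add(SnippetLabel.BETTER_THAN_OTHER)
    else:
        # Nothing to compare against, so rate this snippet on its own.
        labels |= {SnippetLabel.GOOD_FOR_PRIVACY, SnippetLabel.BAD_FOR_PRIVACY}
    return labels


print(sorted(label.value for label in allowed_labels(other_is_about_privacy=False)))
print(sorted(label.value for label in allowed_labels(other_is_about_privacy=True)))
```

A scheme like this would also give annotators the skip option that other commenters ask for, since an off-topic pair simply gets "Not about privacy" twice.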

Anyway, the data itself may be somewhat useful to me if I want to learn more about a company's privacy practices. But for normal people, I think it would be helpful for the website to also explain why privacy is important and why people should care.

simongr3dal6y ago

More and more websites are so advanced that they can’t even use an <a> tag anymore. Instead they do some convoluted onclick-scripting that breaks all standard behavior and accessibility functionality.

notyourwork6y ago

I think complicated is a better choice than advanced. Intentionally complicated without adding value in a lot of cases.

1 more reply

gempir6y ago

I think this is mostly a problem of the tools used. An anchor tag by definition is an inline element, so it shouldn't really be a giant box that's clickable, so you default back to an onClick.

onClick in popular frameworks just means left click and nothing more, which makes sense, except for that use case of opening in a new tab with middle mouse button or right clicking etc. So you have to add a lot of logic to support all that.

If you break the html spec and make an anchor tag a block element then you have to deal with catching and stopping the event from it otherwise it would work as a normal link but you actually just want to change state in your JS app.

So I think tools like Angular, React, Vue etc. should get a better way to create links on websites that just change state.

2 more replies

Swizec6y ago

> I only give personal data to websites when I have to

This is a great policy that I think more people should use. Not everything needs your real name, real birthday, or your real home address. Definitely not your real phone number. You often do need a real email address.

> you don't usually need to give much personal data anyway

This is where it gets tricky. Anonymized aggregate data can be surprisingly identifying. You only need 33 bits of information to uniquely identify any individual in the world. If your IP tells me you're from San Francisco, then I need just 20 bits of information to uniquely identify you.

Data mining 20 yes/no answers about one of your users is ... pretty easy.
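
The 33-bit and 20-bit figures above follow from counting how many yes/no answers are needed to tell everyone in a group apart. A quick check, assuming rough population figures of about 7.8 billion for the world and about 880,000 for San Francisco (both numbers are assumptions, not from the comment):

```python
def bits_to_identify(population: int) -> int:
    """Smallest number of yes/no answers that can single out one member of a group."""
    # (population - 1).bit_length() equals ceil(log2(population)) for population >= 1,
    # computed with exact integer arithmetic.
    return (population - 1).bit_length()


WORLD_POPULATION = 7_800_000_000      # assumed rough figure
SAN_FRANCISCO_POPULATION = 880_000    # assumed rough figure

print(bits_to_identify(WORLD_POPULATION))          # 33
print(bits_to_identify(SAN_FRANCISCO_POPULATION))  # 20
```

Each independent yes/no fact halves the candidate pool, which is why knowing a city alone removes about 13 of the 33 bits.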

epoch_1006y ago· 4 in thread

Hi. A related project that takes a more human-powered approach is PrivacySpy (https://privacyspy.org). Would be neat to see how these tools intersect.

PrivacySpy is open source, community run, and more about grading policies on a standardized rubric (as opposed to entrusting that to ML), so these tools might complement one another.

(Full disclosure: I'm a contributor to PrivacySpy.)

jchook6y ago

Another one: https://tosdr.org/

epoch_1006y ago

ToS;DR is great, although it's more focused on terms of service so if you're looking for privacy-only info, you'll have to cut through a bit of noise.

abathur6y ago

On the off chance the various responders in this sub-thread see this comment, for a while I've hoped someone would advocate for privacy/TOS policies to follow a similar model to OSS licensing.

Normal people could conceivably read and understand a given policy if the knowledge scaled.

Any substantial adoption would help focus effort/resources on services that deviate from the terms.

Sites that are already repositories of this knowledge could play some part in codifying best-practices, advocating for adoption, and tracking progress.

shlant6y ago

We have something similar as well over at https://privacy.commonsense.org/.

We have taken different ideas from many different implementations and applied them more specifically to ed tech products.

the_watcher6y ago· 4 in thread

Telegram has a 105% score? Is that expected or a bug?

rameerezOP6y ago

Bug, one of the components of the score is not properly normalized I think. I'll fix it as soon as I handle the traffic overload :)
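
The thread does not say how the score is actually computed, so the following is only a guess at the kind of bug described: a weighted sum where one component is not kept in the 0..1 range. The component names, weights, and the `overall_score` helper are all hypothetical.

```python
def clamp01(value: float) -> float:
    """Force a component into the 0..1 range."""
    return max(0.0, min(1.0, value))


def overall_score(components: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of clamped components, expressed as a percentage."""
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * clamp01(value) for name, value in components.items())
    return 100.0 * weighted / total_weight


# Hypothetical components; "breach_history" exceeds 1.0 to mimic a missing normalization.
components = {"policy_text": 0.95, "breach_history": 1.4, "data_retention": 1.0}
weights = {"policy_text": 0.5, "breach_history": 0.25, "data_retention": 0.25}

print(f"{overall_score(components, weights):.1f}%")
```

Without the clamp, the same inputs would produce a score above 100%; with it, the result always stays between 0% and 100%. Clamping hides the symptom, so normalizing the offending component at its source would still be the better fix.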

rameerezOP6y ago

It should be pretty close to 100%, though, they seem to be super privacy friendly!

the_watcher6y ago

Yep, I read their policy. I just wasn't sure if extra credit existed.

contravariant6y ago

>BIGGEST THREAT: «We never delete your funny cat pictures, we love them too much»

I don't know, seems like a GDPR violation to me ;-).

nerdponx6y ago· 3 in thread

This is great, I've been wanting to do a project like this for a while.

Would be great to get some insight into your data collection/labeling and model design process.

rameerezOP6y ago

Some of the process on gathering the data to create the labelling dataset is described here: https://useguard.com/experiment

But most probably I'll be publishing a paper later this year detailing all the details and process :)

underanalyzer6y ago

Very cool idea! One question I couldn’t seem to find the answer to on your site: are the policies featured in your A/B training exercise distinct from the policies that the AI grades? For example, will a user going through your A/B trainer ever see a snippet from the Instagram privacy policy?

1 more reply

ignoramous6y ago

Where can we follow you to know abt the paper when you do eventually publish?

1 more reply

paktek1236y ago· 3 in thread

I get ERR_SSL_PROTOCOL_ERROR

rameerezOP6y ago

Some other user reported that same thing this morning but I couldn't find any explanation to this err. It basically works for everyone except for these two precise cases. One idea I have is that you might be behind some sort of firewall that's blocking my website (because in the past either the IP or the domain got flagged by some antivirus company and now some business networks block it) – might this be the case?

jazzyjackson6y ago

I'm not an expert but I wonder if this error comes up in the case of the HTTPS handshake not being able to agree on a protocol -- one side of the transaction is trying to insist on a crypto protocol that's out of date or a little too fashion-forward?

Just a thought.

nl6y ago

When reporting this, it's useful to include your browser and exact version number. Most browsers ship updated SSL cert packs with new versions, and debugging can be hard without this information.

foxes6y ago· 2 in thread

The "game" (training) that asks you to analyse "privacy threats" is a bit strange. It feels like it takes two random excerpts from a privacy policy and asks you to compare them, but with this, it feels like it is missing some global information, you are just looking at local details.

Eg one policy might be disclosing what they do (but it's actually relevant to collect data, eg password manager) while the other just says "no we don't collect anything". In this case one feels like it's a better option, but it's not exactly the same situation, it's missing some context. I feel like this could potentially bias ratings.

I'm not sure if you could add in extra information with some of that global information, eg the type of service, classifying different "parts" of the privacy policy etc.

raynr6y ago

Yeah, it's not comparing like for like. Feels like the system is trying to collect training data from users.

To elaborate on your last sentence, context is critical in assessing whether a clause is pro- or anti- privacy. Is the collection of information critical to the provision of the service? What is collected, and how much? And so on.

mceachen6y ago

Agreed with parent and gp. I gave up with the 10 questions, as the sentence comparisons were almost comically incomparable. I fear your model is going to be a random number generator.

1f60c6y ago· 2 in thread

I’m trying to help teach the AI, but some options don’t have to do with privacy at all.

For instance, I got (option A):

  The Games Press Web site can, optionally, store a Cookie on your computer in order to automatically log you into the site on each visit.

and option B:

  Back to Top ^ NO THIRD-PARTY BENEFICIARIES There shall be no third-party beneficiaries to this Agreement.

It would be great to have a skip/flag option for cases like these.

Edit to add:

Some other notes:

  - Mozilla, a tech company I consider ethical, is right down there with Netflix, LinkedIn and Waze
  - The box under “Sentence Breakdown by Risk Level” is empty when my ad blocker is enabled (Adguard on iOS Safari)
  - Telegram, a company I also consider ethical, has a score of 105%—is this an oversight?

quickthrower26y ago

And sometimes you get two privacy friendly policies and I'd like to say "equal".

After a few of those I picked one at random and then it puts me onto some kind of 'Game' which made me feel they were trying to train me instead of vice-versa. The game didn't respond to clicks so I closed the app.

krageon6y ago

The fact that you consider companies ethical is neither here nor there for the privacy score of their policy though. You haven't made a very strong case as to why your opinion needs to impact the metrics (or really given any justification for your feelings at all).

As to your two options having nothing to do with privacy: The fact that there cannot be any third-party beneficiaries is in fact a baldly privacy-friendly statement, because a third party is yet another party that may influence a privacy policy in a net-negative way.

nurettin6y ago· 2 in thread

offtopic: Is it worth using producthunt for games? Or do we concentrate our efforts on platforms like steam, appstore and play store?

lucb1e6y ago

Why are you posting that to this thread? I know you said off topic, but usually when someone says that, it's still related: e.g. if the parent comment mentioned something tangential, or I could imagine if the link was to producthunt and you wondered about games on that platform... are you hijacking this thread for a completely unrelated question that you want to ask the HN audience, or is there a connection I'm missing?

nurettin6y ago

They are on producthunt.

1 more reply

carbocation6y ago· 1 in thread

Hmm, I was trying to go through your A/B options but it didn't seem to register any click. So I started clicking repeatedly. Then, it processed those clicks on the first 7 items, giving you bad data. FYI.

rameerezOP6y ago

This just hit the frontpage so the server might be a bit overloaded, I'm sorry. Trying to resize resources right now. Thanks for the heads up, in the long term (ideally) noise shouldn't be a huge problem in a sufficiently large dataset (or at least I'm already expecting some noise haha)

olafure6y ago· 1 in thread

Tinder and Mozilla have the same score of 33%. I don't agree with that. Tinder willingly shares very, very personal data along with your contacts (presumably without said contacts' consent).

scarlac6y ago

Keep in mind that it's analyzing their privacy policy, not their actions. Who a company specifically chooses to share the data with and how often they do it is likely not considered.

michaelaiello6y ago· 1 in thread

Hello!

http://www.privacyparrot.com/

Made a similar project many moons ago and is still kicking along. Thanks all on HN for the feedback.

ignoramous6y ago

I used it as recently as a month ago! Thanks.

R888D06y ago· 1 in thread

You're doing incredibly noble work. I've bookmarked the site and will make great use of it. Thank you for working hard to protect data privacy, I hope you never stop.

rameerezOP6y ago

Thank you for these words :)

Angostura6y ago· 1 in thread

I'm slightly puzzled that Telegram is scoring more than 100%. What am I missing?

rameerezOP6y ago

A bug in the scoring algorithm that I need to fix

A4ET8a8uTh06y ago· 1 in thread

I will join the chorus of great work. I love it. It may even make some people more privacy conscious ( very few people read those -- usually the ones who wrote it ).

rameerezOP6y ago

Thank you! :) I've read some recent research and it looks like this is actually measured: only 0.001% of all internet users start reading them (and an even smaller number of people likely finish reading them). On top of that, if you had to read all the privacy policies you accepted in the past 5 years alone, you would need 3.040 hours of non-stop reading. Crazy. Love your privacy-oriented username btw! ;)

artificial6y ago· 1 in thread

This is really cool, what prompted the idea to combine the two?

rameerezOP6y ago

Thanks! Last semester I did a bootcamp on Artificial Intelligence and I had to do a final project. Last year I started becoming concerned about digital privacy when I discovered Facebook had an updated copy of your phone contacts including nicknames [1] (which basically means strangers at Facebook know the names I call my GF). I later found out this was explicitly said in FB's privacy policy. So when I did the bootcamp and discovered how powerful RNNs are to model the complexity of the English language I came up with the idea.

[1] https://news.ycombinator.com/item?id=16661735

verbify6y ago· 1 in thread

Great work! Would also work as a browser plugin.

rameerezOP6y ago

This was actually one of the ideas to evolve the project! Another one is making an app that protects your digital privacy from these threats, kinda like an antivirus but for privacy threats instead of viruses (https://useguard.com/blog/future/). Would love to hear feedback on what this project should become next :)

helloiloveyou6y ago· 1 in thread

Very creative project! Congratulations!!

rameerezOP6y ago

Thank you! :)

greggman26y ago

Analyse LastPass. Even without ML it's clearly saying they spy on everything you do and share it with anyone they want. I'm surprised they get recommended so often given their privacy policy.

https://www.logmeininc.com/legal/privacy

the_watcher6y ago

This is interesting, but I wonder what a boilerplate privacy policy would score. Given the clustering of scores between 30 & 50% for what read like "our lawyers pulled the standard policy and billed us for 4 hours of minor tweaks", it seems like some of the most effective privacy advocacy would come by challenging the most common "dangerous sentences" in court.

tboyd476y ago

Hi there. This is really neat. It reminds me of a talk I listened to recently on digital privacy, where the guy was using the price of "privacy products" as a way to measure how much people value their privacy. This seems like it would be one of those.

Did you find that there's one single variable, like length or the presence of certain words, that the system relies on heavily?

BrS96bVxXBLzf5B6y ago

This is a good tool, but the execution of the website is disappointing.

1. Only showing excerpts of the highest threat levels. Trying to view the less severe threats asks us to email in. If you're willing to volunteer the information, why the hoops?

2. "Play a short game to continue using this tool" ensures I'm not going to share this with anyone. Putting a stranglehold on users is _never_ the way forward. I might have volunteered my time if I were at home and browsing through. I can't when I'm quickly flicking through while taking a five-minute break from work. But it's left me with a final negative impression before being unceremoniously blocked off.

lab000026y ago

Holy crap, Tinder shares your profile with potential employers (or rather companies contracted by those employers).

So glad I gave up on online dating a long time ago.

But really this is amazing.

The next step would be to have a lawyer write a small opinion piece on the most popular sites.

murukesh_s6y ago

Cool tech and good use of NLP, but isn't the privacy policy system entirely broken? It's like driving on unmarked, unpaved roads. Why don't we have a global template that lists, at the beginning, a checklist that is quickly human-understandable, like - Do we share your data : (Y/n)

Etc

Without something like that, the problem is that companies could change the wording and the neural net could not detect it until trained again, which is potentially more dangerous!

It's time we get some standard for user web privacy policy docs, like GDPR

martin__6y ago

Hello, if the webmaster for this site is reading this, your `change.org` file is getting a Content-Type of `application/octet-stream` instead of `text/html`, which is giving me (in firefox) a prompt to download a file instead of displaying the page.

The problem is probably with your type mappings: https://nginx.org/en/docs/http/ngx_http_core_module.html#typ...

19ylram496y ago

To the creator:

I actually just went through the trouble of resetting my Product Hunt account (I haven’t been on there in a long time) just to give you that upvote!

Thank you for this! Cheers.

andrerm6y ago

Great work. Thank you. If I may, don't show badly scored results on the first page. Bad advertising is still advertising, and we all assume they are all bad anyway. I understand the surprise people have at first, but the next step is finding good apps. Also, sort by grade and search by name are a must.

Again, thank you for your time and work.

Edit: telegram had hacking news this year, search for "Telegram voicemail account hijacking"

brenden26y ago

The need for things like this is partly why I quit my job 6 months ago to start my own company (in profile). We need to start building companies and products that provide valuable communities and services (like social networks) without the need for ads/privacy violations. My belief is that the issue is largely related to incentive misalignment (users != customers).

Congrats on the launch!

VvR-Ox6y ago

Though it's nice you built this I'd wish we wouldn't need it because of governments who do their job and protect the people they were originally intended to serve.

How long can we keep up arming us in the battle against powerful and rich entities who steal our data and buy politicians to have direct access to the process of law-making?

eleen6y ago

Hi, I like what you did. I am a trained German lawyer working with NLP at a Computer science faculty. I would like to talk about this with you, as I am very interested in this topic. Also, I've been working for the German Data protection Agency...

a2x6y ago

Letting people do the training without any knowledge or context seems to make no sense.

How do you ensure that it isn’t a bot, or an army of bots?

What is the neural network doing?

How do you ensure your personal integrity, that of your team members and the overall integrity of your ‘system’?

nuc6y ago

Where is Facebook? I guess that would be one of the most interesting policies to analyze.

dastx6y ago

Great website. Quick note - your site doesn't function without javascript. Having enabled it and trawled through it I see no reason to require js to be enabled. Adding a non-js fallback would be great.

29athrowaway6y ago

A great idea. Some ideas for features:

- Keep a timeline of privacy policy changes, being able to compare scores between 2 versions.

- Subscribe to notifications for changes in score.

- Browser extension that shows you the score on the site.
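
The first idea in the list above (a timeline of policy changes) could start from a plain text diff between two stored versions. Here is a minimal sketch using Python's standard `difflib`; the `policy_changes` helper and the two sample policies are invented for illustration, and it assumes each version is stored with one sentence per line.

```python
import difflib


def policy_changes(old_version: str, new_version: str) -> list[str]:
    """Return removed ("- ") and added ("+ ") lines between two policy versions."""
    diff = difflib.ndiff(old_version.splitlines(), new_version.splitlines())
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; keep only real changes.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Made-up sample policies, one sentence per line.
old_policy = (
    "We do not share your data with third parties.\n"
    "We delete your data on request."
)
new_policy = (
    "We may share your data with partners.\n"
    "We delete your data on request."
)

for change in policy_changes(old_policy, new_policy):
    print(change)
```

Only the changed sentences would then need to be re-scored, and a notification could fire whenever the resulting score moves.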

aledalgrande6y ago

Love the "biggest threat" for Telegram!

jbduler6y ago

I love the work, and it is directly applicable to the work I am doing now. Have you published your thesis?

lingrino6y ago

I own the domain policies.dev and would be happy to hand it over to this project if you’re interested.

foxhop6y ago

what if you took this idea and used it to normalize privacy policies into a normalized form?!

TheUSSR6y ago

Great site! Would be nice to have the option to submit scandals as some seem to be missing.

quickthrower26y ago

This’d be great for contracts too. E.g. an NDA.

privasim6y ago

This is amazing! How would you categorize this?

privasim6y ago

This is amazing, how would you categorize this?

godelmachine6y ago

How are you better than Firefox Monitor?

ignoramous6y ago

This is great but like tosdr.org before it, what does it tell people they already don't assume to be true? I use tosdr only to often keep ignoring what it's telling me.

Also, I'm not sure if a majority care abt privacy when the value delivered is super high. They submit to the will of the service provider, as if it was the cost of doing business without realising they could either look for alternative or exercise stricter control over what they share and how [0].

To that end, I like tools that let users take action in addition to showing what's wrong rather than simply point it out. Actions can include:

- Replace: Push the users towards alternatives and help them seamlessly take their data elsewhere.

-- Help change their usage behaviour. Most digital-wellbeing / internet de-centralization tech fall under this category?

-- Translate / Pipe data exported from one service provider and import it into another. For instance, it is tiresome to move away from wordpress to ghost.org; or from WhatsApp to Signal. Emails work great.

- Reduce: Hand-hold them as they grasp various privacy and security settings on offer and exercise them, as appropriate.

-- JumboPrivacy does this for popular social networks.

-- The PrivacySettings plugin for Firefox is another example.

-- Plenty write blog posts to help others navigate arduous settings across popular web properties, and expect nothing in return.

- Restrict: Provide tools that let them control what the services can and cannot collect.

-- Application sandboxes like firejail / sandboxie, firewalls like Snitch / LuLu, DNS based content blockers like pi-hole, in-browser content blockers like uBlockOrigin are some examples.

[0] One of the first questions folks asked after an exodus-privacy (which is super nice and something I use every other month) presentation at fosdem was, 'What can I do now that you've exposed what apps on Android do with the permissions granted to them and the SDKs they embed?' exodus-privacy, as great as it is, doesn't let you take action but presents a nice overview of the dangers to your privacy due to the app you've installed. Instead, you might end up having to independently discover and install Blokada or AdAway or Pi-Hole or NetGuard or XPrivacyLua or microG or GrapheneOS or...

acgan6y ago

Very cool! Reminds me of https://tldrlegal.com/ and https://fossa.com/.

j / k navigate · click thread line to collapse

120 comments

98 comments · 45 top-level

rameerezOP6y ago· 20 in thread

Hi guys!

Happy to answer Qs about the tech behind, it'd also be great to hear your feedback on what the site lacks and possible improvements!

thecleaner6y ago

Oh my god thank you so much for doing this !

nickodell6y ago

In other words, giving a single score plus a two-sentence highlight is probably about the right amount of information.

4 more replies

blondin6y ago

well, just checked and i already see the majority of the web gathering around "C". so really, we can argue both ways about this score thing...

1 more reply

epoch_1006y ago

For a site trying to fight for privacy, don't you think it would be better to not use Google Analytics to track the people who visit your site?

rameerezOP6y ago

3 more replies

brianberns6y ago

How did you create a data set large and accurate enough to be useful in training a model?

rameerezOP6y ago

1 more reply

joshspankit6y ago

jsingleton6y ago

Looks cool. It's a small point, but not pluralising words when the number is 1 shows attention to detail (e.g. 1 scandals for instagram).

rameerezOP6y ago

Thanks for the heads up!

hanoz6y ago

> It understands that "we don't share your data with third parties" is privacy friendly while "we may share your data with anyone" is a potential threat.

Does it understand "we don't share your data with just any old third parties", or "we're not like our competitors who may share your data with anyone"?

davidkuhta6y ago

Not OP, but curious, are those quotes actually from privacy policies or just hypotheticals?

1 more reply

domnomnom6y ago

How do we know it isn't just people doing the analysis if we can't actually use the AI ourselves?

rameerezOP6y ago

I'm most probably publishing a paper later this year detailing the process. Also, 80 pages of my thesis would have loved this wouldn't have involved AI to make the whole thing simpler :)

Pete-Codes6y ago

Very nice! The fact you mentioned Tinder's T+C on Twitter got my attention.

goldemerald6y ago

Does your work integrate pretrained LMs like BERT or GPT2?

rameerezOP6y ago

Yes, but not Transformer-based like these two, rather LSTM-based like ULMFiT

godelmachine6y ago

> "along with other interesting stuff like recent data breaches"

Do you have an integration with HaveIBeenPwned?

crusty5116y ago

> Happy to answer Qs about the tech behind [...]

What tech is involved to get something like this on the web?

burtonator6y ago

Hey... can you reach out to me... I'm kevin ... at datastreamer.io (trying to hide that from spam but I think you can avoid that).

I'm working on something similar at the moment for a client.

Right now I'm just starting out but I've built a privacy policy classifier that is an RNN classifier based on TensorFlow that is just an 'is this a privacy policy' classifier.

I have about 650MB of privacy policies at the moment which I fetched via a crawler. I'm just about to classify the rest of them.

I'm trying to automate the whole thing so that we have a full workflow.

Anyway... ping me to discuss.

the_pwner2246y ago· 4 in thread

And for websites where participation is more optional, like HN or Reddit, you don't usually need to give much personal data anyway.

- Not about privacy

- Good for privacy # only if the other one is not about privacy

- Bad for privacy # only if the other one is not about privacy

- Better than [A/B] # only if neither is not about privacy
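A rough sketch of how the four answer options above could be checked in code. This is only one reading of the list, not the site's actual labelling logic; the names `Label` and `is_valid_answer` are made up for illustration, and the rule that a sentence flagged as off-topic requires an absolute good/bad label on the other one is an assumption.

```python
from enum import Enum
from typing import Optional


class Label(Enum):
    NOT_ABOUT_PRIVACY = "not about privacy"
    GOOD = "good for privacy"
    BAD = "bad for privacy"


def is_valid_answer(label_a: Optional[Label],
                    label_b: Optional[Label],
                    better: Optional[str]) -> bool:
    """Check one answer about a sentence pair (A, B) against the rules above."""
    if better is not None:
        # "Better than [A/B]" is a pure comparison: it is only allowed when
        # neither sentence has been flagged or labelled individually.
        return better in ("A", "B") and label_a is None and label_b is None

    off_topic = [lab is Label.NOT_ABOUT_PRIVACY for lab in (label_a, label_b)]
    if all(off_topic):
        # Both sentences are unrelated to privacy; nothing else to say.
        return True
    if any(off_topic):
        # Exactly one sentence is off-topic, so the other one gets an
        # absolute good/bad label instead of a comparison.
        other = label_b if off_topic[0] else label_a
        return other in (Label.GOOD, Label.BAD)
    # Both sentences are about privacy but no comparison was given.
    return False
```

With this reading, a pair like the cookie sentence versus a "no third-party beneficiaries" clause could be answered as "B is not about privacy, A is good/bad" instead of forcing a comparison.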

simongr3dal6y ago

notyourwork6y ago

I think complicated is a better choice than advanced. Intentionally complicated without adding value in a lot of cases.

gempir6y ago

I think this is mostly a problem of the tools used. An anchor tag by definition is an inline element, so it shouldn't really be a giant box that's clickable, so you default back to an onClick.

So I think tools like Angular, React, Vue etc. should get a better way to create links on websites that just change state.

Swizec6y ago

> I only give personal data to websites when I have to

> you don't usually need to give much personal data anyway

Data mining 20 yes/no answers about one of your users is ... pretty easy.

epoch_1006y ago· 4 in thread

Hi. A related project that takes a more human-powered approach is PrivacySpy (https://privacyspy.org). Would be neat to see how these tools intersect.

PrivacySpy is open source, community run, and more about grading policies on a standardized rubric (as opposed to entrusting that to ML), so these tools might complement one another.

(Full disclosure: I'm a contributor to PrivacySpy.)

jchook6y ago

Another one: https://tosdr.org/

epoch_1006y ago

ToS;DR is great, although it's more focused on terms of service so if you're looking for privacy-only info, you'll have to cut through a bit of noise.

abathur6y ago

On the off chance the various responders in this sub-thread see this comment, for a while I've hoped someone would advocate for privacy/TOS policies to follow a similar model to OSS licensing.

Normal people could conceivably read and understand a given policy if the knowledge scaled.

Any substantial adoption would help focus effort/resources on services that deviate from the terms.

Sites that are already repositories of this knowledge could play some part in codifying best-practices, advocating for adoption, and tracking progress.

shlant6y ago

We have something similar as well over at https://privacy.commonsense.org/.

We have taken different ideas from many different implementations and applied them more specifically to ed tech products.

the_watcher6y ago· 4 in thread

Telegram has a 105% score? Is that expected or a bug?

rameerezOP6y ago

Bug, one of the components of the score is not properly normalized I think. I'll fix it as soon as I handle the traffic overload :)
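For illustration only, here is a minimal sketch of how an un-normalized component can push a combined score past 100% and one way to guard against it. It is not the site's actual scoring code; the function name `combined_score`, the idea of weighted components and the example weights are all assumptions.

```python
def combined_score(components, weights):
    """Weighted average of component scores, returned as a percentage.

    Each component is expected to lie in [0, 1]. Dividing by the total
    weight keeps the result within 0..100 even when the weights do not
    sum to 1, and clamping guards against a component that arrives
    outside its expected range.
    """
    if not components or len(components) != len(weights):
        raise ValueError("need exactly one weight per component")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    clamped = [min(1.0, max(0.0, c)) for c in components]
    weighted = sum(c * w for c, w in zip(clamped, weights))
    # Final min() absorbs any floating-point rounding above 100.
    return min(100.0, 100.0 * weighted / total_weight)
```

With hypothetical weights of 0.5, 0.3 and 0.25 (summing to 1.05), three perfect components would give 105% if the weighted sum were reported directly; dividing by the total weight brings it back to 100%.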

rameerezOP6y ago

It should be pretty close to 100%, though, they seem to be super privacy friendly!

the_watcher6y ago

Yep, I read their policy. I just wasn't sure if extra credit existed.

contravariant6y ago

>BIGGEST THREAT: «We never delete your funny cat pictures, we love them too much»

I don't know, seems like a GDPR violation to me ;-).

nerdponx6y ago· 3 in thread

This is great, I've been wanting to do a project like this for a while.

Would be great to get some insight into your data collection/labeling and model design process.

rameerezOP6y ago

Some of the process on gathering the data to create the labelling dataset is described here: https://useguard.com/experiment

But most probably I'll be publishing a paper later this year detailing all the details and process :)

underanalyzer6y ago

ignoramous6y ago

Where can we follow you to know about the paper when you do eventually publish?

paktek1236y ago· 3 in thread

I get ERR_SSL_PROTOCOL_ERROR

rameerezOP6y ago

jazzyjackson6y ago

Just a thought.

nl6y ago

When reporting this, it's useful to include your browser and exact version number. Most browsers ship updated SSL cert packs with new versions, and debugging can be hard without this information.

foxes6y ago· 2 in thread

I'm not sure if you could add in extra information with some of that global information, eg the type of service, classifying different "parts" of the privacy policy etc.

raynr6y ago

Yeah, it's not comparing like for like. Feels like the system is trying to collect training data from users.

mceachen6y ago

Agreed with parent and gp. I gave up with the 10 questions, as the sentence comparisons were almost comically incomparable. I fear your model is going to be a random number generator.

1f60c6y ago· 2 in thread

I’m trying to help teach the AI, but some options don’t have to do with privacy at all.

For instance, I got (option A):

  The Games Press Web site can, optionally, store a Cookie on your computer in order to automatically log you into the site on each visit.

and option B:

  Back to Top ^ NO THIRD-PARTY BENEFICIARIES There shall be no third-party beneficiaries to this Agreement.

It would be great to have a skip/flag option for cases like these.

Edit to add:

Some other notes:

  - Mozilla, a tech company I consider ethical, is right down there with Netflix, LinkedIn and Waze
  - The box under “Sentence Breakdown by Risk Level” is empty when my ad blocker is enabled (Adguard on iOS Safari)
  - Telegram, a company I also consider ethical, has a score of 105%—is this an oversight?

quickthrower26y ago

And sometimes you get two privacy friendly policies and I'd like to say "equal".

krageon6y ago

nurettin6y ago· 2 in thread

offtopic: Is it worth using producthunt for games? Or do we concentrate our efforts on platforms like steam, appstore and play store?

lucb1e6y ago

nurettin6y ago

They are on producthunt.

carbocation6y ago· 1 in thread

rameerezOP6y ago

olafure6y ago· 1 in thread

Tinder and Mozilla have the same score of 33%. I don't agree with that. Tinder willingly shares very, very personal data along with your contacts (assuming without said contact consent).

scarlac6y ago

Keep in mind that it's analyzing their privacy policy, not their actions. Who a company specifically chooses to share the data with and how often they do it is likely not considered.

michaelaiello6y ago· 1 in thread

Hello!

http://www.privacyparrot.com/

Made a similar project many moons ago and is still kicking along. Thanks all on HN for the feedback.

ignoramous6y ago

I used it as recently as a month ago! Thanks.

R888D06y ago· 1 in thread

You're doing incredibly noble work. I've bookmarked the site and will make great use of it. Thank you for working hard to protect data privacy, I hope you never stop.

rameerezOP6y ago

Thank you for these words :)

Angostura6y ago· 1 in thread

I'm slightly puzzled that Telegram is scoring more than 100%. What am I missing?

rameerezOP6y ago

A bug in the scoring algorithm that I need to fix

A4ET8a8uTh06y ago· 1 in thread

I will join the chorus of great work. I love it. It may even make some people more privacy conscious (very few people read those -- usually the ones who wrote them).

rameerezOP6y ago

artificial6y ago· 1 in thread

This is really cool, what prompted the idea to combine the two?

rameerezOP6y ago

[1] https://news.ycombinator.com/item?id=16661735

verbify6y ago· 1 in thread

Great work! Would also work as a browser plugin.

rameerezOP6y ago

helloiloveyou6y ago· 1 in thread

Very creative project! Congratulations!!

rameerezOP6y ago

Thank you! :)

greggman26y ago

Analyse LastPass. Even without ML it's clearly saying they spy on everything you do and share it with anyone they want. I'm surprised they get recommended so often given their privacy policy.

https://www.logmeininc.com/legal/privacy

the_watcher6y ago

tboyd476y ago

Did you find that there's one single variable, like length or the presence of certain words, that the system relies on heavily?

BrS96bVxXBLzf5B6y ago

This is a good tool, but the execution of the website is disappointing.

1. Only showing excerpts of the highest threat levels. Trying to view the less severe threats asks us to email in. If you're willing to volunteer the information, why the hoops?

lab000026y ago

Holy crap, Tinder shares your profile with potential employers (or rather companies contracted by those employers).

So glad I gave up on online dating a long time ago.

But really this is amazing.

The next step would be to have a lawyer write a small opinion piece on the most popular sites.

murukesh_s6y ago

Etc

Without something like that, the problem is that companies could change the wording and the neural net would not detect it until trained again, which is potentially more dangerous!

It's time we got some standard for user web privacy policy docs, like GDPR.

martin__6y ago

The problem is probably with your type mappings: https://nginx.org/en/docs/http/ngx_http_core_module.html#typ...

19ylram496y ago

To the creator:

I actually just went through the trouble of resetting my Product Hunt account (I haven’t been on there in a long time) just to give you that upvote!

Thank you for this! Cheers.

andrerm6y ago

Again, thank you for your time and work.

Edit: telegram had hacking news this year, search for "Telegram voicemail account hijacking"

brenden26y ago

Congrats on the launch!

VvR-Ox6y ago

Though it's nice you built this, I wish we didn't need it, thanks to governments that do their job and protect the people they were originally intended to serve.

How long can we keep arming ourselves in the battle against powerful and rich entities who steal our data and buy politicians to have direct access to the process of law-making?

eleen6y ago

a2x6y ago

Letting people do the training without any knowledge or context seems to make no sense.

How do you ensure that it isn't a bot, or an army of bots?

What is the neural network doing?

How do you ensure your personal integrity, that of your team members, and the overall integrity of your 'system'?

nuc6y ago

Where is Facebook? I guess that would be one of the most interesting policies to analyze.

dastx6y ago

29athrowaway6y ago

A great idea. Some ideas for features:

- Keep a timeline of privacy policy changes, being able to compare scores between 2 versions.

- Subscribe to notifications for changes in score.

- Browser extension that shows you the score on the site.
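The first idea (comparing two versions of a policy) could start from a plain sentence-level diff, with the changed sentences then re-scored. A minimal sketch using Python's standard `difflib`; the function name `changed_sentences` and the naive split on full stops are assumptions for illustration, not anything the site is known to do.

```python
import difflib


def changed_sentences(old_policy: str, new_policy: str):
    """Return (removed, added) sentences between two policy versions."""
    # Naive sentence splitting on "."; a real tool would need a proper
    # sentence tokenizer for legal text.
    old = [s.strip() for s in old_policy.split(".") if s.strip()]
    new = [s.strip() for s in new_policy.split(".") if s.strip()]

    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(old[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new[j1:j2])
    return removed, added
```

Only the sentences in `added` would need to go through the classifier again, and a change in score between versions is what a notification could be triggered on.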

aledalgrande6y ago

Love the "biggest threat" for Telegram!

jbduler6y ago

I love the work, and it is directly applicable to the work I am doing now. Have you published your thesis?

lingrino6y ago

I own the domain policies.dev and would be happy to hand it over to this project if you’re interested.

foxhop6y ago

what if you took this idea and used it to normalize privacy policies into a normalized form?!

TheUSSR6y ago

Great site! Would be nice to have the option to submit scandals as some seem to be missing.

quickthrower26y ago

This’d be great for contracts too. E.g. an NDA.

privasim6y ago

This is amazing! How would you categorize this?

privasim6y ago

This is amazing, how would you categorize this?

godelmachine6y ago

How are you better than Firefox Monitor?

ignoramous6y ago

This is great, but like tosdr.org before it, what does it tell people that they don't already assume to be true? I use tosdr, only to often keep ignoring what it's telling me.

To that end, I like tools that let users take action in addition to showing what's wrong rather than simply point it out. Actions can include:

- Replace: Push the users towards alternatives and help them seamlessly take their data elsewhere.

-- Help change their usage behaviour. Most digital-wellbeing / internet de-centralization tech falls under this category?

- Reduce: Hand-hold them as they grasp various privacy and security settings on offer and exercise them, as appropriate.

-- JumboPrivacy does this for popular social networks.

-- The PrivacySettings plugin for Firefox is another example.

-- Plenty write blog posts to help others navigate arduous settings across popular web properties, and expect nothing in return.

- Restrict: Provide tools that let them control what the services can and cannot collect.

-- Application sandboxes like firejail / sandboxie, firewalls like Snitch / LuLu, DNS based content blockers like pi-hole, in-browser content blockers like uBlockOrigin are some examples.

acgan6y ago

Very cool! Reminds me of https://tldrlegal.com/ and https://fossa.com/.
