As the basis for my thesis on AI and NLP, I've been working on an RNN-based text classifier that reads and analyzes privacy policies. It understands that "we don't share your data with third parties" is privacy friendly while "we may share your data with anyone" is a potential threat.
I've then created this website with a bunch of analyzed services to showcase the most relevant info about each service along with other interesting stuff like recent data breaches or instructions to delete your account in said service.
Happy to answer Qs about the tech behind it. It'd also be great to hear your feedback on what the site lacks and possible improvements!
I think it's better (for me as a user) if you don't boil things down to a score, as different people expect different things when talking privacy. It would help if you could simply highlight the potentially problematic clauses in different privacy statements, along with some reason why each might be problematic.
In other words, giving a single score plus a two-sentence highlight is probably about the right amount of information.
Does it understand "we don't share your data with just any old third parties", or "we're not like our competitors who may share your data with anyone"?
Do you have an integration with HaveIBeenPwned?
What tech is involved to get something like this on the web?
I'm working on something similar at the moment for a client.
Right now I'm just starting out but I've built a privacy policy classifier that is an RNN classifier based on TensorFlow that is just an 'is this a privacy policy' classifier.
I have about 650MB of privacy policies at the moment which I fetched via a crawler. I'm just about to classify the rest of them.
I'm trying to automate the whole thing so that we have a full workflow.
Anyway... ping me to discuss.
It then went to an error page instead of loading the details for Mozilla, but while it's an interesting idea I'm not sure how useful it is. I don't usually create an account on a website unless I have to do so, and the privacy policy is nonnegotiable. So why would I want to check what their privacy policy is?
I only give personal data to websites when I have to (e.g. to services that work or school use) or if I already trust the company to not do anything shady with it (Mozilla has done some sketchy stuff but I believe they won't leak my passwords).
And for websites where participation is more optional, like HN or Reddit, you don't usually need to give much personal data anyway.
Edit: the website is fully working now. Mozilla has had one security breach where emails and hashed passwords were leaked, in 2014. At the bottom the sentence breakdown is 2.5/12/22% concerning/mild/friendly. Meanwhile Reddit has no breaches, but keeps your messages forever and shares data with ad companies. Their sentence breakdown at the bottom is 3.3/23/9%. Overall, the AI rates Mozilla at 33% and Reddit at 41%. That doesn't really make sense to me.
I would really like to see more details about the privacy policy sentences on the website. If 2.5% of Mozilla's privacy policy is very concerning and 12% is mildly bad, I would like to see the actual sentences to know the risks. There is a button to view the full annotated policy, but clicking it says to send an email to you. Edit: this seems like a bug, it shows a few sentences in a WebKit-based browser [Falkon] but in Firefox it just shows the chain link icons.
Finally, I took the A/B test from Guard, and quite a few A/B choices seemed to not really have anything to do with privacy. If the dataset is kept the same, then I think a different test format would be to rate each A and B snippet as:
- Not about privacy
- Good for privacy # only if the other one is not about privacy
- Bad for privacy # only if the other one is not about privacy
- Better than [A/B] # only if both are about privacy
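A rough sketch of how that rating format could be checked, in TypeScript. This is only my reading of the list above, not anything Guard actually does: the type names, the field names and the `isValidPairRating` helper are all made up for illustration, and I'm assuming a comparative answer implies that both snippets are about privacy.

```typescript
// Hypothetical labels for a single snippet (names are illustrative only).
type SnippetLabel = "not_about_privacy" | "good" | "bad";

// One answer for an A/B pair: either each snippet gets its own label,
// or the rater says which of the two is better for privacy.
type PairRating =
  | { kind: "absolute"; a: SnippetLabel; b: SnippetLabel }
  | { kind: "comparative"; better: "A" | "B" };

// Enforces the constraints from the list: "good"/"bad" are allowed only
// when the other snippet is not about privacy. A comparative answer is
// taken to mean that both snippets are about privacy (an assumption).
function isValidPairRating(r: PairRating): boolean {
  if (r.kind === "comparative") {
    return true;
  }
  // Absolute labels need at least one off-topic snippet; if both were
  // about privacy the rater should have used the comparative form.
  return r.a === "not_about_privacy" || r.b === "not_about_privacy";
}
```

With this shape, the cookie-vs-"no third-party beneficiaries" kind of pair can be recorded as one off-topic snippet plus an absolute label, instead of forcing a comparison.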
Anyway, the data itself may be somewhat useful to me if I want to learn more about a company's privacy practices. But for normal people, I think it would be helpful for the website to also explain why privacy is important and why people should care.
onClick in popular frameworks just means left click and nothing more, which makes sense, except for that use case of opening in a new tab with middle mouse button or right clicking etc. So you have to add a lot of logic to support all that.
If you break the HTML spec and make an anchor tag a block element, then you have to catch and stop its click event; otherwise it works as a normal link, when you actually just want to change state in your JS app.
So I think tools like Angular, React, Vue, etc. should offer a better way to create links on a website that just change state.
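For what it's worth, the extra logic usually boils down to one check before you intercept the click. A minimal sketch in TypeScript, assuming you keep a real `href` on the anchor; `ClickLike` and `shouldHandleInApp` are hypothetical names, and the interface only mirrors the standard `MouseEvent` fields it needs so the snippet runs outside a browser too.

```typescript
// Stand-in for the subset of DOM MouseEvent fields used below.
interface ClickLike {
  button: number; // 0 is the primary (usually left) button
  metaKey: boolean;
  ctrlKey: boolean;
  shiftKey: boolean;
  altKey: boolean;
  defaultPrevented: boolean;
}

// Returns true only for a plain primary-button click, i.e. the one case
// where the app should take over and change state itself.
function shouldHandleInApp(e: ClickLike): boolean {
  if (e.defaultPrevented) return false; // someone else already handled it
  if (e.button !== 0) return false; // middle or right button
  if (e.metaKey || e.ctrlKey || e.shiftKey || e.altKey) return false; // new tab/window, download, etc.
  return true;
}
```

In the click handler you would call `preventDefault()` and update state only when this returns true; in every other case you do nothing and let the browser follow the `href`, so middle click and "open in new tab" keep working.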
This is a great policy that I think more people should use. Not everything needs your real name, real birthday, or your real home address. Definitely not your real phone number. You often do need a real email address.
> you don't usually need to give much personal data anyway
This is where it gets tricky. Anonymized aggregate data can be surprisingly identifying. You only need 33 bits of information to uniquely identify any individual in the world. If your IP tells me you're from San Francisco, then I need just 20 bits of information to uniquely identify you.
Data mining 20 yes/no answers about one of your users is ... pretty easy.
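The arithmetic behind those numbers, as a quick TypeScript sketch. The population figures are my own rough assumptions (about 7.8 billion people worldwide, about 880,000 in San Francisco), and `bitsToIdentify` is just an illustrative helper: it rounds log2 of the population up to whole yes/no answers.

```typescript
// Number of perfectly informative yes/no answers needed to single out
// one member of a population of the given size: ceil(log2(N)).
function bitsToIdentify(populationSize: number): number {
  if (!(populationSize >= 1)) {
    throw new RangeError("populationSize must be at least 1");
  }
  return Math.ceil(Math.log2(populationSize));
}

// Assumed round figures, not exact census data.
const WORLD_POPULATION = 7.8e9;
const SAN_FRANCISCO_POPULATION = 880000;

console.log(bitsToIdentify(WORLD_POPULATION)); // 33
console.log(bitsToIdentify(SAN_FRANCISCO_POPULATION)); // 20
```

Real attributes are rarely perfectly informative or independent, so in practice it takes more than 20 data points, but not many more.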
PrivacySpy is open source, community run, and more about grading policies on a standardized rubric (as opposed to entrusting that to ML), so these tools might complement one another.
(Full disclosure: I'm a contributor to PrivacySpy.)
Normal people could conceivably read and understand a given policy if the knowledge scaled.
Any substantial adoption would help focus effort/resources on services that deviate from the terms.
Sites that are already repositories of this knowledge could play some part in codifying best-practices, advocating for adoption, and tracking progress.
We have taken different ideas from many different implementations and applied them more specifically to ed tech products.
I don't know, seems like a GDPR violation to me ;-).
Would be great to get some insight into your data collection/labeling and model design process.
But most probably I'll be publishing a paper later this year covering all the details and the process :)
Just a thought.
E.g. one policy might be disclosing what they do (but it's actually relevant to collect data, e.g. a password manager) while the other just says "no we don't collect anything". In this case one feels like it's a better option, but it's not exactly the same situation; it's missing some context. I feel like this could potentially bias ratings.
I'm not sure if you could add in extra information with some of that global information, eg the type of service, classifying different "parts" of the privacy policy etc.
To elaborate on your last sentence, context is critical in assessing whether a clause is pro- or anti- privacy. Is the collection of information critical to the provision of the service? What is collected, and how much? And so on.
For instance, I got (option A):
The Games Press Web site can, optionally, store a Cookie on your computer in order to automatically log you into the site on each visit.
and option B: Back to Top ^ NO THIRD-PARTY BENEFICIARIES There shall be no third-party beneficiaries to this Agreement.
It would be great to have a skip/flag option for cases like these.
Edit to add:
Some other notes:
- Mozilla, a tech company I consider ethical, is right down there with Netflix, LinkedIn and Waze
- The box under “Sentence Breakdown by Risk Level” is empty when my ad blocker is enabled (Adguard on iOS Safari)
- Telegram, a company I also consider ethical, has a score of 105%. Is this an oversight?
After a few of those I picked one at random and then it put me onto some kind of 'Game', which made me feel they were trying to train me instead of vice-versa. The game didn't respond to clicks so I closed the app.
As to your two options having nothing to do with privacy: The fact that there cannot be any third-party beneficiaries is in fact a baldly privacy-friendly statement, because a third party is yet another party that may influence a privacy policy in a net-negative way.
Made a similar project many moons ago and it is still kicking along. Thanks all on HN for the feedback.
Did you find that there's one single variable, like length or the presence of certain words, that the system relies on heavily?
1. Only showing excerpts of the highest threat levels. Trying to view the less severe threats asks us to email in. If you're willing to volunteer the information, why the hoops?
2. "Play a short game to continue using this tool" ensures I'm not going to share this with anyone. Putting a stranglehold on users is _never_ the way forward. I might have volunteered my time if I were at home and browsing through. I can't when I'm quickly flicking through while taking a five-minute break from work. But it's left me with a final negative impression before being unceremoniously blocked off.
So glad I gave up on online dating a long time ago.
But really this is amazing.
The next step would be to have a lawyer write a small opinion piece on the most popular sites.
Etc
Without something like that, the problem is that companies could change the wording and the neural net would not detect it until trained again, which is potentially more dangerous!
It's time we get some standard for user web privacy policy docs, like GDPR.
The problem is probably with your type mappings: https://nginx.org/en/docs/http/ngx_http_core_module.html#typ...
I actually just went through the trouble of resetting my Product Hunt account (I haven’t been on there in a long time) just to give you that upvote!
Thank you for this! Cheers.
Again, thank you for your time and work.
Edit: telegram had hacking news this year, search for "Telegram voicemail account hijacking"
Congrats on the launch!
How long can we keep arming ourselves in the battle against powerful and rich entities who steal our data and buy politicians to have direct access to the process of law-making?
How do you ensure that it isn’t a bot, or an army of bots?
What is the neural network doing?
How do you ensure your personal integrity, that of your team members and the overall integrity of your ‘system’?
- Keep a timeline of privacy policy changes, being able to compare scores between 2 versions.
- Subscribe to notifications for changes in score.
- Browser extension that shows you the score on the site.
Also, I'm not sure if a majority care about privacy when the value delivered is super high. They submit to the will of the service provider, as if it was the cost of doing business, without realising they could either look for alternatives or exercise stricter control over what they share and how [0].
To that end, I like tools that let users take action in addition to showing what's wrong rather than simply point it out. Actions can include:
- Replace: Push the users towards alternatives and help them seamlessly take their data elsewhere.
-- Help change their usage behaviour. Most digital-wellbeing / internet de-centralization tech falls under this category?
-- Translate / Pipe data exported from one service provider and import it into another. For instance, it is tiresome to move away from WordPress to ghost.org; or from WhatsApp to Signal. Emails work great.
- Reduce: Hand-hold them as they grasp various privacy and security settings on offer and exercise them, as appropriate.
-- JumboPrivacy does this for popular social networks.
-- The PrivacySettings plugin for Firefox is another example.
-- Plenty write blog posts to help others navigate arduous settings across popular web properties, and expect nothing in return.
- Restrict: Provide tools that let them control what the services can and cannot collect.
-- Application sandboxes like firejail / sandboxie, firewalls like Snitch / LuLu, DNS based content blockers like pi-hole, in-browser content blockers like uBlockOrigin are some examples.
[0] One of the first questions folks asked after an exodus-privacy (which is super nice and something I use every other month) presentation at FOSDEM was, 'What can I do now that you've exposed what apps on Android do with the permissions granted to them and the SDKs they embed?' exodus-privacy, as great as it is, doesn't let you take action but presents a nice overview of the dangers to your privacy due to the apps you've installed. Instead, you might end up having to independently discover and install Blokada or AdAway or Pi-Hole or NetGuard or XPrivacyLua or microG or GrapheneOS or...