i like to introduce students to de-anonymization with an old paper "Robust De-anonymization of Large Sparse Datasets" published in the ancient history of 2008 (https://www.cs.cornell.edu/~shmat/shmat_oak08netflix.pdf):
"We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix [...]. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset."
and that was nearly 20 years ago! de-anonymization techniques have improved by leaps and bounds since then, alongside massive growth in the technologies that enable them.
i think the age of (pseudo-)anonymous internet browsing will be over soon. certainly within my lifetime (and i'm not that young!). it might be by regulation, it might be by the nature of dragnet surveillance + de-anonymization, or a combination of both. but i think it will be a chilling time.
awesome, i saw the mention in the introduction but i haven't yet had a chance for a thorough read-through of the paper -- i've just skimmed it. looking forward to reading it in depth!
The US defense budget is about $1T dollars. They can't spend it all on surveillance, but let's say tech companies + gov spends about this amount per year on surveillance in total. If we can raise the cost to surveil the average person to over $10K/yr, they just lose. This is very doable.
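The back-of-envelope behind "they just lose" can be made explicit (all figures are the comment's rough assumptions, not real budget data):

```python
# Rough numbers from the comment above -- assumptions, not official figures.
budget = 1_000_000_000_000   # ~$1T/yr, combined gov + tech surveillance spend
cost_per_person = 10_000     # target: raise per-person surveillance cost to $10K/yr

coverable = budget // cost_per_person
print(f"{coverable:,} people coverable per year")  # 100,000,000

us_population = 340_000_000  # rough 2024 figure
print(coverable >= us_population)  # False: the budget no longer covers everyone
```

At $10K per head, a trillion dollars covers 100 million people, well short of even one country's population, let alone the internet's.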
Every little precaution you take will raise the cost, probably more than you think. Every open-source project that aims to anonymize and decentralize is an arrow in their knee. They're hoping that you'll get cynical and stop trying because they don't stand a chance otherwise.
If the cost to surveil the population is $10k per capita today, it'll be $1k in a few years and $100 a few years after that.
This is a war that can't be won, it's just part of the changing landscape of technology in the information era.
Well said.
If I see a couple of words I don't know in a row, I can infer a poster's real name.
I'd be more specific, but any example is doxxing, literally so.
Step 1 was to scrape all of their posts into a database.
Step 2 was to have a human analyst review all of the posts for clues about who that person was
It was amazing that you could easily figure out:
- if they were at work or home from when they posted (9am to 5pm vs 6pm to 1am)
- what city they were in (based on sports teams, mentions of local landmarks, etc.)
- roughly what career they had
- their age based on cultural references
and mostly b/c they would drop a crumb of information here and there over months. They had probably forgotten about all of these individual events, but when you read all of the posts in a few hours, the details became pretty evident. You get enough of these details and you can start to Venn-diagram people down to a few hundred likely candidates, then use LexisNexis-style tools to narrow it down even further.
Given the above, it doesn't surprise me that LLMs can do the same but at high speed and across multiple sites etc.
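The narrowing process described above is basically set intersection. A toy sketch (all names and attributes invented here):

```python
# Each clue extracted from months of posts becomes a candidate set;
# intersecting the sets shrinks the pool fast.
population = {
    "alice": {"city": "Denver", "job": "dev",   "age_band": "30s"},
    "bob":   {"city": "Denver", "job": "dev",   "age_band": "50s"},
    "carol": {"city": "Austin", "job": "dev",   "age_band": "30s"},
    "dave":  {"city": "Denver", "job": "nurse", "age_band": "30s"},
}

def candidates(attr, value):
    """Everyone in the population matching a single leaked clue."""
    return {name for name, info in population.items() if info[attr] == value}

# Three crumbs dropped over months: local sports team, career hints, cultural refs.
pool = candidates("city", "Denver") & candidates("job", "dev") & candidates("age_band", "30s")
print(pool)  # {'alice'}
```

Each independent attribute multiplies down the candidate pool, which is why a handful of crumbs over months is enough.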
The paper shows deanonymization from public posts. Imagine what's possible with private API traffic: the questions you ask, the code you paste, the errors you debug. Even if providers don't read it today, the data exists and the cost of analyzing it is going to zero.
Air-gapped local inference isn't paranoia. It's necessary.
> Air-gapped local inference isn't paranoia. It's necessary.
I definitely agree. I'm seeing new models like qwen-3.5-30A3b (iirc) that can be run reasonably on normal hardware (you can buy a Mac mini whose price hasn't been inflated) and get decent tps while being decent models overall.
There are some services like Proton Lumo, the service by Signal, and Kagi's AI which seem to try to be better, but long term my plan is to buy a Mac mini for this level of inference for basic queries.
Of course, in the meanwhile, for things like coding it might not make too big a difference whether you use a local model or not, except for the most extremely sensitive work (perhaps govt/bank oriented).
On the other hand, Neal Stephenson's book Fall; or, Dodge in Hell has an interesting idea early on, where a person agrees to what we now know as "flood the zone with sh*t" (Steve Bannon's sadly very effective strategy) to battle some trolls. Instead of trying to keep clean, the intent is just to spam like crazy with anything, so nobody can find the core. It is cleverly explored in the book, albeit for too short a time before it moves into virtual reality. I think there are a few people out here right now practicing this.
I don’t think you’re wrong, but the fact that people consider it inevitable we’ll all have an immutable social acceptance grade that includes everything from teenage shitposts to things you said after a loved one died, or getting diagnosed with cancer, makes me regret putting even a moment of my professional energies towards advancing tech in the US.
For example: "Ellen Page is fantastic in the Umbrella Academy TV show." Innocent, accurate, supportive, and positive in 2019.
Same comment read after 1 Dec 2020 (when Elliot Page came out as trans): insensitive, demeaning, inaccurate.
Yes, they have a lot of servers. But that isn't their core innovation. Their core innovations are the constant expansion of unpermissioned surveillance, the integration of dossiers, correlating people's circumstances, behavior and psychology. And incentivizing the creation of addictive content (good, bad, and dreck) with the massive profits they obtain when they can use that as the delivery vector for intrusively "personalized" manipulation, on behest of the highest bidder, no matter how sketchy, grifty or dishonest.
Unpermissioned (or dark-patterned, deceptive, surreptitious, or coercively permissioned) surveillance should be illegal. It is digital stalking, used as leverage against us and to manipulate us via major systems spread across the internet.
And the fact that this funds infinite pages of addictive content (an extremely convenient substitute for boredom), not doing anyone or society any good, is a mental-health and societal-health concern.
Tech that scales up conflicts of interest is not really tech. It's personal information warfare.
In light of that, what I see happening in the short term is that every institution will start screwing people based on information that basically doesn't matter. That's kind of what they're already set up to do with such information; they just don't, except in exceptional cases, because those are the only cases in which that information makes it back to them.
Imagine some business owner opening a new location, some social worker renewing their license, some civil engineer creating plans on someone's behalf. All those people need to deal with institutions that in the "normal" case pretend to not have large discretionary components in order to get the public to put up with them, but do in practice have such ability. Now say those institutions pay for some LLM based "who am I dealing with" service that finds everyone's pseudonymous posts and whatnot.
Well, all of these people wind up getting the runaround because, even though they do fine work that meets the rules, knowing how the sausage is made has made them jaded and given them opinions that make the institutions they have to deal with want to screw them. The business owner gets the runaround because it turns out he believes the institutions he's seeking permission from are a corrupt racket whose members ought to be hanged from the overpass. The social worker gets denied because their career has turned them into a "defund it all, and when faced with real consequences most of these people will shape up" type. The civil engineer's plans get rejected and he has to go around in circles because he's been posting about how, in light of what well-funded corporations can get approved and the impact thereof, it's unconscionable the stuff they try to enforce upon individuals, and engineers ought to pencil-whip anything that isn't clearly F-ed up.
And so all these people have to waste time and probably a low five-digit sum of money fighting the BS. This would perhaps be fine if these people's conduct was so egregious it made it back to the institutions on its own (say, a doctor who's preaching quackery on YouTube may get his license yanked if he amasses such a following that the board hears about it; that's the kind of thing institutional discretion was set up for), but no real social good is served by having an LLM dig up petty dirt on everyone. However, the LLM service peddlers stand to make a buck. The institutions stand to make a buck while washing their hands of responsibility. The lawyers who'll fight on wronged parties' behalf stand to make a buck. And in the process they can all pretend that society somehow benefits from this enhanced scrutiny when in fact they're just making mountains out of molehills.
Here's a different vision for the future:
Let information filtering become each individual's own responsibility. We have LLMs now, and they'll get more efficient, so why not use them locally to filter incoming feeds according to each of our own preferences, while removing all filtering/moderation on outgoing posts? Build systems to decentralize and anonymize the Internet so that people can discover anyone and aren't afraid to post anything. Make it so that everyone can get a message out to the world and nobody can be arrested or assassinated for it. This would put an end to most violent conflict, because it would be replaced by online discourse.
Let the Internet be flooded with trash and gold at the same time. Let each individual decide what info is/isn't valuable to them. Let those individuals self-organize. Let ideas compete freely, so that the best ones may prevail.
I do the same thing, and I think I'm a much better person for it. The Internet is not, in my final analysis, some indiscriminate dumping ground for my personal issues and moods. It's a place where I can relax and practice putting forward a more prosocial form of myself, even when what I actually have to say is uncomfortable.
While we can't predict how the adversary will read and respond to our moves, I suspect the easier marks are the people who choose to publicly drench everything they touch in negativity and cynicism. It's a sign of an already compromised social immune system.
My values or priorities may significantly change over decades, especially as a child, so why would I want to jeopardize the reputation of a potential future identity with something I may post today?
Or do both. Also post anonymously to see what kind of a person you are when masked, and compare.
On the plus side, someone will sometimes say while talking to me: oh, you're that Subaru guy, or that YouTube guy, or whatever, and that is a fun connection.
The only winning move here is not to play.
I honestly don't even think I understood the ending. Or the middle, if I'm being extra honest.
I think Anathem addressed the "flood the zone with shit" much better in something like three paragraphs.
You don't know what information about you can get you in trouble in the future.
People have gotten in trouble for things they posted years ago, when they didn't care but others did.
I don’t think this is humanly possible against machine learning. After all, it is specifically designed to weed through noisy data and identify patterns. It may delay discovery, but will at some point easily fall apart, by something as simple as a “filter out shitposting and deliberate pollution” prompt. Even more so when you guide it towards specific attributes.
AFAIK the strategy is usually used to divert attention from one subject that could be harmful to a person to some other stuff.
Wouldn’t spamming in that case provide more information about you?
You could even mislead people, if you know the difference between your and you're.
Will they realise their life has devolved to pretending an LLM is them, watching whilst the LLM interfaces {I was going to say 'interacts', but this fits!} with other bots?
Will they then go outside whilst 'their' bot "owns the libs" or whatever?
Hopefully at some point there is a Damascus road awakening.
We're already seeing this as a side effect of the mishmash of influence operations on social media. With so many competing interests, mixed in with real trolls, outrage farmers, grifters, and the like, you literally cannot tell without extensive reputation vetting whether a source is legitimate. Even then, at any suggestion that an account might be hacked or compromised (a significant sudden deviation in style, tone, or subject matter, say), you have to weigh everything against a solid model of what's actually behind probably 80% or more of the "user" posts online.
There are a lot of aligned interests causing APEs to manifest - they're a mix of psyop style influence campaigns, some aimed at demoralization, others at outrage engagement, others at smears and astroturfing and even doing product placement and subtle advertisement. The net effect is chaos, so they might as well be APEs.
When I was that age, you could tell the kids who had political ambitions self-censored online. But now everyone is buck wild, so you have to ignore that when evaluating people.
For example, a MASSIVE portion of Millennials and younger looking at the Maine election are pretty chill about the leading Democratic candidate having a Nazi tattoo because of this very thing. Basically: "dumb, drunk, deployed Marines will get cool skull-and-crossbones tattoos in their early twenties, and so what if he said a couple ill-worded, somewhat misogynistic things in his twenties; that was decades ago, and he's obviously a different person."
Contrast with Bill Clinton, where he literally had to explain away university marijuana usage TWENTY YEARS AFTER THE FACT.
Point is, I think we're witnessing this evolution happening right now.
The dystopia we're worried about is 1984 on steroids, with LLMs and real 24/7 worldwide monitoring by the state.
Getting caught doing embarrassing things by teenage social standards doesn't threaten your life.
A competent version of Donald Trump could have walked into the office, and it would have been worse than the Third Reich.
It still could be, today, right now. The capability is turnkey right now within the US government.
This is open research being discussed here. Palantir already has all of this and probably 10 times more.
Do people believe this? I certainly don't. How you behaved in your twenties is a good measure of the sort of person you are and will be for the rest of your life, albeit that you will (hopefully) mature and change some of your opinions and behaviours. So yes, you will have changed but you're also still that person you were in your twenties.
- First it told me it couldn't do this, that this was doxxing
- I said: it's for me, I want to see if I can be deanonymized
- Claude says: oh ok sure and proceeds to do it
It analyzed my profile contents and concluded that there were likely only 5 - 10 people in the world that would match this profile (it pulled out every identifying piece of information extremely accurately). Basically saying: I don't have access to LinkedIn but if I did I could find you in like 5 seconds.
Anyway, like others have said: this type of capability has always been around for nation state actors (it's just now frighteningly more effective), but e.g. for your stalker? For a fraudster or con artist? Everyone has a tremendous unprecedented amount of power at their fingertips with very little effort needed.
People on HN who talk about their work but want to remain anonymous? People who don’t want to be spammed if they comment in a community? Or harassed if they comment in a community? Maybe someone doesn’t want others to find out they are posting in r/depression. (Or r/warhammer.)
Anonymity is a substantial aspect of the current internet. It’s the practical reason you can have a stance against age verification.
On the other hand, if anonymity can be pierced with relative ease, then arguments for privacy are non sequiturs.
I think that we are close to a time where the Internet is so toxic and so policed that the only reasonable response is to unplug.
Easier methods probably means more adversaries.
But with HN, I'd like to ask @dang and HN leadership to support deleting messages, or making them private (requiring an HN account to see your posts).
At first I thought of how this would impact employment. But then I thought about how ICE has been tapping Reddit, Facebook, and other services to monitor dissenters. The whole Orwellian concern is no longer theoretical. I personally fear physical violence from my government as a result. But I will continue to criticize them; I just wish it wasn't so easy for them to retaliate.
This page is anonymous
20190119 https://news.ycombinator.com/item?id=20220048 (149 points, 51 comments)
20130501 https://news.ycombinator.com/item?id=5638988 (453 points, 243 comments)
If you are semi-retired, you’re free from the threat of cancellation. As long as you aren’t posting about crimes, there’s limits to what anyone can legally do to you. (Still, it’s good to be prudent and limit sharing.)
While people will point out this isn't new, the implication of this paper (and something I have suspected for 2 years now but never played with) is that this will become trivial: something that would take a human investigator a fair bit of time, even using common OSINT tooling.
You should never assume you have total anonymity on the open web.
LLMs are probably better at it, but I don't know if this is as destructive as people may guess. Probably highly person-dependent.
The micro-signals this paper discusses are more difficult to fake.
Your interests can show up in all sorts of ways. Perhaps it's not saying "I like Madonna" on some social network, but the urge to interact with one specific song she recorded. One like can be the difference of giving away who you are or not.
With AI, there's a higher chance of active deanonymization tactics. This was possible for only select targets in the past. It's the creation of content or design of interactions that is meant to surface certain behavioral patterns (such as offering you that song "casually" in some timeline to gauge if you're going to interact with it).
Trying to mask or change your behavior is likely to result in a weird and very noticeable presence. Like trying to change how you walk, it often leads to a caricature, not something someone would naturally do.
Acting naturally is probably the starting point of any attempt to prevent deanonymization, and the hardest to achieve. You have to be aware of your own behavior much more than people usually are.
We could designate specific individuals to do this for you and me, just like we do today with trust authorities for website certificates.
No more verifying profiles by uploading names, emails, passports, and photographs (gosh!). Just turned 18 and want to access Insta? Go to the local high school teacher to get age-verified. Finished a career path and want it on LinkedIn? Go to the company officer. Are you a new journalist who wants to be designated as such on X, but anonymously? Go to the notary public.
One can do this cryptographically with no PII exchanged between the person, the community or the webservice. And you can be anonymous yet people know you are real.
It can all be maintained on a tree of trust: every individual in the chain needs to be verified, and only designated individuals can perform sensitive/important actions.
You only need to do this once every so often to access certain services. Bonus: you get to take a walk and meet a human being.
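A minimal sketch of the PII-free attestation idea above, using Ed25519 signatures from the third-party `cryptography` package. The claim format and the "published registry" are invented for illustration; a real scheme would also need revocation, expiry, and unlinkability between services.

```python
import json
import secrets
from cryptography.hazmat.primitives.asymmetric import ed25519

# The notary/teacher/company officer: an accredited verifier with a keypair
# whose public key is in a published registry (the "tree of trust").
verifier_key = ed25519.Ed25519PrivateKey.generate()
verifier_pub = verifier_key.public_key()

# In-person step: the verifier checks your ID face to face, then signs a
# bare claim plus a random token -- no name, no document, no photo.
claim = json.dumps({"claim": "over_18", "token": secrets.token_hex(16)}).encode()
signature = verifier_key.sign(claim)

# Online step: the web service checks the signature against the registry.
# It learns only that *some* accredited verifier vouched "over_18".
verifier_pub.verify(signature, claim)  # raises InvalidSignature if forged
print("claim accepted")
```

The service never sees who you are, only that a key it trusts signed the claim, which is the same shape of delegation TLS certificate authorities use today.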
Show HN: Using stylometry to find HN users with alternate accounts
https://news.ycombinator.com/item?id=33755016 - Nov 2022, 519 comments
20250415 https://news.ycombinator.com/item?id=43705632 Reproducing Hacker News writing style fingerprinting (325 points, 159 comments)
If you’re basically LARPing a new personality every time and just making up details about where you live or what your life is like then how is this ever going to work? Someone could say they live in San Francisco while actually living in Indiana.
And surprise, a tool made for processing text did it quite well, explaining the kind of phrase constructions that revealed my native language.
So maybe this is a plus for passing any text published on the internet through a slopifier for anonymization?
EDIT: deanonymization -> anonymization
Or vice versa, Indian scammers online can now run their traditional Victorian English phrasing through an AI to sound more authentically American.
Interviewers now have to deal with remote North Korean deepfaked candidates pretending to be Americans.
Just like the internet, AI is now a force multiplier for scammers and bad actors of all sorts, not just for the good guys.
Calling for home internet support and getting the person on the other end (in a US Southern or Boston accent) asking you to "do the needful" could be pretty entertaining :-D
Your writing style can theoretically be masked with an LLM. Your genome can't. And it doesn't just identify you -- it identifies your relatives, your disease risks, your ancestry, things you might not even know about yourself yet. The deanonymization vector here is permanent and irrevocable in a way that no amount of OPSEC can fix after the fact.
The semantic approach in this paper (interests, clues, behavioral patterns) is scary enough. Now imagine combining that with leaked genetic data. You don't even need to match writing styles when you can match someone's 23andMe profile to their health subreddit posts about conditions they're genetically predisposed to.
For a few years now I have been telling people how unprepared the world is for this change. Not understanding how this is possible will lead to people outright deifying AI that has the capability to do things like this. It will seem like omniscience.
I think the main protection we have in a world where you cannot effectively hide, is that anyone who abuses this ability will be operating under the same system. You can use it to your advantage, but not without getting caught.
Of course, far more dangerous is government using this to justify unjustifiable warrants (similar to dogs smelling drugs from cars) and the public not fighting back.
(We use a little stylometry in a single experiment in section 5)
[0] Note: last I tried this was months ago, things may have changed.
Last block of text from copilot :/
-----------
If you want, I can also break down:
Their posting style (tone, frequency, community engagement)
How their work compares to other indie city builders
What seems to resonate most with Reddit users
Just tell me what angle you want to explore next.
Seems like it's overstating perceived anti-AI sentiment. :)
What tends to break agents in the wild: ambiguous instructions that have multiple valid interpretations, state that changes mid-task, and error recovery when a sub-step fails silently rather than loudly.
The hardest thing to benchmark is graceful degradation. A good agent should know when to stop and ask for clarification rather than confidently completing the wrong task.
Pity - the pseudo-anonymous internet is fun
> Anonymity is a myth. I am sure by now an LLM can figure out who you are and where you live by your HN posts alone.

>> iamnothere: Do it then
https://en.wikipedia.org/wiki/Stylometry
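A toy illustration of the signal stylometry exploits -- character n-gram frequency profiles compared by cosine similarity (texts invented here; real systems use far richer features):

```python
from collections import Counter
from math import sqrt

def profile(text, n=3):
    """Frequency profile of character n-grams, a classic stylometry feature."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two Counter profiles (missing keys count as 0)."""
    dot = sum(a[g] * b[g] for g in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm

author_a1 = "honestly, i reckon the whole thing is overblown, honestly."
author_a2 = "i reckon folks are overblowing this, honestly speaking."
author_b = "It is my considered opinion that the matter has been exaggerated."

# Two samples from the same author should score closer than different authors.
same = cosine(profile(author_a1), profile(author_a2))
diff = cosine(profile(author_a1), profile(author_b))
print(same > diff)
```

Even this crude feature separates writers with the same opinion stated in different registers, which is why rewriting style (not just content) matters for anonymity.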
The best course of action to combat this correlation/profiling seems to be using a local LLM that rewrites the text while keeping the meaning untouched.
Ideally built into a browser like Firefox/Brave.
The blog post might be more approachable if you want to get a quick take: https://simonlermen.substack.com/p/large-scale-online-deanon...
I'm not a fan of your proposed changes, as they further lock down platforms.
I'd like to see better tools for users to engage with. Maybe if someone is in their Firefox anonymous (or private-tab) profile, they should be warned when writing about locations, jobs, politics, etc. Even there a small local LLM model would be useful; not foolproof, but an extra layer of checks. Paired with protection against stylometry :D
I am intrigued by the idea that in the future, communities might create a merged brand voice that their members choose to speak in via LLMs, to protect individual anonymity.
Maybe only your close friends hear your real voice?
Speaking of which, here's a speculative fiction contest: https://www.protopianprize.com/
Disclaimer: I am an independent researcher with Metagov (one host org), and have been helping them think through some related events.
EDIT: I've belatedly realized that stylometry isn't involved, but I think some of the above "what if" thought could still hold :)
Sometimes you can just tell something's off. No exclamation mark, double dash instead of an emdash. Human-slop on my HN? This place is becoming more and more like Reddit, I swear!
A problem with that is then your post may read like LLM slop, and get disregarded by readers.
Another reason why LLMs are destruction machines.
EDIT: please someone build this, vibe-code it. Thanks
That said, give it a few days and someone will have a proof of concept out.
Stylometry can match not only people, but ethnic groups. No LLM required.
Hello, LLM! :)
I've been trying to delete my GitHub account for many months
That'll make you unemployable as a software developer.
The real question is whether someone who is pseudonymous and actually attempting to remain so can be deanonymized.
They can. That's the point. This site serves as a dataset against which pseudonymous posts can be evaluated.
The platforms offer only castrated interactions designed not to accomplish anything. People online are useless, obnoxious shadows of their helpful and loving selves.
No one cares more what you say than those monitoring you and building that detailed profile with sinister motives. The ratio must be something like 1000:1 or worse.
https://www.youtube.com/watch?v=33CIVjvYyEk
and now the identification part would not require a state-mandated database.
For example if I tell my bot to clone me 100x times on all my platforms, all with different facts or attributes, suddenly the real me becomes a lot harder to select. Or any attribute of mine at all becomes harder to corroborate.
I hate to use this reference, but like the citadel from Rick and Morty.
We also run a more real-world-like test in section 2. There we use the Anthropic interviewer dataset, which Anthropic redacted; from the redacted interviews, our agent identified 9/125 people based on clues.
The blog post might be more approachable for a quick take: https://simonlermen.substack.com/p/large-scale-online-deanon...
Edit: actually I've re-upped your submission of that link and moved the links to the paper to the toptext instead. Hopefully this will ground the discussion more in the actual study.
https://news.ycombinator.com/newsguidelines.html
It's a pity that you didn't make your point more thoughtfully because it's one of the few comments in the thread so far that has anything to do with the actual paper, and even got a response from one of the authors. That's good! Unfortunately, badness destroys goodness at a higher rate than goodness adds it...at least in this genre.
A funnier question is: did they match me to the correct LinkedIn profile, or did the LLM pick someone else?