The naughty username checking system used by Twitch

(ghostbin.com)

565 points · PrincessJess · 4y ago · 337 comments

204 comments · 57 top-level

phamilton4y ago· 29 in thread

We had to do this for a link shortening system (to make sure random base64 didn't contain profanity). It was a pretty fun problem. Not just the implementation, but doing the math to make sure it didn't make our shortened links easily enumerable. The implementation wasn't too bad, but we set up logging initially to spit out any random strings it decided to block. I demo'd this in front of the whole company and live tailed the logs and the first one that popped up during the demo was a big ole F bomb. It made for an excellent demo.

beachy4y ago

> to make sure random base64 didn't contain profanity

I would have said "why bother" until this happened to us.

A customer rang us up in a fury because some demo/ random data that we generated happened to have the word "penis" in it. They were convinced we must have put it there because we thought he was a cock. It was very difficult to defuse the situation.

eps4y ago

I have a bus ticket from Stockholm with the serial of F4CK. Totally made my day back then to be honest.

dmd4y ago

I was pretty happy when my randomly generated imgur URL of a photo of a telephone cake I made ended up being xxTEL.

https://imgur.com/gallery/xXtEL

unknownOrigin4y ago

> I would have said "why bother" until this happened to us.

Aah, the good ole "one customer is unhappy, let's waste a week of time on this" approach to IT management. Takes guts to tell such customers "here's your refund, now piss off", but it is the right thing to do.

shpx4y ago

Recently on HN someone thought that reddit.com/imgur directing to a post on /r/Drugs meant something when it's just a randomly generated ID. Every 5-letter word that I tried worked because there have been so many posts.

https://news.ycombinator.com/item?id=28676096

boyter4y ago

Reminds me of this great story https://thedailywtf.com/articles/The-Automated-Curse-Generat...

resonious4y ago

I just recently saw a randomly generated ID of ours in production that starts with "doggy". Thankfully "doggy" is pretty innocent, but it really made me think "wow what if it was something bad". Unfortunate that that exact scenario seems to have happened to you already.

randac4y ago

I got a CAPTCHA variant not long ago that was along the lines of "U2KYS". Pretty toxic!

foldr4y ago

We had a similar situation with a random name generator that just picked first names and last names at random. One result of this was 'Gaylord Dickinson', which sounds like it could only possibly have been made up as a homophobic joke, but which was just the random combination of two quite common first and last names.

ademarre4y ago

When it comes to censoring randomly generated strings, I like simply to omit vowels from the alphabet. Usually I'll omit some of the more obvious lookalikes too, e.g. [1 0 v].

It's a simple solution. Sure, it is still possible for something to slip through that looks similar to something bad. But the potential to strongly offend is greatly reduced.
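
A minimal Python sketch of the vowel-free alphabet idea described above. The exact excluded set (vowels plus the lookalikes 1, 0 and v) follows the comment, but the names `EXCLUDED`, `ALPHABET` and `random_id` are illustrative assumptions, not anyone's production code.

```python
import secrets
import string

# Characters left out: vowels, plus lookalikes that can stand in for vowels
# (1 ~ i/l, 0 ~ o, v ~ u). This exact set is an assumption for illustration.
EXCLUDED = set("aeiou10v")
ALPHABET = "".join(
    c for c in string.ascii_lowercase + string.digits if c not in EXCLUDED
)


def random_id(length: int = 8) -> str:
    """Return a random ID drawn only from the vowel-free alphabet."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

The cost is a smaller alphabet (28 symbols here instead of 36), so IDs need to be slightly longer for the same number of possible values.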

cratermoon4y ago

Yeah there was an article I read a while back about a company looking to prevent the use of 'naughty' words in randomly-generated strings used as event IDs. Apparently someone with some pull had seen a message with an offensive word. Some management committee spent a long time trying to figure out how to solve the problem including proposing keeping a list of bad words, and then worrying about what should be in it and who would maintain it. At some point an engineer got a chance to speak and said something like "just use base-31 and omit vowels". The story as I remember it didn't mention the use of v or l33t-speak, but they were randomly generated, not maliciously constructed, values.

Also note that if you're too naive about checking for 'naughty' words, you get https://en.wikipedia.org/wiki/Scunthorpe_problem
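
To make the naive-check failure concrete, here is a small Python sketch. The one-entry `BLOCKLIST` is a mild placeholder and both function names are hypothetical; it only contrasts plain substring matching with whole-word matching.

```python
import re

BLOCKLIST = ["ass"]  # single mild placeholder term, for illustration only


def naive_contains(text: str) -> bool:
    """Plain substring check: flags any text that contains a blocked term."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def whole_word(text: str) -> bool:
    """Only flags a blocked term when it stands alone as a word."""
    lowered = text.lower()
    return any(
        re.search(rf"\b{re.escape(term)}\b", lowered) is not None
        for term in BLOCKLIST
    )
```

`naive_contains("classic")` is true even though the word is harmless, while `whole_word("classic")` is false. The whole-word version has the opposite weakness: it misses terms glued to other letters, which is exactly how usernames tend to be written.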

xdfgh11124y ago

Gov.uk did this. The code WNKR still ended up on Reddit yesterday.

tapland4y ago

Removing vowels would increase the likelihood of bad word combos, but removing consonants would have the desired effect.

smohare4y ago

It’s so laughable that we care about whether a generated string contains some temporally relevant profanity. We truly are still barbarians, and will be viewed as such by history.

gregoriol4y ago

It highly depends on what the purpose of the string is: if it can be clearly seen by the user, has to be read, or worse typed, then it's not just a random string in a database.

If the string is a url, imagine sending https://somesite/wanker to your client, when it actually could also be https://somesite/ay3ugd

tsimionescu4y ago

Well, given that we have a more than 5000 years old habit of looking for omens in random data to divine the future (whether that random data is scattered bones, laid out animal entrails, tea leaves, coffee grounds, tarot cards or so many others), it's unfortunate but not surprising.

Biganon4y ago

It's not that stupid. People will send the shortened link to other people, who might not understand that the string was randomly generated.

Veen4y ago

If you generate an identifier for an important client that contains "knobhead," they won't think it's a randomly generated string, but that someone at your company is deliberately insulting them.

pchristensen4y ago

We did a similar thing at Groupon after a customer’s coupon code contained an F bomb.

EvanAnderson4y ago

I removed the letter U from a random password generator for a Customer's app after a password was generated containing the "C-word".

mcintyre19944y ago

Hashids (https://hashids.org/#how-does-it-work) have a pretty clever trick for this. They’re able to encode multiple IDs to a single obfuscated hash, which works by reserving some characters from the alphabet to use as a separator between each encoded value. That guarantees that whatever characters you choose to be separators are never next to each other in the output. By default their separators are (lower + upper case) “c, s, f, h, u, i, t”

It worked surprisingly well when we used it.
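
The separator idea can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the Hashids algorithm (which also uses a salt and shuffles its alphabet): the names `SEPARATORS`, `DIGITS`, `encode` and `decode` are assumptions. Each number is written using only non-separator characters, and one reserved character is placed between numbers, so two reserved characters can never be adjacent.

```python
import re

SEPARATORS = "cfhistu"  # reserved characters, never used as digits
DIGITS = "".join(
    c for c in "abcdefghijklmnopqrstuvwxyz0123456789" if c not in SEPARATORS
)


def encode_number(n: int) -> str:
    """Write a non-negative integer using only the non-separator digits."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = []
    while True:
        n, remainder = divmod(n, len(DIGITS))
        out.append(DIGITS[remainder])
        if n == 0:
            break
    return "".join(reversed(out))


def encode(numbers: list[int]) -> str:
    """Join the encoded numbers with one reserved character between each."""
    if not numbers:
        raise ValueError("need at least one number")
    pieces = [encode_number(numbers[0])]
    for i, n in enumerate(numbers[1:]):
        pieces.append(SEPARATORS[i % len(SEPARATORS)])
        pieces.append(encode_number(n))
    return "".join(pieces)


def decode(text: str) -> list[int]:
    """Split on reserved characters and read each part back as a number."""
    numbers = []
    for part in re.split(f"[{SEPARATORS}]", text):
        value = 0
        for ch in part:
            value = value * len(DIGITS) + DIGITS.index(ch)
        numbers.append(value)
    return numbers
```

Because every encoded number is at least one character long, any word that needs two of the reserved letters side by side cannot appear in the output.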

geoduck144y ago

>the first one that popped up during the demo was a big ole F bomb

And this, ladies and gentlemen, is what it would show BEFORE the filter... but after (runs the code again, and prays it works) ... NO PROFANITY!

phamilton4y ago

To be clear, this was the list of "things we blocked".

"Had we not done this work, that link would have been sent out to one of our users." was very well received.

delsarto4y ago

In my help desk days I took a call from an irate person ranting about how we were telling her to "get a male sex change". Eventually I figured out she had become upset with "msexchange" showing up in the address!

CRConrad4y ago

Before Stack Overflow there was Experts Exchange. Their URL was of course those two words, all lowercase and mashed together into one... (Can't recall if they later inserted an underscore in between them?)

omegalulw4y ago

How statistically likely is this? Can't you just regenerate until you have a suitable short URL? Aside from performance, this is as random as you can be. Or generate the characters one by one and backtrack, which requires less random data. Or regenerate the unwanted substring.
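
The "regenerate until suitable" option is plain rejection sampling. A minimal Python sketch, assuming a URL-safe base64 alphabet and a placeholder blocklist; `BLOCKED`, `is_clean` and `generate_clean_id` are hypothetical names.

```python
import secrets
import string

# URL-safe base64 alphabet: A-Z, a-z, 0-9, '-' and '_'.
ALPHABET = string.ascii_letters + string.digits + "-_"
BLOCKED = ("darn", "heck")  # mild placeholder terms


def is_clean(candidate: str) -> bool:
    """True when no blocked term appears anywhere, ignoring case."""
    lowered = candidate.lower()
    return not any(term in lowered for term in BLOCKED)


def generate_clean_id(length: int = 25, max_attempts: int = 1000) -> str:
    """Draw whole random strings until one passes the check."""
    for _ in range(max_attempts):
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if is_clean(candidate):
            return candidate
    raise RuntimeError("no clean ID found; the blocklist may be too broad")
```

Discarding the whole string and retrying keeps the result uniformly distributed over the allowed strings, which is why it is "as random as you can be".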

phamilton4y ago

Just trying again is absolutely fine.

Our main concern was whether we needed to increase the size to 26 to account for the loss of keys. After doing the math, a 25 digit random string has a ~5% chance of containing one of 150 three or four character inappropriate substrings. That 5% loss isn't that big of a deal. But we had to figure out the math as part of due diligence before shipping.
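
A rough way to reproduce this kind of estimate is a union bound: count the positions where each blocked substring could start and multiply by the chance that it appears there. The Python sketch below is an assumption-laden illustration, not phamilton's actual calculation; `expected_hits` is a hypothetical name, and the 75/75 split between three- and four-character terms and the case-insensitive matching are guesses.

```python
def expected_hits(
    length: int,
    alphabet_size: int,
    term_lengths: list[int],
    variants_per_char: int = 1,
) -> float:
    """Expected number of blocked-substring matches in one random string.

    This is also an upper bound on the probability of at least one match.
    variants_per_char=2 models case-insensitive matching of letters.
    """
    total = 0.0
    for k in term_lengths:
        positions = length - k + 1
        if positions > 0:
            total += positions * (variants_per_char / alphabet_size) ** k
    return total


# Assumed mix: 75 three-character and 75 four-character terms.
estimate = expected_hits(25, 64, [3] * 75 + [4] * 75, variants_per_char=2)
```

Under those assumptions `estimate` comes out at about 0.054, the same ballpark as the ~5% quoted above. Almost all of it comes from the three-character terms, so the real figure depends heavily on how the list is composed and how matching is done.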

Waterluvian4y ago

Fascinating! I guess an easy solution is to inject non alpha characters into any generated string. I imagine a constraint was that you wanted them to be easy to type?

phamilton4y ago

SMS is the biggest constraint. Unicode characters trigger lower segment char limits (effectively doubling the cost of a 71 char text message). And also it's important that the links can be clicked on a smartphone. So url-safe base64 (some shorteners use base62). And numbers can be N4u6hty too, so you gotta catch those cases.
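
Catching digit substitutions usually means normalizing the string before checking it. A small Python sketch, with an assumed (and deliberately short) substitution table; `LEET` and `normalize` are hypothetical names.

```python
# Assumed digit-to-letter substitutions; real tables are larger and ambiguous
# (for example, "1" can stand for either "i" or "l").
LEET = str.maketrans(
    {"4": "a", "3": "e", "1": "i", "0": "o", "5": "s", "6": "g", "7": "t"}
)


def normalize(text: str) -> str:
    """Lower-case the text and undo common digit-for-letter swaps."""
    return text.lower().translate(LEET)
```

With this table, `normalize("N4u6hty")` gives `"naughty"`, so a blocklist check can run on the normalized form instead of listing every spelling.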

marcus_holmes4y ago· 27 in thread

We used to call this the "Scunthorpe problem" - how do you filter for obviously rude names while still allowing people to have their actual names? Bearing in mind that some people's names are actually rude.

I worked on the application form processing for the Nectar card launch in the UK back in the early 2000's, and we had several cases. Luckily we had human data-entry clerks in the loop, so all we had to do was flag when a name contained anything on the "Scunthorpe list" and get a human to look at it. Even then it wasn't perfect and a few slipped through. One of the early PR messes of the launch was someone getting a Nectar card issued with a rude name, and of course they immediately went to the press with it [0]

I'm interested to see they still haven't solved this [1]

[0] I never saw the journalistic interest in this: this guy said his name was <rude name>, and then signed the form to say everything on the form was true. Why is anyone surprised or interested that we accepted his name was what he said it was and gave him a card in that name?

[1] https://metro.co.uk/2015/02/20/woman-refused-sainsburys-nect...

qwerty4561274y ago

People should just be taught that words are just words and that they don't have to take them seriously. (I consider this an essential skill that should be part of the basic school curriculum to psychologically prepare people for life; many support/sales professionals learn it quickly, but the rest of the people don't.) No word is rude until you consider it rude. No word can actually harm you directly. It's you who is in charge of deciding what to feel whenever you hear/see a particular word. Sadly, most people just ignore this job. I personally enjoy being called by the derogatory nickname for my ethnicity because I just like how it sounds. For the first half of my life I didn't even know it was supposed to be insulting (as I didn't know actual racists exist and can be serious about hating/judging people by just their ethnicity).

marcus_holmes4y ago

We used to be taught that "sticks and stones may break my bones but words will never hurt me".

I'm not so sure that has ever been true. Emotional bullying was always a thing, and it did hurt. I'm glad that we're now taking that more seriously.

But it's also true that my emotions are my responsibility, and my emotional reaction to words is unique to me. I cannot force everyone else to be responsible for how I feel about their words. And I cannot expect everyone else to anticipate how I might feel about what they want to say and modify their speech accordingly.

There's a balance in there. I suspect that this balance is what we call "good manners" or "politeness".

kypro4y ago

I'm quite a feminine guy and I was always called various homophobic slurs at school. Even though I wasn't gay I didn't understand why it would matter if I was so I would just say "ok".

Even when the insults could be considered objectively offensive I'm still generally okay with it. I've been called names quite a bit in the workplace over the years and I'm okay with it if makes them happy. Being autistic I don't really care and neurotypical people seem to get enjoyment calling people names so it's win-win.

High EQ individuals have told me this is wrong and the appropriate reaction would be to take offence and try to make them feel bad, but I've argued this would just result in an objective reduction in happiness in the world so it wouldn't make sense. Plus, most of the time my colleagues have been nice to my face so it's not like it's ever got in the way of me doing my job.

secondaryacct4y ago

What's rude is the shared imagery, not the words. If I call myself MyDirtyDickInYourDogBringsMeToOrgasm, imagining this scene is disgusting enough to ask me to change it.

Words are not just symbols on paper, they are image triggers in your brain.

dynamite-ready4y ago

I disagree. Language is very often used as a way to include and exclude people in and from various social groups. You're right in that there's never any direct harm in the sound of the words themselves. But you can say the same thing about verbally threatening physical harm (or even psychological harm) to someone, and would be very wrong in your assumption.

Angostura4y ago

> No word can actually harm you directly.

"Sticks and stones may break my bones, but words will never hurt me" is something told to children in the playground who are too young to know better.

You might as well argue that there's no such thing as malware, because its entirely harmless until executed on a processor.

Cthulhu_4y ago

So "get over it" is your solution? Sounds like you've never been a victim. Place yourself in someone else's shoes please.

I mean I don't like the (self) censorship either and don't mind rude or mature words, since we're all mature here and calling a fucking asshole a f*ing a$$h0le is an insult to people's intellect.

imglorp4y ago

> Uhura : But why should I object to that term, sir? You see, in our century, we've learned not to fear words.

raxxorrax4y ago

I think words can hurt people in certain contexts, but it is very rarely about a single term.

noneeeed4y ago

It's getting particularly problematic for international sites. What constitutes offensive language in one country might be a perfectly normal name, or product, in another country that uses the same language, let alone in another language.

There's been a running joke on the UK Reddit subs recently about people getting short bans for using the term "faggot" as it's now on some automatic, site-wide blocklist. The bans are completely context free, so people are supposedly getting banned for discussing the food product, a kind of large meat-ball made primarily from minced offal. The same happens with "fag", meaning a cigarette.

dkdbejwi3834y ago

There's the opposite problem too, where "fanny" has a different meaning in the UK vs. USA. There's the classic line from "The Office" about it ("over there fanny means your arse. Not your... minge") and I think most folk are pretty aware of the American meaning here these days.

I wonder, is it too much to ask that we (as in, the various English speaking places) understand the different meanings of words that are offensive in one dialect but have a different and mundane meaning in another dialect?

chx4y ago

That this system is so primitive doesn't surprise me at all. After all, people with a family name "Null" often get database errors -- despite even Oracle, which can't tell null and empty string apart, decidedly has IS NULL vs == "NULL" comparisons... https://www.bbc.com/future/article/20160325-the-names-that-b...
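
The distinction between a missing value and the surname "Null" is easy to show with Python's built-in `sqlite3` module. SQLite is used here only as a convenient stand-in (it is an assumption that the affected systems behave comparably), and the `people` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (surname TEXT)")
# One person whose surname really is "Null", and one row with no surname.
conn.executemany(
    "INSERT INTO people (surname) VALUES (?)", [("Null",), (None,)]
)

# IS NULL finds only the row with the missing value.
missing = conn.execute(
    "SELECT COUNT(*) FROM people WHERE surname IS NULL"
).fetchone()[0]

# A bound string parameter finds only the person named "Null".
named_null = conn.execute(
    "SELECT COUNT(*) FROM people WHERE surname = ?", ("Null",)
).fetchone()[0]
```

Both counts are 1. The trouble usually starts in application code that turns values into text and back, where the string "Null" and a missing value can end up looking the same.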

newforms4y ago

Little Bobby Tables struggles

exar08154y ago

It's a similar problem to people having names that are considered "fake". I shudder to think what people actually named Harry Potter or James Bond have to go through regularly.

Also, I don't see any sense in actually enforcing any of those lists. I always think about the "Journey of Life" in The Grand Tour, which, frankly, is completely inoffensive in any other language than English https://www.youtube.com/watch?v=BLXe2WTYngQ

marcus_holmes4y ago

I live in Wedding, pronounced "Vedding". It was weird for about the first week, then it never really occurred to me again. I'm sure the people who live in Fuck have the same, right up until they have to fill in their address online.

raxxorrax4y ago

Seems like a rude name might give you some privacy advantages in the 21st century. At the price of being excluded by random services maybe.

Old math prof of mine was named Dr. Cock, and he wasn't the only one, the name was common. Students called him Dr. Octocock.

st_goliath4y ago

I have a friend who studied at the University in Linz, way back when. According to him, on the university's System/370 they used to have an account name scheme where they simply concatenated the first 2 letters of the first name and the first 3 letters of the last name. Allegedly, the scheme got changed after Professor Arno Schulz (https://de.wikipedia.org/wiki/Arno_Schulz) was understandably upset about his account name (https://en.wiktionary.org/wiki/Arsch).
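
The naming scheme as described fits in one line of Python. The function name `account_name` and the lower-casing are assumptions; the original system's exact rules are not known.

```python
def account_name(first_name: str, last_name: str) -> str:
    """First 2 letters of the first name plus first 3 of the last name."""
    return (first_name[:2] + last_name[:3]).lower()
```

`account_name("Arno", "Schulz")` yields `"arsch"`, which shows how a purely mechanical scheme can produce an unfortunate result with no one intending it.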

mcintyre19944y ago

Random story I remember my ICT teacher telling us in school along the same lines as Scunthorpe. They installed some nonsense web blocking thing that schools like to have and a lot of teachers complained because it blocked weightwatchers.com

cauliflower994y ago

georgyo4y ago

I only recently learned about Scunthorpe because the username I've been using for 15 years is now affected.

Georgyo has orgy right in the name. More and more places are refusing to accept it when signing up.

OJFord4y ago

> We used to call this the "Scunthorpe problem"

Fwiw, it's well-known as such, after relatively high-profile incidents involving people with that address: https://en.wikipedia.org/wiki/Scunthorpe_problem

marcus_holmes4y ago

Is that just well-known in the UK though? I've mentioned this to people outside the UK and only occasionally have they known it.

chanandler_bong4y ago

The city of Toppenish in the US had the same problem when the council turned on a generic filtering system for the city networks. Everything stopped working.

chrisdengso4y ago

Haha, I call this the "Clbuttic" problem. But nowadays it can be solved with machine learning fairly easily https://moderationapi.com/blog/moderate-text-automatically-u...

marcus_holmes4y ago

That's a slightly different problem from the Scunthorpe problem.

You have a customer who gives her name as "Fanny Batter". Is that a permissible name?

What about "Juan Kerr"? Or "Amanda Huggenkiss"?

It's not about replacing "ass" with "butt", but detecting rudeness/humour in context.

I'd be amazed if machine learning can solve this one. It's extremely hard for humans.

* all those names should be rejected btw

xsmasher4y ago

The clbuttic problem involves replacement; original source is (AFAIK) here:

https://thedailywtf.com/articles/The-Clbuttic-Mistake-

globular-toast4y ago

If nobody ever implemented a rude word filter, nobody would ever need to implement a rude word filter.

rhema4y ago· 22 in thread

I have a hard time believing this was / is the real version used. It doesn't seem broad enough. More likely it was a kind of smoketest that made sure that a more automated keyword checker was working.

It does remind me of the XKEYSCORE (Snowden leaks) that used keywords to bubble up potential threats from emails etc https://www.businessinsider.com/nsa-prism-keywords-for-domes... .

ridaj4y ago

This looks like legit, no-nonsense gets-the-job-done code that gets updated every time some jerk finds a new way to be a jerk to others. It isn't great, but at least it's not over-engineered, and I'm not sure if Twitch account sign-up volumes and abuse are at the point where they should staff a project to do this more robustly / scalably

OskarS4y ago

Exactly what I think. This is the kind of code that doesn't have a platonic ideal, it has to get updated with time and experience and reports. There is no "non-hacky" way to do this, you just have to look at the reports that are coming in and keep adding rules that are relevant.

notjustanymike4y ago

There are portions of my codebase that are intentionally "dumb" code. They contain cascading rules controlling what UI elements are visible that can be challenging to reason about. So I wrote it so simple that anyone can read it.

I see the same here. It's not clever, but no one has any doubt what words are being checked.

baud1472584y ago

it's missing a few slurs, I'm not sure it gets updated (or it's a filter elsewhere which gets updated)

throwaway9843934y ago

It seems you are assuming that software is usually written well, or as well as it can be. It's much more likely to be the opposite.

MattGaiser4y ago

This seems like the kind of thing that would be horribly specced and be a user story along the lines of "the user must not be allowed to make an inappropriate username."

The engineer would write something for every test case the product manager complained about, anything else computationally easy, and call it a day.

I once had to implement an audit logging system. What was supposed to be logged? "Important actions." Nobody on the team could define it. We just logged every database write along with the username responsible and called it a day. Nobody ever followed up or inspected it.

Same deal. Both exist mostly for compliance.

jldugger4y ago

Would be unsurprised if some poor engineer got assigned the project, realized it was an intractable mess of scunthorpe, and decided to check some boxes and move on to some ticket of higher value.

mc324y ago

It also mostly checks for English naughty words and not much else. People can have fun in lots of other languages, so it would seem this is a small sample.

unnouinceput4y ago

You can't do all of them. Here is an example: in my native language Pula is a slur for male genitalia (way worse than D*ick in English) but at the same time it's the name of Botswana's currency (https://en.wikipedia.org/wiki/Botswana_pula) and also a city in Croatia (https://en.wikipedia.org/wiki/Pula).

Igelau4y ago

Some near the end looked like they might be in another language, but I won't be the one to find out.

It used to be "if I search for this term, am I accidentally going to wind up getting goatse or something?" The good old days.

Now it's "if I search for this term, is the FBI going to kick my door in?"

Agentlien4y ago

After looking through it quickly it seems to do the same as most profanity checks with Dutch: it doesn't block "kut" (a crude word for female genitalia) but it does block "kunt" (second person form of "can")

AmericanChopper4y ago

And specifically Italian blasphemy for some reason…

formerly_proven4y ago

Bunch of ineffective entries too, all patterns containing underscores won't ever match.

ridaj4y ago

Is Twitch's footprint big enough internationally that they have to worry about it?

gowld4y ago

That list is a list of words chosen by William Knowles to taunt any NSA who may be listening.

It's not a list of words used by the NSA or any spies. https://attrition.org/misc/keywords.html

geoduck144y ago

In a previous life, all of our code was scanned for "vulnerabilities". One of the issues they looked for was if passwords were being stored in local variables. Initially, LOTS of people would do something like:

    $Username=<username>
    $Password=<password>
    Connection.string=($Username, $Password)

The parser would flag this - Password was being stored to a variable! So we just changed our code:

$pw=<Password>. Problem solved!
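
A scanner of the kind described might look something like this Python sketch. The real tool's rules are not known; the regular expression, the `$name=` syntax it expects and the function name `flagged_variables` are all assumptions.

```python
import re

# Matches assignments such as "$Password=..." at the start of a line.
ASSIGNMENT = re.compile(r"^\s*\$(\w+)\s*=", re.MULTILINE)


def flagged_variables(source: str) -> list[str]:
    """Return assigned variable names that contain the word 'password'."""
    return [
        name
        for name in ASSIGNMENT.findall(source)
        if "password" in name.lower()
    ]
```

A check keyed on the variable's name flags `$Password` but returns nothing for `$pw`, even though the same secret is stored in both cases.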

comrh4y ago

I can't find the article now but the developer who wrote these scripts said they were a singular effort from years ago before security was taken over by a more formal development team.

woodruffw4y ago

Someone in another thread mentioned that these might be part of corpus generation for an ML model. That would make more sense to me.

gipp4y ago

Based on the filepath given in this very thread, it seems plain that that's the case (safety-ml\offensive-usernames\data_pull\sql\bad.sql).

cbsmith4y ago

Training an ML model to "learn" a rules engine strikes me as an incredibly bad practice. It'd make more sense to just have an actual corpus of labeled data.

makomk4y ago

Seems plausible to me. One of the streamers on Twitch had an actual wooden board on the wall he lasered subscriber names onto, and some of the regulars had fun finding lewd usernames to gift subs to. There were quite a lot of them out there. It was kind of a running joke how much Twitch let through the cracks.

b1124y ago

If valid, this means virtually every phone call, and every email (with clients) is flagged. :P

This must be more nuanced. Maybe, it's "additional algo processing when a word is hit", eg another layer before "involve human".

sorenjan4y ago· 11 in thread

Reminds me of the guy that streamed a talking banana on Twitch, where viewers could make it say things. People submitted variations of the n-word and got him banned, and after trying to filter out all character combinations he could think of he wrote a phonetic filter. That apparently worked much better than trying to think of every permutation of characters that sounds like bad words.

https://youtu.be/bJ5ppf0po3k?t=715
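
One classic way to compare words by sound rather than spelling is Soundex. The Python sketch below implements American Soundex; it is an assumption that this resembles the streamer's filter, whose actual method is only described in the video. The example words are deliberately innocuous.

```python
# Consonant groups that share a Soundex digit.
CODES = {}
for letters, digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the 4-character American Soundex key for a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = [letters[0].upper()]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            key.append(code)
        previous = code  # vowels reset the "same code" check
    return ("".join(key) + "000")[:4]


def sounds_like(candidate: str, blocked_terms: list[str]) -> bool:
    """True when the candidate shares a Soundex key with any blocked term."""
    return soundex(candidate) in {soundex(term) for term in blocked_terms}
```

Different spellings that sound alike collapse to the same key (for example "Robert" and "Rupert"), so one blocklist entry covers many character permutations, at the price of more false positives.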

geoduck144y ago

>the guy that streamed a talking banana on Twitch

Of all the ways I expected a talking banana to backfire, I didn't expect this one. Thanks for sharing

pjc504y ago

Anything exposed to the open internet devolves into porn or racism or both unless active effort is made to prevent it. I'm reminded of https://en.wikipedia.org/wiki/Tay_(bot)

LadyCailin4y ago

Wait, now I’m extremely curious what failure modes one would expect from a talking banana.

rzwitserloot4y ago

The list included mike hawk (phonetically similar to 'my...'), so they are interested in phonetics, apparently, even for these usernames. The banana streamer has their stuff set up better than twitch, then.

grumpwagon4y ago

From the video: "It was a mix of racism and creativity"

That sums up the WoW Classic (and I'm sure many other gaming communities) a little too perfectly.

monksy4y ago

That was both terrible and amazing.

tclancy4y ago

They figured that out on Ellis Island so yeah. Soundex.

dredmorbius4y ago

Are you saying Ellis Island used Soundex (that seems to check out) or devised it?

SergeAx4y ago

Ellis Island is the one with that big green statue. What does it have in common with Soundex?

oars4y ago

That is oddly hilarious yet really dark at the same time. Thank you for sharing the story about the talking banana.

Lammy4y ago

But now it's anti-Chinese if you can't say 那個 :v

msdrigg4y ago· 10 in thread

What's this one about?

    CREATE OR REPLACE FUNCTION is_blasphemy (VARCHAR) RETURNS BOOLEAN STABLE AS $$
     SELECT replace($1,'_','') SIMILAR TO '%p(o|0)rc(o|0)di(o|0)%'
     OR replace($1,'_','') SIMILAR TO '%p(o|0)rc(o|0)mad(o|0)nna%'
     $$ LANGUAGE SQL;
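
For readers less familiar with SQL's `SIMILAR TO`, here is a rough Python equivalent of the function quoted above. It is a reading of the snippet, not code from the leak: underscores are stripped first, and the `%` wildcards on both sides make each pattern a substring match in which every "o" may also be written as "0".

```python
import re

# Same two patterns as the SQL, with (o|0) written as the class [o0].
PATTERNS = [
    re.compile(r"p[o0]rc[o0]di[o0]"),
    re.compile(r"p[o0]rc[o0]mad[o0]nna"),
]


def is_blasphemy(username: str) -> bool:
    """Strip underscores, then look for either pattern anywhere."""
    stripped = username.replace("_", "")
    return any(pattern.search(stripped) for pattern in PATTERNS)
```

Like the SQL, this sketch is case-sensitive, so it only catches upper-case variants if the username is lower-cased somewhere before the call.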

gary_04y ago

https://www.urbandictionary.com/define.php?term=porco%20dio

mensetmanusman4y ago

Sheesh

c) "Porco dio saranno mica i testimoni di Geova? No eh diocan digli che i signori sono fuori, non ho tempo per stargli dietro." ("Fuck, they can't be Jehovah's witnesses, can they? Tell them we're out, we don't have time for their shit.")

Igelau4y ago

That's a beautiful swear. Far from naughty, I feel enriched having learned this.

msdrigg4y ago

I tried googling it but I put it all in one word. Thanks for the help!

__void4y ago

It's not even remotely complete. In Italy we have two regions dedicated to the creation of blasphemies so advanced in ingenuity that two telephone books in regexp would not be enough to stop them

shadilay4y ago

That one was my favorite too. Filter lists don't work well in a multi lingual world. https://www.reddit.com/r/Rainbow6/comments/a01w7q/got_banned...

OJFord4y ago

Mine is `create or replace function is_tragedy`! (Not the tragedy itself of course, though I confess I'm unfamiliar, just that line specifically.)

alkz4y ago

that should actually be porcAmadonna

marton784y ago

Grazie.

cbsmith4y ago

Italian

prawn4y ago· 5 in thread

If you liked this chaos, you'd love my 15+ years of cobbled-together efforts at limiting forum spam and the like. I suddenly don't feel quite so alone in the myriad efforts needed to tackle this sort of thing.

anigbrowl4y ago

It's an endless arms race. I spend a lot of time studying unsavory people and a great deal of effort goes into the development of new dogwhistles that are designed to either provoke or connect with peers while maintaining deniability. You might like this paper on the evolutionary dynamics of covert social signaling: https://www.nature.com/articles/s41598-018-22926-1

prawn4y ago

I can really identify with that. My efforts are against spam and trolls; at least the spam is predictable and easier to take a sledgehammer to! Those grey-area trolls are such a miserable part of online content and moderation.

exporectomy4y ago

If they're covert and only members of the group using them know what they mean, then why fight it? Seems like a helpful way to communicate about things that others might not like to hear. The paper you linked said that too - "Such signals may allow coordination and enhanced cooperation while also avoiding the alienation or hostile reactions of individuals with different preferences.".

This happens all the time in cartoons that appeal to children but also contain subtle adult jokes so everyone can enjoy them on different levels.

legohead4y ago

At an online company I worked at we had various word filters for our forums, but new stuff was always popping up and getting through.

What worked in the end was having any newly created thread send a message containing the post title & body to a slack channel specifically for monitoring the forums. Employees and our forum moderators were in there, and any bad threads were nearly instantly deleted. Eventually the spammers mostly gave up. Hard to beat a dozen human brains :)

prawn4y ago

I get alerted to every new thread, and when I'm online my response rate is also very fast. When the message count was under 1,000/week, I used to get an email for every single post too. But if other moderators aren't around and I'm offline, I am out of luck. Shadow-banning can be effective. I also give regulars the ability to sin-bin any post which removes it from view and leaves it for me to check.

Having a dozen reviewers, especially if spread across timezones, would be a dream!

buildsjets4y ago· 5 in thread

My wife used to name her RPG characters “Isis” after a cat we used to have. Used to have to explain to vets that we named her Isis years before Bush and Cheney created Isis by starting their illegitimate war in Iraq.

mellavora4y ago

Better yet, my children's school had a parent support portal called Isis.

I have some fun emails whose subject line is "here are your ISIS family log-in details"

and this was right around the time the terrorist group was frequently in the news

ggrelet4y ago

Are people not familiar with the Egyptian goddess, though?

stragies4y ago

In 2021, hardly anybody remembers Isis to be the Goddess of Love who invented marriage

mikojan4y ago

Honestly, what a great way to start _that_ conversation haha

alentist4y ago

ISIS was founded in 1999.

The Iraq War began in 2003.

Bush's administration ended in 2009.

ISIS gained power in 2014, 5 years into Obama's administration.

oroul4y ago· 3 in thread

> LIKE '%aggin%'

Looks like "Baggins" is banned.

Poor Bilbo...

CobrastanJorji4y ago

I feel like you could make an interesting game out of this. Given these rules, find the best "false positive," a realistic and inoffensive, but banned username. My best so far are "brownie_gurl" and "Megasthenes."

b1124y ago

Sure, but...

The story we know was written by him. I wonder what the trolls would say, or the dead dragon, or the town that his actions helped destroy?

Is he a hero, or, just the guy who wrote it all down?

And just when something really important happens, he throws a powerful weapon (the One Ring) at his nephew and goes away to retire!

A life led with riches (gold from the trolls), a ring granting him extremely long life and health, and yup.. off he goes, first sign of real trouble.

Poor Bilbo indeed!

tjpnz4y ago

So is Mike Hunt.

Poor Mike.

cookiengineer4y ago· 3 in thread

I wonder why they didn't go with syllables and something like a Levenshtein distance to syllables?

That way they could more easily maintain similar looking and sounding words, including leetspeak and other variants.

Because with this kind of approach, something like "yolocaust" will get through, as most checks only go with exact matches and permutations will always get around it easily...

Whereas with something like a Levenshtein distance you could compare the username against a set of words and syllables, and if it looks too similar (e.g. 90% similar relative to the username's length), you could simply block it.
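A minimal Python sketch of that idea, assuming the whole username is compared against each blocked word and that similarity is defined as one minus the edit distance divided by the longer length. The word list (`BLOCKED_WORDS`), the function names, and the 0.8 threshold are hypothetical placeholders, not anything from the leaked file.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, falling towards 0.0 as they diverge."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Hypothetical placeholder list; a real deployment would load its own.
BLOCKED_WORDS = ["badword"]


def too_similar(name: str, threshold: float = 0.8) -> bool:
    """True if the normalized name is within the threshold of any blocked word."""
    name = name.lower().replace("_", "")
    return any(similarity(name, word) >= threshold for word in BLOCKED_WORDS)
```

With these placeholders, `too_similar("badw0rd")` is true, but so is `too_similar("badwood")`: a fuzzy match catches one-character evasions and harmless near-neighbours alike, so the threshold trades missed variants against false positives.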

PrincessJessOP4y ago

Agreed. If it's even close to a no no word it should be banned completely. No more riggers, naggers, poggers, biggas, lucks, bunts, minks, bikes, or trikes. This is a doubleplusgood plan.

cookiengineer4y ago

Maybe usernames in general should require a higher entropy than words in a dictionary :D

guywhocodes4y ago

People who hate naggers don't like Levenshtein either

trhway4y ago· 3 in thread

Looks like it is limited to English only. There is a whole world of non-English offense out there, like using English characters to write offensive words from other languages, as well as using other alphabets' characters to write offensive English words.

mjevans4y ago

English only

Should optimize checks with a de-obfuscation function (attempt to expand non-AZ back to AZ, even if that then shoots permutations at banned word-runs).

It should probably also look more like a spam scoring system, where really obvious stuff is hard-trashed but borderline things are flagged for review / discussion.
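A rough Python sketch of those two ideas together: fold look-alike characters back to a-z, then score the result and pick one of three outcomes. The character map, the weighted terms, and the thresholds are all hypothetical placeholders chosen for illustration.

```python
# Hypothetical look-alike map. "1" is ambiguous (i or l) and is mapped to "i" here.
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})

# Hypothetical placeholder terms with weights: high = obvious, low = borderline.
WEIGHTS = {"badword": 10, "meh": 3}


def deobfuscate(name: str) -> str:
    """Lowercase, map look-alikes to letters, then drop everything that is not a letter."""
    lowered = name.lower().translate(LEET_MAP)
    return "".join(ch for ch in lowered if ch.isalpha())


def score(name: str) -> int:
    """Sum the weights of every listed term found in the de-obfuscated name."""
    clean = deobfuscate(name)
    return sum(weight for term, weight in WEIGHTS.items() if term in clean)


def classify(name: str, reject_at: int = 10, review_at: int = 3) -> str:
    """Hard-reject obvious names, queue borderline ones for a human, allow the rest."""
    total = score(name)
    if total >= reject_at:
        return "reject"
    if total >= review_at:
        return "review"
    return "allow"
```

Under these assumptions `classify("B4d_W0rd")` returns `"reject"` and `classify("m3h_gamer")` returns `"review"`, so only the borderline case reaches a moderator.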

I am also very disturbed that, as with most censorship, 'obviously bad' things such as terrorism/etc are co-mingled with 'is adult' as a negative check.

It seems reasonable for Twitch to have validated 'safe for minors' areas where names are filtered. Generic areas, where things are in the gray area and unchecked. Adult Only areas, where swears, profanity, maybe even some of the hateful things are allowed. Informed consumer choice.

wdutch4y ago

It also includes a couple of Italian swearwords in the function `is_blasphemy`

Ichthypresbyter4y ago

And exactly one German one in 'is_profanity'

1 more reply

ancharm4y ago· 3 in thread

Is there a blog or something where someone is going through the dump and summarizing?

djrockstar14y ago

Your best bet atm is to just look through reddit/hn comments/posts people make as they find stuff. The leak's too big for one person/team to quickly find all spicy stuff.

PrincessJessOP4y ago

https://sizeof.cat/post/twitch-leaks/

PrincessJessOP4y ago

https://sizeof.cat/post/twitch-leaks/

LinuxBender4y ago· 2 in thread

This script feels like this is something that:

- should be in its own application with its own rules engine so you don't accidentally whack a bunch of usernames

- I would have done in the past and cringe when people ask me to update it.

smoldesu4y ago

I feel bad for the Amazon employee who came into work one day with this project sitting on their desk.

woodruffw4y ago

My understanding from talking to {current,former} {Amazon,Twitch} employees is that Twitch has retained a decent amount of engineering independence. For better or worse, it's unlikely that some rando at Amazon ended up with this particular PHP file on their desk.

3 more replies

foota4y ago· 2 in thread

Man, I was _not_ expecting it to be stored procedures lmao.

dperalta4y ago

Well, it's a good idea, actually.

cube004y ago

As long as you don't mind a cheeky production release to get new words urgently added once the next bot wave hits.

mdrzn4y ago· 2 in thread

Aw, why can't you be nicknamed 420blazeit?

That's not even offensive and there are still way more weed-nicknames available, so I don't get it.

teddyh4y ago

You can if you’re a magic fire sprite:

http://www.threepanelsoul.com/comic/picked-out

no1lives4ever4y ago

420 == marijuana, blazeit - self-explanatory.. not everyone is fine with it..

monksy4y ago· 2 in thread

Oh classic Mike Hunt, poor guy.

TrackerFF4y ago

I see they got Mike Hawk, too. But Mike Litoris is still free to use his/their own name.

tait4y ago

We had a legitimate Michael Hunt at our school. He went by Mike.

ibudiallo4y ago· 2 in thread

I used to work for a company that automated customer service. While we were trialing with a new client, our AI service responded to a customer: "Hey B**, thanks for reaching out."

To our surprise, the end user did not feel offended at all. In fact, they were happy because we responded instantly instead of the usual 24 to 48 hours.

[Story]: https://idiallo.com/blog/do-you-make-your-customers-wait

_hilro4y ago

The word is bitch so the sanitised version would be B**

Why portray the bad word as 3 letters - it doesn't make sense.

OJFord4y ago

Ha, the same thing probably happened to GP as has happened to you - unescaped * characters (escape with `\`) resulting in two imperceptibly italic asterisks.

1 more reply

Hippocrates4y ago· 2 in thread

Doing this in SQL is absolute insanity. I’ve seen many pieces of code grow in this way and I understand how it happens, but it still surprises me to see how scrappy things are under the hood at some big-time companies.

who-shot-jr4y ago

Hello, what would be a better way to do it? In code? Regex?

Hippocrates4y ago

Probably yes, in code with regex. To name a few advantages:

1. Better readability and organization. Rules and word lists can be abstracted to more of a config format.

2. Possible to easily store in a database and support dynamic additions/changes to the rules and words.

3. Better accountability of performance. Able to use profiling to catch any perf issues in the rules.

4. Testable with unit tests.
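As a sketch of what "rules as config" could look like in Python: the rules live in a plain list that could just as well be loaded from a file or a table, and the matcher is a small pure function that is easy to unit test. The `Rule` type, the categories, and the patterns are hypothetical placeholders.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    category: str
    pattern: str  # regular expression applied to the normalized username


# Hypothetical rules; in practice these would come from config or a database.
RULES = [
    Rule("profanity", r"b[a4]dw[o0]rd"),
    Rule("spam", r"free.*coins"),
]

# Compile once at load time rather than on every check.
COMPILED = [(rule.category, re.compile(rule.pattern)) for rule in RULES]


def matched_categories(username: str) -> list[str]:
    """Return every rule category whose pattern occurs in the username."""
    normalized = username.lower().replace("_", "")
    return [category for category, regex in COMPILED if regex.search(normalized)]
```

A test then reads as a plain assertion, e.g. `matched_categories("B4D_word") == ["profanity"]`, and adding a word becomes a data change rather than a code change.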

swman4y ago· 2 in thread

So what is a good practice for this kind of thing? I’ve got an app where users create a display name.

Is there a legit service I can use or some actual well tested library that can help me? I’m using node and Go so either language.

vuciv14y ago

Give me a few weeks

swman4y ago

Are you making such a service?

busymom04y ago· 2 in thread

> is_marijuana

Looks like Twitch doesn't like weed.

anigbrowl4y ago

'Angry parents say gaming site Twitch promotes drug use, slow news day story at 10'

jrodthree244y ago

It's okay. Instead of watching a streamer called 420blazeit, they can watch Amouranth sucking on a microphone.

imnitishng4y ago· 2 in thread

Why is all of this implemented in SQL? Wouldn't it be better to do it in code with dedicated methods to filter stuff out? IMO logic inside of SQL queries just adds unnecessary complexity, implementing this in code would've been maintainable and testable.

dvh4y ago

Boss: "Hey Bob! People started spamming one of our boards, we are currently busy doing other things and cannot deploy a new client, can you make a filter with these 3 words and deploy it ASAP?"

Bob: "Sure, no problem if it's only temporary"

-- a few months later

Chief architect:

"People are spamming more and more, we should design a new system for these 234 new bad words, I need a team of 7, two backend guys, 5 frontend guys and 4 weeks. It will also require a minor rewrite of a few external components."

Boss: "geez, we're in the middle of a sprint right now, Bob can you add these 234 words to the existing filter? Make sure it's in production before lunch, thanks" (checks watch) "I have to go now, meeting with a customer, bye".

bni4y ago

It is FUNCTIONS so it is code like any other. Even SQL is code (declarative).

The reason is: This can be updated very easily and out-of-band of other deploys of the main application code.

To make this work it has to be super agile to update. Probably this version we are looking at is an old version that happened to be put in git. The functions in prod probably have several additions since.

mgdlbp4y ago· 1 in thread

More high-profile profanity filters open for viewing: edit filters used in Wikimedia projects, which also together encompass many languages.

The main 'bad words' filter of English Wikipedia:

https://en.wikipedia.org/wiki/Special:AbuseFilter/384

The page (on en:wiki) listing all filters, which also have uses other than detecting abuse:

https://en.wikipedia.org/wiki/Special:AbuseFilter

The special page has the same title in other editions of Wikipedia and other Wikimedia wikis, though many filters are set to hidden. Dutch Wikipedia, for example:

https://nl.wikipedia.org/wiki/Special:AbuseFilter/10

cesarb4y ago

A more traditional one, older than these you linked: https://en.wikipedia.org/wiki/MediaWiki:Titleblacklist and https://meta.wikimedia.org/wiki/Title_blacklist

neither_color4y ago· 1 in thread

It seems NVIDIAMINER could be controversial.

iab4y ago

Language friend, this is a family site

NiceWayToDoIT4y ago· 1 in thread

If "fork" and "shirt" are allowed I am fine. :)

You can still use any other language except English to achieve the goal.

appel4y ago

Unexpected "The Good Place" :)

legostormtroopr4y ago· 1 in thread

Poor Bilbo and Frodo - they'll never be allowed to use Twitch:

CREATE OR REPLACE FUNCTION is_hateful (VARCHAR) RETURNS BOOLEAN STABLE AS $$ SELECT ... OR replace($1,'_','') LIKE '%aggin%'
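For anyone who wants to poke at it, here is a small Python approximation of that one clause. It assumes the input is already lowercase (the quoted SQL does not show any case folding) and treats `LIKE '%aggin%'` as a plain substring test; `like_contains` is a made-up helper name.

```python
def like_contains(username: str, fragment: str) -> bool:
    """Rough equivalent of: replace($1, '_', '') LIKE '%fragment%'."""
    return fragment in username.replace("_", "")


# Usernames are illustrative; the fragment is the one quoted from the leaked SQL.
for name in ["bilbo_baggins", "frodobaggins", "federicofaggin", "samwise"]:
    print(name, like_contains(name, "aggin"))
```

Every name except `samwise` is flagged, which is the Scunthorpe problem in miniature: a substring rule cannot tell a surname from a slur.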

jodrellblank4y ago

Or designer of the first commercial microprocessor, Intel's 4004 https://en.wikipedia.org/wiki/Federico_Faggin

aasasd4y ago· 1 in thread

Ah, so that's the Word Filter that Twitch directly mentioned as the majority of their effort to fight hate comments (as cited in the main thread on the leak).

Also, Lisa Pedro is not welcome there.

viraptor4y ago

Let's not jump to conclusions. We don't know where this specific filter was used, is it an automated blocker or just flagging for moderation, is it the only layer, or is it even the currently used one.

lionkor4y ago· 1 in thread

So... This ignores lookalike Unicode from other languages? Or does SQL know that, for example, `c` (ASCII c) looks like `с` (Cyrillic s)?

input_sh4y ago

It only allows English characters to begin with. It doesn't even do Latin Extended characters like čćšđ, let alone non-Latin.
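To make the lookalike point concrete, a short Python sketch: the two characters are different code points, so string matching alone never treats them as equal, and an ASCII-only allow-list sidesteps the question. The exact allowed character set in `is_plain_username` is an assumption for illustration, not Twitch's actual rule.

```python
import unicodedata

LATIN_C = "c"            # U+0063
CYRILLIC_ES = "\u0441"   # U+0441, usually rendered identically to Latin "c"


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def is_plain_username(name: str) -> bool:
    """Assumed allow-list: non-empty, ASCII letters, digits, and underscores only."""
    return bool(name) and name.isascii() and all(
        ch.isalnum() or ch == "_" for ch in name
    )


print(describe(LATIN_C))
print(describe(CYRILLIC_ES))
print(LATIN_C == CYRILLIC_ES)
```

Note that Unicode normalization such as NFKC does not fold Cyrillic es into Latin c, so a filter that accepts non-ASCII input would need a separate confusables mapping.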

DethNinja4y ago· 1 in thread

Why not just convert numbers like 1 to i or l then check with a manually created bad word list?

Would regex really be much faster than checking against a list of 1,000 or more bad words?

Also, a bad word list can easily be updated by moderators as well; I really can’t understand the logic behind using so much regex.
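A minimal Python sketch of the "convert digits, then check a list" approach. Because `1` can stand for either `i` or `l`, it expands a name into every candidate spelling before the lookup. The substitution table and the one-word list are hypothetical placeholders.

```python
from itertools import product

# Hypothetical table: each digit maps to every letter it might stand for.
SUBSTITUTIONS = {"1": "il", "0": "o", "3": "e", "4": "a", "5": "s"}

# Hypothetical placeholder list that moderators could edit as plain data.
BAD_WORDS = frozenset({"blocked"})


def expansions(name: str) -> set[str]:
    """Every spelling obtained by replacing each digit with one of its letters."""
    options = [SUBSTITUTIONS.get(ch, ch) for ch in name.lower()]
    return {"".join(combo) for combo in product(*options)}


def contains_bad_word(name: str) -> bool:
    """True if any expanded spelling contains any listed word."""
    return any(
        word in candidate
        for candidate in expansions(name)
        for word in BAD_WORDS
    )
```

Here `expansions("b1t")` gives `{"bit", "blt"}` and `contains_bad_word("b10cked")` is true. The catch is that the number of candidates doubles with every ambiguous character, which is one reason a character class in a pattern can be more convenient than pure list lookup.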

cbsmith4y ago

A bad words list is a regex. ;-)

But note: for the most part this isn't using regexes, and to the extent it does, it seems largely intended to make the maintainers' lives easier by avoiding having to represent (and maintain) all the permutations they are trying to match.

What's sad though is that they're doing many, many passes through the pattern matcher, rather than just building a single big DFA from the whole list of patterns they want to match, which gets traversed in one pass.
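A small Python sketch of the single-pass structure, with one caveat stated up front: Python's `re` module is a backtracking engine rather than a DFA, so this only illustrates merging all patterns into one compiled matcher, not the performance guarantee. A real DFA would need an automaton-based engine (RE2-style) or an Aho-Corasick matcher. The categories and patterns are hypothetical placeholders.

```python
import re

# Hypothetical categories and patterns standing in for the per-function checks.
CATEGORY_PATTERNS = {
    "profanity": r"b[a4]dw[o0]rd",
    "spam": r"free\w*coins",
}

# One alternation of named groups, compiled once, instead of one pass per pattern.
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in CATEGORY_PATTERNS.items())
)


def first_hit(username: str):
    """Return the category of the first match in the username, or None."""
    match = COMBINED.search(username.lower().replace("_", ""))
    return match.lastgroup if match else None
```

`first_hit("xx_B4dword_xx")` returns `"profanity"`. The sketch reports only the first matching category; collecting all of them would need `finditer` or a proper multi-pattern automaton.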

2 more replies

PrincessJessOP4y ago· 1 in thread

Path in the leak is: safety-ml\offensive-usernames\data_pull\sql\bad.sql

Highlights:

    CREATE OR REPLACE FUNCTION is_tragedy (VARCHAR) RETURNS BOOLEAN STABLE AS $$
     SELECT replace($1,'_','') LIKE '%george%floyd%'
    $$ LANGUAGE SQL;

    CREATE OR REPLACE FUNCTION is_derogatory (VARCHAR) RETURNS BOOLEAN STABLE AS $$
     SELECT replace($1,'_','') LIKE '%retard%'
    $$ LANGUAGE SQL;

Now we know we can make as many goergefloyd accounts as we want! *devious grin* *chuckles to self*
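A rough Python model of why the transposed spelling slips through, assuming `%` means "any run of characters" and `_` means "any single character", as in standard SQL `LIKE`. It ignores `ESCAPE` clauses and is not a full reimplementation; `like_to_regex` and `sql_like` are made-up helper names.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a simple LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any run of characters, including none
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def sql_like(value: str, pattern: str) -> bool:
    """LIKE has to match the whole value, hence fullmatch."""
    return like_to_regex(pattern).fullmatch(value) is not None


def is_tragedy(username: str) -> bool:
    """Python rendering of the quoted function body."""
    return sql_like(username.replace("_", ""), "%george%floyd%")
```

`is_tragedy("george_floyd")` is true while `is_tragedy("goergefloyd")` is false: the wildcards allow padding between the two literals but no variation inside them.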

bombcar4y ago

safety-ml - I assume this is some attempt at training a machine learning algorithm to find bad names.

1 more reply

Iv4y ago

Real story: I needed to register an instagram account a few months ago. I tried my nick "Iv", of course it was unavailable. I tried many different alternatives and finally managed to get "Iv but the real Iv" or ivbuttherealiv. Recently I saw my account was banned. I then realized that there was a "butt" in there.

Now I will never have an IG account linked to my FB. Oh well, I can live without that but thank god I don't depend on that platform for business. It made me laugh but there is zero appeal available.

broahmed4y ago

Glad to see the word Niger (as in the country) wasn’t blocked. Ran into this with Venmo a little while ago and they thankfully backpedaled. https://news.ycombinator.com/item?id=24042742

for1nner4y ago

Amusing to see the is_hateful query which used to just be an unreadable mess of regex for validation. Clearly the infra has grown/changed over 10 years, and this is obviously neater, but boy does it LOOK a lot bigger now (and I'm sure they've added yet more new words that try to hurt people...)

cghendrix4y ago

Reminds me of the black words service at apple that checks for explicit or non printer friendly stuff for iPhone engraving. It was hilarious looking at the advanced linguistic engine they developed to filter out asshole and 30+ variations of it with snarky comments added by devs in the past.

JadoJodo4y ago

The movie Gran Torino came out when I was twenty, and in seeing it, I heard many racial slurs uttered in context for the first time. I was, of course, familiar with the "main" ones, but the one in particular that I remember was "spook". I remember it vividly because, upon hearing it, I had an immediate realization that I had heard it before when it was said near the end of Back To The Future (the scene where Marvin Berry and the Starlighters chase off Biff's gang).

In my youthful innocence I had assumed the characters in the movie were trading silly, made-up names (e.g. "Who are you calling "sparky", Popeye?"), not racial epithets.

The fact that this list needs to exist makes me sad, but I am glad technology can assist on some level with the issue.

inagiledev4y ago

I feel sad for Mike Hunt, who can't use his name for a username :)

oxplot4y ago

This is pretty much the same problem as Email spam. There needs to be a service/collaborative project to filter these, instead of each app hacking up an ad-hoc way of doing it.

brokenwren4y ago

We tried to convince Twitch for years that their filters were garbage and they should use CleanSpeak. They kept insisting their engineering team had written the best filter in the world.

Sometimes it’s just not worth the effort to try and help people solve these problems.

wiradikusuma4y ago

There should be an open source or even Official RFC(tm) for this. The use case is very generic.

darepublic4y ago

Interesting to see this. These are like the ten commandments, primitive yes/no rules that reflect the people who came up with them. A world of nuance is missing obviously, not to mention gaps that can be gamed. How do you express ethics in code?

leokennis4y ago

Potential solution for this:

- Have people choose any username they want

- Have an “after the fact” human review system on usernames

- If the username is inappropriate, change it to “smallsausage[0-9]+” without the option of reverting it back or requesting a new user on the same e-mail address.

Fiahil4y ago

So, it only works for English and English-related profanity? Interesting.

qayxc4y ago

A clbuttic [0] solution to the problem.

[0] https://en.wiktionary.org/wiki/clbuttic

amai4y ago

See https://en.m.wikipedia.org/wiki/Scunthorpe_problem

gavinray4y ago

Sucks to be you, if your name is "Kyle", and you just wanted people to watch you play.

Looks like "SeeKylePlay" would trigger an insta-ban for "Sieg-Heil". F.

rzwitserloot4y ago

They included 'mike hawk' (say it out loud...)? If that's on the list, there are a few thousand other auditory joke names that should have been on here.

mokarma4y ago

This SQL file could be an entire AI-based, ML-driven startup to stop hateful / offensive language. Just need a good name like Klean or, NoH8 or something.

dariosalvi784y ago

I love the fact that blasphemy is only an Italian problem

Igelau4y ago

Whoever leaked this is going to get buttbuttinated for sure.

I have a hard time believing this silly mess is an actual component of anything.

OneTimePetes4y ago

It's surprising how good solid best effort parsing is by the human visual system.

sukciD suggiB was here..

dredmorbius4y ago

Prejudice.

Just six seemingly harmless letters arranged in a way to form a word with more power than the pieces of metal which is forged to make swords.

Just a couple of G's, an R and an E, an I and an N....

https://youtube.com/watch?v=KVN_0qvuhhw

jcun41284y ago

Was surprised it's SQL, but I guess it makes sense and is faster to do it there

whomeIamme4y ago

Love how there is nothing in there to prevent white cracker from being said.

crorella4y ago

Why the DB? Why not the client itself? How much data was wasted back and forth between the client and the backend? How many CPU cycles were spent in that poorly optimized SQL statement?

chakintosh4y ago

%nekker%

I pity those who chose that username based on The Witcher's Nekkers.

seany4y ago

I really wish we could get away from enforcing this type of crap

revicon4y ago

I had to google "mike hawk" before I got it.

deepstack4y ago

It would have been better to have the data source in a JSON/CSV format, like an array of regular expressions. SQL is not compatible across different DBs.

vmception4y ago

these are so meme-able


buildsjets4y ago· 5 in thread

mellavora4y ago

Better yet, my children's school had a parent support portal called Isis.

I have some fun emails whose subject line is "here are your ISIS family log-in details"

and this was right around the time the terrorist group was frequently in the news

ggrelet4y ago

Are people not familiar with the Egyptian goddess, though?

stragies4y ago

In 2021, hardly anybody remembers Isis to be the Goddess of Love who invented marriage

mikojan4y ago

Honestly, what a great way to start _that_ conversation haha

alentist4y ago

ISIS was founded in 1999.

The Iraq War began in 2003.

Bush's administration ended in 2009.

ISIS gained power in 2014, 5 years into Obama's administration.

oroul4y ago· 3 in thread

> LIKE '%aggin%'

Looks like "Baggins" is banned.

Poor Bilbo...

CobrastanJorji4y ago

b1124y ago

Sure, but...

The story we know was written by him. I wonder what the trolls would say, or the dead dragon, or the town that his actions helped destroy?

Is he a hero, or, just the guy who wrote it all down?

And just when something really important happens, he throws a powerful weapon (the One Ring) at his nephew and goes away to retire!

A life led with riches (gold from the trolls), a ring granting him extremely long life and health, and yup... off he goes at the first sign of real trouble.

Poor Bilbo indeed!

tjpnz4y ago

So is Mike Hunt.

Poor Mike.

cookiengineer4y ago· 3 in thread

I wonder why they didn't go with syllables and something like a levenshtein distance to syllables?

That way they could more easily maintain similar looking and sounding words, including leetspeak and other variants.

Because with this kind of approach, something like "yolocaust" will get through, as most checks only go with exact matches and permutations will always get around it easily...
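A minimal sketch of the edit-distance half of that idea, in Python. It compares whole strings only and does nothing syllable-aware; the helper name `is_near` and the threshold of 1 are assumptions for illustration, not anything from the leaked file.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic programming, keeping only the previous row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def is_near(username: str, banned: str, max_distance: int = 1) -> bool:
    # Hypothetical rule: flag anything within max_distance edits of a banned word.
    return levenshtein(username.lower(), banned.lower()) <= max_distance
```

With these definitions `is_near("yolocaust", "holocaust")` is true, but so is `is_near("bikes", "likes")`, so a small threshold already catches unrelated words.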

PrincessJessOP4y ago

Agreed. If it's even close to a no no word it should be banned completely. No more riggers, naggers, poggers, biggas, lucks, bunts, minks, bikes, or trikes. This is a doubleplusgood plan.

cookiengineer4y ago

Maybe usernames in general should require a higher entropy than words in a dictionary :D

guywhocodes4y ago

People who hate naggers don't like levenshtein either

trhway4y ago· 3 in thread

mjevans4y ago

English only

Should optimize checks with a de-obfuscation function (attempt to expand non-AZ back to AZ, even if that then shoots permutations at banned word-runs).

It should probably also look more like a spam scoring system, where really obvious stuff is hard-trashed but borderline things are flagged for review / discussion.

I am also very disturbed that, as with most censorship, 'obviously bad' things such as terrorism/etc are co-mingled with 'is adult' as a negative check.
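The de-obfuscation and "flag for review" ideas above could look roughly like this in Python. Everything here is hypothetical: the substitution table is tiny, the word sets are placeholders, and the two tiers stand in for what would be a real score in a spam-style system.

```python
# Hypothetical look-alike table; a real one would be far larger.
LEET = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})

# Placeholder word sets, for illustration only.
HARD_BLOCK = {"badword"}
REVIEW = {"bad"}


def deobfuscate(name: str) -> str:
    # Lowercase, map look-alikes back to letters, then drop anything outside a-z.
    mapped = name.lower().translate(LEET)
    return "".join(ch for ch in mapped if "a" <= ch <= "z")


def classify(name: str) -> str:
    clean = deobfuscate(name)
    if any(word in clean for word in HARD_BLOCK):
        return "reject"   # obvious cases are trashed outright
    if any(word in clean for word in REVIEW):
        return "review"   # borderline cases go to a human
    return "allow"
```

Normalizing once up front means each banned word needs only one entry instead of a pattern per spelling variant.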

wdutch4y ago

It also includes a couple of Italian swearwords in the function `is_blasphemy`

Ichthypresbyter4y ago

And exactly one German one in 'is_profanity'

ancharm4y ago· 3 in thread

Is there a blog or something where someone is going through the dump and summarizing?

djrockstar14y ago

Your best bet atm is to just look through reddit/hn comments/posts people make as they find stuff. The leak's too big for one person/team to quickly find all spicy stuff.

PrincessJessOP4y ago

https://sizeof.cat/post/twitch-leaks/

LinuxBender4y ago· 2 in thread

This script feels like this is something that:

- should be in its own application with its own rules engine so you don't accidentally whack a bunch of usernames

- I would have done in the past and cringe when people ask me to update it.

smoldesu4y ago

I feel bad for the Amazon employee who came into work one day with this project sitting on their desk.

woodruffw4y ago

foota4y ago· 2 in thread

Man, I was _not_ expecting it to be stored procedures lmao.

dperalta4y ago

Well, it is a good idea actually.

cube004y ago

As long as you don't mind a cheeky production release to get new words urgently added once the next bot wave hits.

mdrzn4y ago· 2 in thread

Aw, why can't you be nicknamed 420blazeit?

That's not even offensive and there are still way more weed-nicknames available, so I don't get it.

teddyh4y ago

You can if you’re a magic fire sprite:

http://www.threepanelsoul.com/comic/picked-out

no1lives4ever4y ago

420 == marijuana, blazeit is self-explanatory... not everyone is fine with it.

monksy4y ago· 2 in thread

Oh classic Mike Hunt, poor guy.

TrackerFF4y ago

I see they got Mike Hawk, too. But Mike Litoris is still free to use his/their own name.

tait4y ago

We had a legitimate Michael Hunt at our school. He went by Mike.

ibudiallo4y ago· 2 in thread

I used to work for a company that automated customer service. While we were trialing with a new client, our AI service responded to a customer: "Hey B**, thanks for reaching out."

To our surprise, the end user did not feel offended at all. In fact, they were happy because we responded instantly instead of the usual 24 to 48 hours.

[Story]: https://idiallo.com/blog/do-you-make-your-customers-wait

_hilro4y ago

The word is bitch so the sanitised version would be B**

Why portray the bad word as 3 letters - it doesn't make sense.

OJFord4y ago

Ha, the same thing probably happened to GP as has happened to you - unescaped * characters (escape with `\`) resulting in two imperceptibly italic asterisks.

Hippocrates4y ago· 2 in thread

who-shot-jr4y ago

Hello, what would be a better way to do it? In code? Regex?

Hippocrates4y ago

swman4y ago· 2 in thread

So what is a good practice for this kind of thing? I’ve got an app where users create a display name.

Is there a legit service I can use or some actual well tested library that can help me? I’m using node and Go so either language.

vuciv14y ago

Give me a few weeks

swman4y ago

Are you making such a service?

busymom04y ago· 2 in thread

> is_marijuana

Looks like Twitch doesn't like weed.

anigbrowl4y ago

'Angry parents say gaming site Twitch promotes drug use, slow news day story at 10'

jrodthree244y ago

It's okay. Instead of watching a streamer called 420blazeit, they can watch Amouranth sucking on a microphone.

imnitishng4y ago· 2 in thread

dvh4y ago

Boss: "Hey Bob! People started spamming one of our boards, we are currently busy doing other things and cannot deploy new client, can you make filter with these 3 words and deploy it ASAP?"

Bob: "Sure no problem if it's only temporary"

--few months later

Chief architect:

bni4y ago

It is FUNCTIONS so it is code like any other. Even SQL is code (declarative).

The reason is: This can be updated very easily and out-of-band of other deploys of the main application code.

mgdlbp4y ago· 1 in thread

More high-profile profanity filters open for viewing: edit filters used in Wikimedia projects, which also together encompass many languages.

The main 'bad words' filter of English Wikipedia:

https://en.wikipedia.org/wiki/Special:AbuseFilter/384

The page (on en:wiki) listing all filters, which also have uses other than detecting abuse:

https://en.wikipedia.org/wiki/Special:AbuseFilter

The special page has the same title in other editions of Wikipedia and other Wikimedia wikis, though many filters are set to hidden. Dutch Wikipedia, for example:

https://nl.wikipedia.org/wiki/Special:AbuseFilter/10

cesarb4y ago

A more traditional one, older than these you linked: https://en.wikipedia.org/wiki/MediaWiki:Titleblacklist and https://meta.wikimedia.org/wiki/Title_blacklist

neither_color4y ago· 1 in thread

It seems NVIDIAMINER could be controversial.

iab4y ago

Language friend, this is a family site

NiceWayToDoIT4y ago· 1 in thread

If "fork" and "shirt" are allowed I am fine. :)

You can still use any other language except English to achieve the goal.

appel4y ago

Unexpected "The Good Place" :)

legostormtroopr4y ago· 1 in thread

Poor Bilbo and Frodo - they'll never be allowed to use Twitch:

    CREATE OR REPLACE FUNCTION is_hateful (VARCHAR) RETURNS BOOLEAN STABLE AS $$
     SELECT ... OR replace($1,'_','') LIKE '%aggin%'

jodrellblank4y ago

Or designer of the first commercial microprocessor, Intel's 4004 https://en.wikipedia.org/wiki/Federico_Faggin
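The false positives in this subthread are easy to reproduce. Below is a Python sketch of only the one quoted clause (the rest of `is_hateful` is elided in the quote and is not reconstructed here); the function name `matches_aggin_clause` is made up.

```python
def matches_aggin_clause(username: str) -> bool:
    # Mirrors just: replace($1,'_','') LIKE '%aggin%'
    return "aggin" in username.replace("_", "")
```

Both `bilbo_baggins` and `federicofaggin` match, because a bare substring test has no notion of word boundaries or context.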

aasasd4y ago· 1 in thread

Ah, so that's the Word Filter that Twitch directly mentioned as the majority of their effort to fight hate comments (as cited in the main thread on the leak).

Also, Lisa Pedro is not welcome there.

viraptor4y ago

lionkor4y ago· 1 in thread

So... This ignores lookalike unicode from other languages? Or does SQL know that, for example, `c` (ascii c) looks like `с` (cyrillic s)?

input_sh4y ago

It only allows English characters to begin with. It doesn't even do Latin Extended characters like čćšđ, let alone non-Latin.
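To illustrate why an ASCII-only rule sidesteps the lookalike question, here is a small Python sketch. The exact character set Twitch allows is an assumption here (ASCII letters, digits, underscore), and `is_ascii_username` is a hypothetical helper.

```python
import unicodedata

latin_c = "c"            # U+0063 LATIN SMALL LETTER C
cyrillic_es = "\u0441"   # U+0441 CYRILLIC SMALL LETTER ES, renders like "c"


def is_ascii_username(name: str) -> bool:
    # Assumed rule: only ASCII letters, digits and underscore are accepted.
    return name.isascii() and all(ch.isalnum() or ch == "_" for ch in name)


# The two characters are distinct code points, and NFKC normalization
# does not fold one into the other, so a plain LIKE would not see them as equal.
assert latin_c != cyrillic_es
assert unicodedata.normalize("NFKC", cyrillic_es) == cyrillic_es
```

If non-ASCII names were ever allowed, a filter would need an explicit confusables mapping, since string comparison and standard normalization treat these as different letters.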

DethNinja4y ago· 1 in thread

Why not just convert numbers like 1 to i or l then check with a manually created bad word list?

Would regex really be much faster than checking against a list of 1000 or more bad words?

Also, a bad word list can easily be updated by moderators. I really can’t understand the logic behind using so much regex.

cbsmith4y ago

A bad words list is a regex. ;-)
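Both points can be shown together in a short Python sketch: normalize digits to letters first, then compile a plain word list into one alternation. The list entries and the digit table are placeholders, and `1` is mapped only to `i` here even though it could also stand for `l`.

```python
import re

# Placeholder entries; moderators would maintain the real list.
BAD_WORDS = ["badword", "worseword"]

# Hypothetical, incomplete digit-to-letter table: 0->o, 1->i, 3->e, 4->a, 5->s, 7->t.
DIGITS = str.maketrans("013457", "oieast")

# A word list joined with "|" is already a regex; re.escape keeps entries literal.
PATTERN = re.compile("|".join(re.escape(word) for word in BAD_WORDS))


def is_bad(username: str) -> bool:
    normalized = username.lower().replace("_", "").translate(DIGITS)
    return PATTERN.search(normalized) is not None
```

Keeping the words as data and generating the pattern means the list can be edited without anyone hand-writing regex.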

PrincessJessOP4y ago· 1 in thread

Path in the leak is: safety-ml\offensive-usernames\data_pull\sql\bad.sql

Highlights:

    CREATE OR REPLACE FUNCTION is_tragedy (VARCHAR) RETURNS BOOLEAN STABLE AS $$
     SELECT replace($1,'_','') LIKE '%george%floyd%'
    $$ LANGUAGE SQL;

    CREATE OR REPLACE FUNCTION is_derogatory (VARCHAR) RETURNS BOOLEAN STABLE AS $$
     SELECT replace($1,'_','') LIKE '%retard%'
    $$ LANGUAGE SQL;

Now we know we can make as many goergefloyd accounts as we want! *devious grin* *chuckles to self*

bombcar4y ago

safety-ml - I assume this is some attempt at training a machine learning algorithm to find bad names.

Iv4y ago

Now I will never have an IG account linked to my FB. Oh well, I can live without that but thank god I don't depend on that platform for business. It made me laugh but there is zero appeal available.

broahmed4y ago

Glad to see the word Niger (as in the country) wasn’t blocked. Ran into this with Venmo a little while ago and they thankfully backpedaled. https://news.ycombinator.com/item?id=24042742

for1nner4y ago

cghendrix4y ago

JadoJodo4y ago

In my youthful innocence I had assumed the characters in the movie were trading silly, made-up names (e.g. "Who are you calling "sparky", Popeye?"), not racial epithets.

The fact that this list needs to exist makes me sad, but I am glad technology can assist on some level with the issue.

inagiledev4y ago

I feel sad for Mike Hunt, who can't use his name for a username :)

oxplot4y ago

This is pretty much the same problem as Email spam. There needs to be a service/collaborative project to filter these, instead of each app hacking up an ad-hoc way of doing it.

brokenwren4y ago

We tried to convince Twitch for years that their filters were garbage and they should use CleanSpeak. They kept insisting their engineering team had written the best filter in the world.

Sometimes it’s just not worth the effort to try and help people solve these problems.

wiradikusuma4y ago

There should be an open source or even Official RFC(tm) for this. The use case is very generic.

darepublic4y ago

leokennis4y ago

Potential solution for this:

- Have people choose any username they want

- Have an “after the fact” human review system on usernames

- If the username is inappropriate, change it to “smallsausage[0-9]+” without the option of reverting it back or requesting a new user on the same e-mail address.

Fiahil4y ago

So, it's only working for English and English-related profanity? Interesting.

qayxc4y ago

A clbuttic [0] solution to the problem.

[0] https://en.wiktionary.org/wiki/clbuttic

amai4y ago

See https://en.m.wikipedia.org/wiki/Scunthorpe_problem

gavinray4y ago

Sucks to be you, if your name is "Kyle", and you just wanted people to watch you play.

Looks like "SeeKylePlay" would trigger an insta-ban for "Sieg-Heil". F.

rzwitserloot4y ago

They included 'mike hawk' (say it out loud...)? If that's on the list, there are a few thousand other auditory joke names that should have been on here.

mokarma4y ago

This SQL file could be an entire AI-based, ML-driven startup to stop hateful / offensive language. Just need a good name like Klean or, NoH8 or something.

dariosalvi784y ago

I love the fact that blasphemy is only an Italian problem

Igelau4y ago

Whoever leaked this is going to get buttbuttinated for sure.

I have a hard time believing this silly mess is an actual component of anything.

OneTimePetes4y ago

It's surprising how good solid best-effort parsing is by the human visual system.

sukciD suggiB was here..

dredmorbius4y ago

Prejudice.

Just six seemingly harmless letters arranged in a way to form a word with more power than the pieces of metal which is forged to make swords.

Just a couple of G's, an R and an E, an I and an N....

https://youtube.com/watch?v=KVN_0qvuhhw

jcun41284y ago

Was surprised it's SQL, but I guess it makes sense and is faster to do it there.

whomeIamme4y ago

Love how there is nothing in there to prevent white cracker from being said.

crorella4y ago

Why the DB? Why not the client itself? How much data was wasted going back and forth between the client and the backend? How many CPU cycles were spent on that poorly optimized SQL statement?

chakintosh4y ago

%nekker%

I pity those who chose that username based on The Witcher's Nekkers.

seany4y ago

I really wish we could get away from enforcing this type of crap

revicon4y ago

I had to google "mike hawk" before I got it.

deepstack4y ago

It would have been better to have the data source in a JSON/CSV format, like an array of regular expressions. SQL is not compatible across different DBs.

vmception4y ago

these are so meme-able
