I would have said "why bother" until this happened to us.
A customer rang us up in a fury because some demo/random data that we generated happened to have the word "penis" in it. He was convinced we must have put it there because we thought he was a cock. It was very difficult to defuse the situation.
Aah, the good ole "one customer is unhappy, let's waste a week of time on this" approach to IT management. Takes guts to tell such customers "here's your refund, now piss off", but it is the right thing to do.
It's a simple solution. Sure, it is still possible for something to slip through that looks similar to something bad. But the potential to strongly offend is greatly reduced.
Also note that if you're too naive about checking for 'naughty' words, you get https://en.wikipedia.org/wiki/Scunthorpe_problem
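To make the Scunthorpe problem concrete, here is a minimal Python sketch contrasting a naive substring check with a whole-word check. The one-word blocklist and both function names are made up for illustration; this is not anyone's production filter.

```python
import re

# Hypothetical one-word blocklist, purely for illustration.
BLOCKED = ["ass"]


def naive_hit(text: str) -> bool:
    """Substring match: flags any text that merely contains a blocked word."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKED)


def whole_word_hit(text: str) -> bool:
    """Match only when a blocked word stands alone between word boundaries."""
    pattern = r"\b(?:" + "|".join(map(re.escape, BLOCKED)) + r")\b"
    return re.search(pattern, text.lower()) is not None
```

The naive check flags "classic assessment" while the whole-word check does not. The trade-off is that whole-word matching also lets through a username like "bigass123", where there is no word boundary to anchor on, so neither approach is a complete answer for random strings or usernames.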
If the string is a URL, imagine sending https://somesite/wanker to your client, when it actually could also be https://somesite/ay3ugd
It worked surprisingly well when we used it.
And this, ladies and gentlemen, is what it would show BEFORE the filter... but after (runs the code again, and prays it works) ... NO PROFANITY!
"Had we not done this work, that link would have been sent out to one of our users." was very well received.
Our main concern was whether we needed to increase the size to 26 to account for the loss of keys. After doing the math, a 25 digit random string has a ~5% chance of containing one of 150 three or four character inappropriate substrings. That 5% loss isn't that big of a deal. But we had to figure out the math as part of due diligence before shipping.
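A figure like that ~5% can be sanity-checked with a union bound: add up, over every blocked word, the number of positions it could start at divided by the number of possible strings of that length. The sketch below is in Python. The alphabet size (36, i.e. a-z plus 0-9) and the split of the 150 words into 100 three-character and 50 four-character entries are assumptions picked for illustration; the comment above does not state either, and other choices give noticeably different numbers.

```python
def hit_probability(length: int, alphabet_size: int, words_by_len: dict) -> float:
    """Union-bound estimate of P(a uniform random string contains a blocked word).

    words_by_len maps word length -> how many blocked words have that length.
    The result is the expected number of matches, which is an upper bound on
    the probability of at least one match and close to it when it is small.
    """
    expected = 0.0
    for word_len, count in words_by_len.items():
        positions = length - word_len + 1  # possible start offsets
        expected += count * positions / alphabet_size ** word_len
    return expected


# Assumed inputs: 25 characters, 36-symbol alphabet, 100 + 50 blocked words.
estimate = hit_probability(25, 36, {3: 100, 4: 50})
```

With these assumed inputs the estimate comes out at roughly 0.05, in the same ballpark as the figure quoted above. Almost all of it comes from the three-character words, which is why short entries on a blocklist are the expensive ones.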
I worked on the application form processing for the Nectar card launch in the UK back in the early 2000s, and we had several cases. Luckily we had human data-entry clerks in the loop, so all we had to do was flag when a name contained anything on the "Scunthorpe list" and get a human to look at it. Even then it wasn't perfect and a few slipped through. One of the early PR messes of the launch was someone getting a Nectar card issued with a rude name, and of course they immediately went to the press with it. [0]
I'm interested to see they still haven't solved this [1]
[0] I never saw the journalistic interest in this: this guy said his name was <rude name>, and then signed the form to say everything on the form was true. Why is anyone surprised or interested that we accepted his name was what he said it was and gave him a card in that name?
[1] https://metro.co.uk/2015/02/20/woman-refused-sainsburys-nect...
I'm not so sure that has ever been true. Emotional bullying was always a thing, and it did hurt. I'm glad that we're now taking that more seriously.
But it's also true that my emotions are my responsibility, and my emotional reaction to words is unique to me. I cannot force everyone else to be responsible for how I feel about their words. And I cannot expect everyone else to anticipate how I might feel about what they want to say and modify their speech accordingly.
There's a balance in there. I suspect that this balance is what we call "good manners" or "politeness".
Even when the insults could be considered objectively offensive I'm still generally okay with it. I've been called names quite a bit in the workplace over the years and I'm okay with it if it makes them happy. Being autistic I don't really care, and neurotypical people seem to get enjoyment from calling people names, so it's win-win.
High EQ individuals have told me this is wrong and the appropriate reaction would be to take offence and try to make them feel bad, but I've argued this would just result in an objective reduction in happiness in the world so it wouldn't make sense. Plus, most of the time my colleagues have been nice to my face so it's not like it's ever got in the way of me doing my job.
Words are not just symbols on paper, they are image triggers in your brain.
"Sticks and stones may break my bones, but words will never hurt me" is something told to children in the playground who are too young to know better.
You might as well argue that there's no such thing as malware, because it's entirely harmless until executed on a processor.
I mean I don't like the (self) censorship either and don't mind rude or mature words, since we're all mature here and calling a fucking asshole a f*ing a$$h0le is an insult to people's intellect.
There's been a running joke on the UK Reddit subs recently about people getting short bans for using the term "faggot" as it's now on some automatic, site-wide blocklist. The bans are completely context free, so people are supposedly getting banned for discussing the food product, a kind of large meat-ball made primarily from minced offal. The same happens with "fag", meaning a cigarette.
I wonder, is it too much to ask that we (as in, the various English speaking places) understand the different meanings of words that are offensive in one dialect but have a different and mundane meaning in another dialect?
Also, I don't see any sense in actually enforcing any of those lists. I always think about the "Journey of Life" in The Grand Tour, which, frankly, is completely inoffensive in any language other than English: https://www.youtube.com/watch?v=BLXe2WTYngQ
Old math prof of mine was named Dr. Cock, and he wasn't the only one, the name was common. Students called him Dr. Octocock.
Georgyo has orgy right in the name. More and more places are refusing to accept it when signing up.
Fwiw, it's well-known as such, after relatively high-profile incidents involving people with that address: https://en.wikipedia.org/wiki/Scunthorpe_problem
You have a customer who gives her name as "Fanny Batter". Is that a permissible name?
What about "Juan Kerr"? Or "Amanda Huggenkiss"?
It's not about replacing "ass" with "butt", but detecting rudeness/humour in context.
I'd be amazed if machine learning can solve this one. It's extremely hard for humans.
* all those names should be rejected btw
It does remind me of XKEYSCORE (from the Snowden leaks), which used keywords to bubble up potential threats from emails etc.: https://www.businessinsider.com/nsa-prism-keywords-for-domes...
I see the same here. It's not clever, but no one has any doubt what words are being checked.
The engineer would write something for every test case the product manager complained about, anything else computationally easy, and call it a day.
I once had to implement an audit logging system. What was supposed to be logged? "Important actions." Nobody on the team could define it. We just logged every database write along with the username responsible and called it a day. Nobody ever followed up or inspected it.
Same deal. Both exist mostly for compliance.
It used to be "if I search for this term, am I accidentally going to wind up getting goatse or something?" The good old days.
Now it's "if I search for this term, is the FBI going to kick my door in?"
It's not a list of words used by the NSA or any spies. https://attrition.org/misc/keywords.html
$Username=<username>
$Password=<password>
Connection.string=($Username, $Password)
The parser would flag this - Password was being stored to a variable! So we just changed our code:
$pw=<Password>. Problem solved!
This must be more nuanced. Maybe, it's "additional algo processing when a word is hit", eg another layer before "involve human".
Of all the ways I expected a talking banana to backfire, I didn't expect this one. Thanks for sharing
That sums up the WoW Classic community (and I'm sure many other gaming communities) a little too perfectly.
CREATE OR REPLACE FUNCTION is_blasphemy (VARCHAR) RETURNS BOOLEAN STABLE AS $$
SELECT replace($1,'_','') SIMILAR TO '%p(o|0)rc(o|0)di(o|0)%'
OR replace($1,'_','') SIMILAR TO '%p(o|0)rc(o|0)mad(o|0)nna%'
$$ LANGUAGE SQL;
c) "Porco dio saranno mica i testimoni di Geova? No eh diocan digli che i signori sono fuori, non ho tempo per stargli dietro." ("Fuck, they can't be Jehovah's witnesses, can they? Tell them we're out, we don't have time for their shit.")
This happens all the time in cartoons that appeal to children but also contain subtle adult jokes so everyone can enjoy them on different levels.
What worked in the end was having any newly created thread send a message containing the post title & body to a slack channel specifically for monitoring the forums. Employees and our forum moderators were in there, and any bad threads were nearly instantly deleted. Eventually the spammers mostly gave up. Hard to beat a dozen human brains :)
Having a dozen reviewers, especially if spread across timezones, would be a dream!
I have some fun emails whose subject line is "here are your ISIS family log-in details"
and this was right around the time the terrorist group was frequently in the news
The Iraq War began in 2003.
Bush's administration ended in 2009.
ISIS gained power in 2014, 5 years into Obama's administration.
Looks like "Baggins" is banned.
Poor Bilbo...
The story we know was written by him. I wonder what the trolls would say, or the dead dragon, or the town that his actions helped destroy?
Is he a hero, or, just the guy who wrote it all down?
And just when something really important happens, he throws a powerful weapon (the One Ring) at his nephew and goes away to retire!
A life led with riches (gold from the trolls), a ring granting him extremely long life and health, and yup... off he goes, at the first sign of real trouble.
Poor Bilbo indeed!
Poor Mike.
That way they could more easily maintain similar looking and sounding words, including leetspeak and other variants.
Because with this kind of approach, something like "yolocaust" will get through, as most checks only go with exact matches and permutations will always get around it easily...
Whereas with something like a Levenshtein distance you could compare it with a set of words and syllables, and if it looks too similar, e.g. the edit distance is so small relative to the username length that the two are about 90% the same, you could simply block it.
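A minimal Python sketch of that idea, using a plain dynamic-programming Levenshtein distance normalised by the longer string's length. The function names and the 0.85 threshold are assumptions made for the example, not anything the filter under discussion is known to do.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """1.0 means identical, 0.0 means nothing in common (by edit distance)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def too_similar(name: str, blocked: list, threshold: float = 0.85) -> bool:
    """True if the whole name is within `threshold` similarity of a blocked word."""
    lowered = name.lower()
    return any(similarity(lowered, word) >= threshold for word in blocked)
```

"yolocaust" is one substitution away from "holocaust", which works out to a similarity of about 0.89, so it is caught at 0.85 but would slip past a strict 0.90 cut-off. Note that this compares whole strings, so a blocked word buried inside a much longer username would need a sliding-window variant, and a lower threshold quickly brings the Scunthorpe problem back.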
Should optimize checks with a de-obfuscation function (attempt to expand non-AZ back to AZ, even if that then shoots permutations at banned word-runs).
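A de-obfuscation pass along those lines might look like the Python sketch below. The substitution table is a small, hypothetical sample, and it maps each symbol to a single letter; a fuller version would have to branch on ambiguous symbols (e.g. "1" as either "i" or "l"), which is where the permutations mentioned above come from.

```python
# Hypothetical leetspeak table: one letter per symbol, for illustration only.
LEET = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})


def deobfuscate(name: str) -> str:
    """Lower-case, undo common substitutions, then drop anything outside a-z."""
    mapped = name.lower().translate(LEET)
    return "".join(ch for ch in mapped if "a" <= ch <= "z")
```

Running the blocklist against `deobfuscate(name)` instead of the raw name means the list only has to carry the plain spelling: "a$$h0le" and "A_s_s_h_o_l_e" both normalise to the same string.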
It should probably also look more like a spam scoring system, where really obvious stuff is hard-trashed but borderline things are flagged for review / discussion.
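As a rough Python sketch of that scoring idea: each term carries a weight, and the total decides between allowing, queueing for human review, and rejecting outright. The weights, thresholds and names here are all invented for the example.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    REVIEW = "review"   # borderline: send to a human
    REJECT = "reject"   # obvious: hard-trash


# Hypothetical weights: higher means more clearly unacceptable.
WEIGHTS = {"retard": 10, "orgy": 4, "weed": 3}


def score(name: str) -> int:
    """Sum the weights of every listed term that appears in the name."""
    lowered = name.lower()
    return sum(weight for term, weight in WEIGHTS.items() if term in lowered)


def classify(name: str, review_at: int = 3, reject_at: int = 10) -> Verdict:
    """Map a score onto allow / review / reject using two thresholds."""
    total = score(name)
    if total >= reject_at:
        return Verdict.REJECT
    if total >= review_at:
        return Verdict.REVIEW
    return Verdict.ALLOW
```

Under these made-up weights a name like "georgyo" lands in the review queue rather than being rejected outright, which is the point of scoring over a yes/no match.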
I am also very disturbed that, as with most censorship, 'obviously bad' things such as terrorism/etc are co-mingled with 'is adult' as a negative check.
It seems reasonable for Twitch to have validated 'safe for minors' areas where names are filtered. Generic areas, where things are in the gray area and unchecked. Adult Only areas, where swears, profanity, maybe even some of the hateful things are allowed. Informed consumer choice.
- should be in its own application with its own rules engine so you don't accidentally whack a bunch of usernames
- I would have done in the past and cringe when people ask me to update it.
That's not even offensive and there are still way more weed-nicknames available, so I don't get it.
To our surprise, the end user did not feel offended at all. In fact, they were happy because we responded instantly instead of the usual 24 to 48 hours.
[Story]: https://idiallo.com/blog/do-you-make-your-customers-wait
Is there a legit service I can use or some actual well tested library that can help me? I’m using node and Go so either language.
Looks like Twitch doesn't like weed.
Bob: "Sure no problem if it's only temporary"
--few months later
Chief architect:
"People are spamming more and more, we should design new system for these 234 new bad words, I need team of 7, two backend guys, 5 frontend guys and 4 weeks. It will also require minor rewrite of few external components."
Boss: "geez we're in the middle of sprint right now, Bob can you add these 234 words to existing filter? Make sure it's in production before lunch, thanks" (checks watches) "I have to go now, meeting with customer, bye".
The reason is: This can be updated very easily and out-of-band of other deploys of the main application code.
To make this work it has to be super agile to update. Probably this version we are looking at is an old version that happened to be put in git. The functions in prod probably have several additions since.
The main 'bad words' filter of English Wikipedia:
https://en.wikipedia.org/wiki/Special:AbuseFilter/384
The page (on en:wiki) listing all filters, which also have uses other than detecting abuse:
https://en.wikipedia.org/wiki/Special:AbuseFilter
The special page has the same title in other editions of Wikipedia and other Wikimedia wikis, though many filters are set to hidden. Dutch Wikipedia, for example:
You can still use any other language except English to achieve the goal.
CREATE OR REPLACE FUNCTION is_hateful (VARCHAR) RETURNS BOOLEAN STABLE AS $$ SELECT ... OR replace($1,'_','') LIKE '%aggin%'
Also, Lisa Pedro is not welcome there.
Would regex really be much faster than checking it against a list of 1000 or more bad words?
Also bad word list can easily get updated by moderators as well, I really can’t understand the logic behind using so much regex.
But note: for the most part this isn't using regexes, and to the extent it does, it seems largely intended to make the maintainers' lives easier by avoiding having to represent (and maintain) all the permutations they are trying to match for.
What's sad though is that they're doing many, many passes through the pattern matcher, rather than just building a single big DFA from the whole list of patterns they want to match, which gets traversed in one pass.
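One well-known way to get that single pass is an Aho-Corasick automaton: build a trie of all the patterns, add failure links, then walk the input once. The Python sketch below is a compact version of that technique, swapped in for the "single big DFA" idea; it handles plain substrings only, so patterns with gaps such as '%george%floyd%' would need extra handling. The function names are assumptions for the example.

```python
from collections import deque


def build_automaton(words):
    """Build trie transitions, failure links and per-state outputs."""
    goto = [{}]     # goto[state][char] -> next state
    fail = [0]      # fail[state] -> state to fall back to on a mismatch
    out = [set()]   # out[state] -> words that end at this state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)
    # Breadth-first pass: depth-1 states keep fail = 0, deeper ones are derived.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]  # inherit matches ending at the suffix
    return goto, fail, out


def find_all(text, automaton):
    """Return every listed word occurring in text, scanning it exactly once."""
    goto, fail, out = automaton
    state = 0
    hits = set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits |= out[state]
    return hits


automaton = build_automaton(["retard", "aggin", "orgy"])
```

With that automaton, `find_all("georgyo", automaton)` reports "orgy" and `find_all("baggins", automaton)` reports "aggin", each after a single left-to-right scan regardless of how many words are on the list.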
Highlights:
CREATE OR REPLACE FUNCTION is_tragedy (VARCHAR) RETURNS BOOLEAN STABLE AS $$
SELECT replace($1,'_','') LIKE '%george%floyd%'
$$ LANGUAGE SQL;
CREATE OR REPLACE FUNCTION is_derogatory (VARCHAR) RETURNS BOOLEAN STABLE AS $$
SELECT replace($1,'_','') LIKE '%retard%'
$$ LANGUAGE SQL;
Now we know we can make as many goergefloyd accounts as we want! *devious grin* *chuckles to self*
Now I will never have an IG account linked to my FB. Oh well, I can live without that but thank god I don't depend on that platform for business. It made me laugh but there is zero appeal available.
In my youthful innocence I had assumed the characters in the movie were trading silly, made-up names (e.g. "Who are you calling "sparky", Popeye?"), not racial epithets.
The fact that this list needs to exist makes me sad, but I am glad technology can assist on some level with the issue.
Sometimes it’s just not worth the effort to try and help people solve these problems.
- Have people choose any username they want
- Have an “after the fact” human review system on usernames
- If the username is inappropriate, change it to “smallsausage[0-9]+” without the option of reverting it back or requesting a new user on the same e-mail address.
Looks like "SeeKylePlay" would trigger an insta-ban for "Sieg-Heil". F.
I have a hard time believing this silly mess is an actual component of anything.
sukciD suggiB was here..
Just six seemingly harmless letters arranged in a way to form a word with more power than the pieces of metal that are forged to make swords.
Just a couple of G's, an R and an E, an I and an N....
I pity those who chose that username based on The Witcher's Nekkers.