Perhaps "gay" is not a dirty word? (It is included in your dirty-word list, but gay people would think otherwise.)
A lot of people use the term "gay" in conversation as a synonym for "that sucks"; a friend of mine does it all the time. I don't think they mean anything by it.
To differentiate between "I am gay," and "Oh that's gay. I'm sorry that happened," you'd need an NLP system with a politeness preference.
This is a very naive implementation to quickly get a handle on the number of porny documents. I intend to do some more work on clustering porny words. I think understanding sentiment would be hard and would need a lot of labeled data, but it is potentially a very useful project.
// Slut.
Which is Danish for "the end".
I didn't look at the implementation, but judging by the "classy party" example, it looks like it simply matches the byte sequence 'a', 's', 's' anywhere in a string.
It would be better if it tokenized the sentence using punctuation and whitespace as terminators. That way it would detect `big-ass sandwich` and `smart-ass person` but not `classy party` or `bass instrument`.
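To make the difference concrete, here is a rough Python sketch of both approaches. I haven't read the actual implementation, so `substring_match` is only my guess at its current behaviour, and the one-word `BLOCKLIST` and both function names are made up for illustration.

```python
import re

# Hypothetical one-word blocklist, for illustration only.
BLOCKLIST = {"ass"}

def substring_match(text, words=BLOCKLIST):
    """Naive check: flags the text if a listed word appears anywhere in it."""
    lowered = text.lower()
    return any(word in lowered for word in words)

def token_match(text, words=BLOCKLIST):
    """Split on anything that is not a letter or digit, then compare whole tokens."""
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return any(token in words for token in tokens)
```

With this, `token_match` flags `big-ass sandwich` and `smart-ass person` because the hyphen splits off `ass` as its own token, while `classy party` and `bass instrument` pass. `substring_match` flags all four.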
Furthermore, it would be cool if you created a configuration format for this kind of thing, so one could do something like this (excuse the config format, I realise it's probably shit and problematic):
[smart][big][fat]ass
!sex[ual]+education
which would detect all of the following: smartass, bigass, fatass, and ass itself. The second rule would not filter a `sex(?:ual)?` token followed by an `education` token. You get the idea.
These are just some heat-of-the-moment ideas, because I think this is exciting and could be useful. :-)
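For what it's worth, the two rules above can be approximated with plain regular expressions over tokens, without writing a parser for the config format. This is only a sketch of one possible reading of the rules: the names `FLAG_TOKEN`, `BLOCKED`, `EXEMPT_PAIR` and `flagged_tokens` are made up, and I'm assuming `sex`/`sexual` would otherwise be on the blocklist, since the exemption is pointless if they aren't.

```python
import re

# Rule 1, "[smart][big][fat]ass": an optional prefix followed by "ass",
# matched against a whole token.
FLAG_TOKEN = re.compile(r"(?:smart|big|fat)?ass")

# Assumption: "sex" / "sexual" are blocked unless an exemption applies.
BLOCKED = re.compile(r"sex(?:ual)?")

# Rule 2, "!sex[ual]+education": this pair of adjacent tokens is exempt.
EXEMPT_PAIR = (re.compile(r"sex(?:ual)?"), re.compile(r"education"))

def flagged_tokens(text):
    """Return the tokens in `text` that the rules above would flag."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]
    hits = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        # Skip a token that starts an exempt pair.
        if EXEMPT_PAIR[0].fullmatch(tok) and EXEMPT_PAIR[1].fullmatch(nxt):
            continue
        if FLAG_TOKEN.fullmatch(tok) or BLOCKED.fullmatch(tok):
            hits.append(tok)
    return hits
```

Using `fullmatch` on whole tokens is what keeps `classy` from tripping the first rule. A real config format would compile each rule line into patterns like these rather than hard-coding them.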
I don't have the time/RoI to improve this, but one potential idea is to use labeled data to cluster porny words and get a probabilistic metric of the porni-ness of a sentence.
The full list is here: http://pastebin.com/raw.php?i=1Pv4v8j7
It contains such gems as "cockburger", "penispuffer", and -- the pièce de résistance -- "twatwaffleunclefucker".