thaumasiotes 12y ago

I had to recognize English for a cryptography project in college. My entire strategy, which the professor suggested, was to reject non-English text (it turns out recognizing non-English is easier than recognizing English). It works well:

1. Parse the full text of some long book available on Project Gutenberg. Record every trigram that occurs. Frequency is irrelevant; we just want to know whether a trigram occurs or not.

2. Go through your text, counting the number of trigrams that didn't exist in the sample text. If the number of strange trigrams exceeds a threshold (for my project I used a threshold of 3, but you can tune this), reject the text as non-English.
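A minimal sketch of the two steps above, under some assumptions the comment does not spell out: trigrams are taken to be character trigrams (not word trigrams), text is lowercased with whitespace collapsed, and every occurrence of an unseen trigram counts toward the threshold. The function names are made up for illustration.

```python
def normalize(text):
    # Assumption: lowercase and collapse runs of whitespace to single spaces.
    return " ".join(text.lower().split())


def build_known_trigrams(sample_text):
    # Step 1: record which character trigrams occur; frequency is ignored.
    s = normalize(sample_text)
    return {s[i:i + 3] for i in range(len(s) - 2)}


def count_strange_trigrams(text, known):
    # Step 2: count trigram occurrences that never appeared in the sample.
    s = normalize(text)
    return sum(1 for i in range(len(s) - 2) if s[i:i + 3] not in known)


def looks_english(text, known, threshold=3):
    # Reject once the number of strange trigrams exceeds the threshold.
    return count_strange_trigrams(text, known) <= threshold
```

In practice `build_known_trigrams` would be fed the full text of a long book, and the threshold would probably need tuning against the length of the texts being tested.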

Given the constant finite threshold I used, that does amount to a regex, but I don't recommend trying to write it out explicitly.
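To illustrate why a fixed threshold makes this a regular language: the check only needs to remember the last two characters and a strange-trigram count capped just above the threshold, so over a fixed alphabet the number of states is finite. The sketch below is one hypothetical way to write that as a state machine; the names are invented, trigrams are assumed to be character trigrams, and no text normalization is done.

```python
def make_step(known, threshold=3):
    # State is (last_two_chars, strange_count). The count is capped at
    # threshold + 1, so the state space is finite for a fixed alphabet.
    cap = threshold + 1

    def step(state, char):
        last_two, strange = state
        window = last_two + char
        if len(window) == 3 and window not in known and strange < cap:
            strange += 1
        return (window[-2:], strange)

    return step


def accepts(text, known, threshold=3):
    # Run the machine over the text one character at a time.
    step = make_step(known, threshold)
    state = ("", 0)
    for char in text:
        state = step(state, char)
    # Accepting states are those whose count has not exceeded the threshold.
    return state[1] <= threshold
```

Writing the equivalent regex out by hand would mean enumerating every one of those states, which is why doing so explicitly is impractical.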

pbhjpbhj 12y ago

How effective was your algo?

Presumably it would produce false negatives on texts with neologisms, spelling errors, and the like.

There must be trigrams that simply don't appear in dictionary English? Do you have results posted somewhere?