Ask HN: Algorithms for text fingerprinting?

92 points · vixsomnis · 11y ago · 41 comments

I remember reading an article a year or so ago about (the NSA) identifying users based on how they write: vocabulary, spelling mistakes, grammar, dialect, and so on.

This is interesting to me because it is extremely difficult to change the vocabulary I use in writing and speaking. Being able to estimate the amount of similarity between two pieces of text would be useful.

The closest I can think of right now would be the proprietary algorithms used to check for plagiarism (for schools and universities, for instance).

Are there any publicly available algorithms for this? Where can I go to learn more? (Academic journals?) Am I just DDGing the wrong search terms?

moyix11y ago· 6 in thread

The relevant search term is "stylometry". One particular paper I remember is from Dawn Song's group at Berkeley a couple years back:

http://www.cs.berkeley.edu/~dawnsong/papers/2012%20On%20the%...

There's a lot of public work on the topic, but it looks like right now the best place to look is still in academic papers (I don't know of any open source libraries, for example).

Osmium11y ago

I'm not sure whether it's appropriate to mention this or not, but I was just looking up the authors of that paper to see what they're doing now. Very sadly, I found out that one of the authors, Emil Stefanov, died last year at the age of 26.

http://www.rememberingemil.org

functional_test11y ago

Thank you for posting this (it was quite a surprise to see his name here).

I was a close friend of Emil for a long time. Before he passed, I had gotten busy with work and hadn't spoken to him in a while. I saw some of our mutual friends the weekend before it happened and had planned to call him. Really wish I had made that call sooner.

Intermernet11y ago

JGAAP [1] may be useful (AGPLv3 license):

"JGAAP is intended to tackle two different problems, firstly to allow people unfamiliar with machine learning and quantitative analysis the ability to use cutting edge techniques on their text based stylometry / textometry problems, and secondly to act as a framework for testing and comparing the effectiveness of different analytic techniques' performance on text analysis quickly and easily."

[1]: http://evllabs.com/jgaap/w/

Looks like there are some other recommendations at http://evllabs.com/jgaap/w/index.php/FAQ#What_other_tools_ar...

vixsomnisOP11y ago

That paper is exactly the sort I was looking for. Thanks.

I'm surprised there isn't any open source effort in this area yet (I couldn't find any either). It's just as important as Tor and other anonymity services, since it affects not just passively consuming information, but actively creating it.

moyix11y ago

If you're interested in defeating stylometry, you should check out Sadia Afroz's PhD thesis as well:

https://www.eecs.berkeley.edu/~sa499/thesis.pdf

nl11y ago

This is a good search to start from: https://scholar.google.com.au/scholar?as_ylo=2015&q=stylomet...

amazing_jose11y ago· 6 in thread

The results could be horrible, but I can imagine a simple technique for hiding all those clues. Just send the text to Google Translate, translate it to an intermediate language and then back to the original one. I can guarantee an excellent rinse and clean. Change the intermediate language and you will change the features of the final text. Of course, you risk horrible semantic changes in the final text ;)

UPDATE: fix typos.

ShaneWilton11y ago

There's a great paper that investigated this technique: https://www.eecs.berkeley.edu/~sa499/papers/adversarial_styl...

From the conclusions: "Translation with widely available machine translation services does not appear to be a viable mode of circumvention. Our evaluation did not demonstrate sufficient anonymization and the translated document has, at best, questionable grammar and quality."

DennisP11y ago

Has there been any other work on anti-fingerprinting? Seems like just taking the fingerprinting algorithms and applying transforms that randomize the features they look for would go a long way.

vixsomnisOP11y ago

Took only a minute to try:

English -> Filipino (Tagalog) -> Chinese-simplified (Mandarin?) -> English

    I remember reading an article about a year ago (NSA) to identify the
    user, based on how they are written, vocabulary, spelling errors,
    grammar, language, and so on.

    It is interesting to me, because it is difficult to change the written
    and spoken word in use. It can be estimated that there are between two
    characters similar amount of help.

    Recently I can think of now is to check plagiarism (used in schools and
    universities, for example) is proprietary algorithm.

    Are there any public this algorithm? I can find out more information?
    (Academic journals?) I just DDGing wrong search terms?

I have to say, this is a great idea. There was some information lost in transit, but most of my thoughts came through (albeit broken). It's probably worse since I used multiple intermediaries and Mandarin doesn't map onto English (or vice-versa) in grammar or vocabulary.

edit: A site for this exists. http://ackuna.com/badtranslator

kngspook11y ago

Google Translate almost holds it together...until the very end...

Original:

  In the beginning God created the heaven and the earth. And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.

[a half dozen languages later...]

Result:

  God created heaven and earth in the beginning. And the earth was formless and empty, and darkness was on the face of the abyss, man. Answer the Spirit of God on the water surface.

tokenizerrr11y ago

Of course you're sending all this information to Google now. Are there any offline translators that are advanced enough to be used for something like this? I imagine most just naively map words 1:1 which wouldn't do much good here.

nl11y ago

Google translate to Tagalog is generally not great.

If you chain through European languages you barely lose anything. Doing English->Dutch->German->English is a good set to use.

j4211y ago· 2 in thread

Figured I'd chime in here since I developed an algorithm recently that could be applied to this problem with some basic ML.

Basically the first step would be shingling the text (choosing a sampling domain) and generating a MinHash struct (computationally cheap) which can then be used to find the "similarity" between sets, or, the "Jaccard Index."

If you're clever about this, you can use HyperLogLogs to encode these MinHash structs gaining a great deal of speed with a marginal error rate, all while allowing for arbitrary N-levels of intersection.

If you're looking to build a model to analyze two (or N) text bodies for stylometric similarities, I'd approach the problem in two steps:

1) Minimize the relevant input text.

- Use a Bernoulli/categorical distribution to weight words according to uniqueness--NLP and sentiment extraction techniques may also help

- Design a Markov process to represent more complex phrasing patterns for the text as a whole

- Filter by a variable threshold to minimize the resulting set of shingles/bins/"interesting nodes" into a computationally-manageable #

2) Use an efficient MinHash intersection to compute a similarity score (0-1) for the two texts.

I think given the prevalence of training data (I mean, what's more ubiquitous than the written word...) you could probably tune this to a reasonable accuracy and efficient complexity.

Just a 5m thought exercise, but if anyone else has ideas I'd be curious as well :)

patientfrog11y ago

Could you describe in a little more detail what you mean by this sentence:

> "... use HyperLogLogs to encode these MinHash structs gaining a great deal of speed with a marginal error rate, all while allowing for arbitrary N-levels of intersection"

Thanks

j4211y ago

Sure thing.

Let's say we have two sets, and we're trying to find out how similar they are:

    setOne = ['the','brown','fox','jumped','a','log']
    setTwo = ['the','quick','brown','log','jumped','over','the','fox']

You could use an array intersection when it's small, but if you want to do this efficiently at scale you need to take advantage of probabilistic data structures.

Let's say you created a HyperLogLog for each set:

    setOne = (bitfield representing setOne)
    setTwo = (bitfield representing setTwo)

Now HyperLogLogs are cool, because you can merge them together w/o losing anything: the merged sketch is exactly the sketch of the union. You can't retrieve the data, and (unlike a Bloom filter) you can't check whether a value exists inside; what you get back is an estimate of the set's cardinality.

You might first try simple combinatorics (a la, similarity ~= |A ∩ B| / |A ∪ B|, with |A ∩ B| estimated as |A| + |B| - |A ∪ B|) however this can get hairy depending upon the representation of the HyperLogLog (sparse/dense) and its respective cardinality.

Eventually you realize you can't accept an exponentially-compounding error rate, but still need the raw efficiency, and thus you sacrifice by doubling your storage cost.

Now instead of just the initial sets, you'd also build a parallel MinHash struct:

    setOne_minhashes = [<int>, <int>, <int>]
    setTwo_minhashes = [<int>, <int>, <int>]

    setOne = (bitfield representing setOne)
    setTwo = (bitfield representing setTwo)
    setOneMinHash = (bitfield representation of setOne minhashes)
    setTwoMinHash = (bitfield representation of setTwo minhashes)

I won't go into excessive detail about the minhash algorithm itself, but essentially it provides a way to sample a large set of values by selecting/retaining the smallest output hashes. What that means is, you can then intersect the minhash bitfields as many times over as you like, and extract a predictably accurate similarity index in linear time with definable confidence bounds.
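A minimal sketch of the MinHash idea, for illustration only. It uses the "one seeded hash function per signature slot" variant rather than the bottom-k variant described above, and it leaves out the HyperLogLog encoding entirely. The function names, the choice of `blake2b` as the hash, and the shingle size are all assumptions made for this example, not j42's implementation.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(item, seed):
    """64-bit hash of item; each seed acts as a separate hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the smallest hash over the set."""
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature slots where the two sets share the same minimum."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact |A ∩ B| / |A ∪ B|, for comparison with the estimate."""
    return len(a & b) / len(a | b)
```

For the two example sets above the exact Jaccard index is 5/8; the estimate from two signatures should land near that value, with an error that shrinks as `num_hashes` grows. Signatures have a fixed size regardless of how large the input sets are, which is the point of the technique.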

gull11y ago· 2 in thread

Check this out: http://www.secretlifeofpronouns.com/exercises.php

vixsomnisOP11y ago

I did the bottle exercise. Very interesting.

I see the term I was looking for was LSM (language style matching). That should help me do more research.

Thanks for the link.

gull11y ago

Here's a project idea you are free to steal: write a browser plugin that analyzes HN comments and shows up something on the screen. Like "honest", "deceptive", "gratuitously negative".

Brownie points for HN admins if they add text analysis as feedback when we press submit.

I'm tired of signing up with a brand new HN account every few months to cover my tracks after embarrassing myself with less than noble posts.

bolomega1000011y ago· 2 in thread

A simple one is based on analysing stop words. I guess you could do vector similarity of stop word relative frequency. You could try additional features such as word bigrams and trigrams that contain stop words. In other words, things like "all the words the author uses that commonly surround 'of'", to select common phrases that contain stop words.

There is something about the stop word use pattern that makes them harder to forge.

I've never tried this and I don't know much more about it than that, so I strongly suggest you also find papers that treat authorship attribution by stop words.
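The stop word idea above could be sketched roughly as follows. The 20-word list, the tokenizer regex, and the function names are assumptions chosen for illustration; published work tends to use a few hundred function words and more careful tokenization.

```python
import math
import re
from collections import Counter

# A tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = ("the", "of", "and", "a", "to", "in", "that", "it", "is", "was",
              "for", "on", "with", "as", "but", "at", "by", "not", "or", "this")


def stop_word_profile(text):
    """Relative frequency of each stop word, as a fraction of all tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t in STOP_WORDS)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in STOP_WORDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Two texts are then compared with `cosine_similarity(stop_word_profile(a), stop_word_profile(b))`. Short texts give noisy profiles, so this is only likely to be meaningful on samples of at least a few hundred words.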

pimlottc11y ago

What's a "stop word"? Last word in a sentence?

apendleton11y ago

No, they're usually function words that are in most documents and, at least looking at a document as a bag of words, consequently aren't very analytically useful. Think "the," "of," "and," etc.

MalcolmDiggs11y ago· 1 in thread

When I was in college we turned in papers via "Turnitin" which checked for plagiarism and uniqueness etc.

There's an interesting research paper about their algorithms here: https://www.cs.auckland.ac.nz/courses/compsci725s2c/archive/...

And if you search for "Turnitin Plagiarism Algorithm" I'm sure you'll find a few more resources.

apendleton11y ago

Plagiarism detection is a somewhat different problem, in that it's looking for specific common text rather than just stylistic choices. Usually it's just looking for high percentages of overlapping ngrams between a test document and documents in a corpus, but two different documents written by the same person wouldn't test positive.

nodelessness11y ago· 1 in thread

This one time JK Rowling was found out writing under a pseudonym[1] using this program:

http://evllabs.com/jgaap/w/index.php/Main_Page

[1] http://blogs.wsj.com/speakeasy/2013/07/16/the-science-that-u...

lucb1e11y ago

If I remember correctly, someone actually talked about it and they claimed to have used this program to cover for the person who leaked it. But I may be wrong.

Edit: https://en.wikipedia.org/wiki/The_Cuckoo%27s_Calling#Authors...

> However, it was later reported that Rowling's authorship was leaked to a Times reporter via Twitter by the friend of the wife of a lawyer at Russells Solicitors, who had worked for Rowling. The firm has since apologised[29] and made a "substantial charitable donation" to the Soldiers' Charity as a result of legal action brought by Rowling.[30]

bane11y ago· 1 in thread

Here's a quick one:

1) tokenize each text into a different bag(set) of words.

2) Compute the Jaccard index[1] using the two sets.
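A minimal sketch of those two steps. The tokenizer regex and the convention for two empty texts are assumptions made for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of distinct word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard_index(text_a, text_b):
    """|A ∩ B| / |A ∪ B| over the two vocabularies; 1.0 means identical sets."""
    a, b = tokenize(text_a), tokenize(text_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(a & b) / len(a | b)
```

Because it works on sets, this ignores how often each word is used and only measures vocabulary overlap.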

Here's another

1) tokenize each text into a multi-bag(set) of words, keeping track of token frequency

2) keeping the token frequency, order the sets into lists

3) map the lists of words onto an n-dimensional space (where n is say...all of the words in the two documents) as vectors

4) compute the cosine similarity [2]
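A sketch of the frequency-vector variant. Rather than building explicit n-dimensional vectors, it keeps sparse word counts in a `Counter`, which gives the same result; the tokenizer and function names are assumptions.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Word -> number of occurrences, i.e. a sparse term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-frequency vectors."""
    a, b = term_counts(text_a), term_counts(text_b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Unlike the Jaccard index, this is sensitive to how often each word is used, and it is not affected by one text simply being longer than the other.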

Here's another:

1) tokenize the texts into two bags of words

2) compute the set difference going both ways.

3) does either difference contain discriminator tokens that rule it out as being from that person

4) (optional) extend to 2-, 3-, ..., n-grams

Here's another (a variant of the one above):

1) compute 1-2-3-n-grams from one of the texts

2) insert the n-grams into a set

3) compute the same for the second document and test for set membership

4) compute the number of total n-grams from your second document

5) compute (not-in-set/total-n-grams) * 100 to yield a "uniqueness" measure

6) determine if the second document is "unique" enough
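Steps 1 to 5 of this variant could look roughly like the following. The tokenizer, the default of n = 1..3, and the function names are assumptions; step 6 (choosing a threshold) is left to the caller.

```python
import re


def word_ngrams(text, max_n=3):
    """All word n-grams for n = 1..max_n, as a list of tuples (repeats kept)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def uniqueness(reference_text, candidate_text, max_n=3):
    """Percentage of the candidate's n-grams that never occur in the reference."""
    seen = set(word_ngrams(reference_text, max_n))
    candidate = word_ngrams(candidate_text, max_n)
    if not candidate:
        return 0.0
    unseen = sum(1 for g in candidate if g not in seen)
    return 100.0 * unseen / len(candidate)
```

A score of 0 means every n-gram of the second document already appears in the first; 100 means nothing is shared.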

And another:

1) assuming you have a sample corpus from a writer and want to know if a new text belongs in that corpus

2) follow the method above, but for steps 1 and 2 use the entire reference corpus

And another:

1) produce an ontology of discriminator terms and categories unique to the writer

2) use a named entity recognition (NER) tool of some kind to find those terms in each document

3) use the set of found terms as an alternative to a bag of words for the Jaccard or Vector models above

You may need to play with stopword list removal, tokenization schemes and n-gram windows (for example, omitting 1-grams might focus the analysis on phrase usage vs. vocabulary usage)

1 - https://en.wikipedia.org/wiki/Jaccard_index

2 - https://en.wikipedia.org/wiki/Cosine_similarity

ai_maker11y ago

This is an implementation in Python for the second option, but instead of word frequencies it uses word presence (a binary vector in the end):

https://github.com/atrilla/vsmpy

thatcat11y ago

JStylo might be what you're looking for. https://github.com/psal/jstylo

The same group has also created a text obfuscation tool called Anonymouth that helps you obfuscate your word choices, but it has yet to be released. https://psal.cs.drexel.edu/index.php/JStylo-Anonymouth

benten1011y ago

I may be late to the party, but finally my time to shine! As has been mentioned earlier, the field you are looking for is called stylometry. It has almost a century of history behind it, and it is also the field of my thesis. After looking at what everyone has been saying, I just felt like copying and pasting my 20 page literature review here, but I'd recommend you look at the Narayanan et al. (2012) paper (internet scale authorship attribution). The algorithm it uses is not particularly complex and would take you a week, tops, to implement if you put in a few hours a day, and that's including doing all the related research and catching up with the linear algebra involved if you need to.

gnur11y ago

I've started a similar project myself recently. I check various parameters (reading level score, words per sentence, syllables per word, sentences per paragraph, average word length, average syllable count) and calculate the distance between 2 texts / authors using a simple Euclidean distance.

I started out with the code provided on https://github.com/mac389/ToxTweet/blob/master/textanalyzer.... I use it in a private project, but the results are promising!
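A rough sketch of this feature-distance approach, using only three of the features listed above (words per sentence, average word length, sentences per paragraph). Reading level and syllable counts are left out because they need a syllable heuristic. The splitting rules and function names are assumptions and are not taken from the linked code.

```python
import math
import re


def style_features(text):
    """Return (words per sentence, mean word length, sentences per paragraph)."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Naive sentence split: abbreviations such as "e.g." will be over-split.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not words or not sentences or not paragraphs:
        raise ValueError("text must contain at least one word")
    return (
        len(words) / len(sentences),
        sum(len(w) for w in words) / len(words),
        len(sentences) / len(paragraphs),
    )


def style_distance(text_a, text_b):
    """Euclidean distance between the two feature vectors (smaller = closer)."""
    return math.dist(style_features(text_a), style_features(text_b))
```

The features here are on different scales, so in practice each one would probably need to be standardised across the corpus before the distance is meaningful. `math.dist` requires Python 3.8 or later.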

pdpd11y ago

I wonder if this is how the authors of Truecrypt were identified. I remember reading something similar about this in regards to coding style.

I am sure the Truecrypt authors contributed to more than one project.

MasterScrat11y ago

JGAAP is pretty awesome, it has both Java API and GUI: https://github.com/evllabs/JGAAP

JStylo that was already mentioned is based on JGAAP. You have some more here: http://evllabs.com/jgaap/w/index.php/FAQ#What_other_tools_ar...

CSDude11y ago

There is a publicly documented fingerprinting algorithm (winnowing) that is used in MOSS (Measure Of Software Similarity). You can get ideas from there.

http://theory.stanford.edu/~aiken/publications/papers/sigmod...

fauxfauxpas11y ago

somewhat related - excerpt from Cryptonomicon - The percussionist stands up. "Every radio operator has a distinctive style of keying—we call it his fist. With a bit of practice, our Y Service people can recognize different German operators by their fists—we can tell when one of them has been transferred to a different unit, for example."

and this article by Schneier - Identifying People By Their Writing Style - https://www.schneier.com/blog/archives/2011/08/identifying_p...

espe11y ago

afaik, stylo (https://sites.google.com/site/computationalstylistics/stylo) is the academic go-to solution. it is even sporting a gui

wodenokoto11y ago

You want to look for 'author attribution' as your keyword.

There are 2 main ways of assessing author attribution. One is through stylistic markers, where you look for a set of predefined features, such as the average length per paragraph or the number of times 'whenever' is used. This is highly language dependent.

The other way is through character n-gram analysis. You choose the N for which you want to harvest n-grams, and your author profile is the frequency of the top 2000 n-grams. You compare each author profile with a document's top 2000 n-grams, and the profile with the shortest distance is your match.
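A minimal sketch of the character n-gram profile approach. The distance used here is a sum of squared relative differences, similar in spirit to measures used in the n-gram profile literature, but its exact form, the default of n = 3, and all function names are assumptions made for this example.

```python
from collections import Counter


def char_ngram_profile(text, n=3, top_k=2000):
    """Map the top_k most frequent character n-grams to relative frequencies."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(top_k)}


def profile_distance(p, q):
    """Sum of squared relative differences over n-grams in either profile."""
    dist = 0.0
    for g in p.keys() | q.keys():
        fp, fq = p.get(g, 0.0), q.get(g, 0.0)
        # Dividing by the mean frequency keeps rare n-grams from being drowned out.
        dist += ((fp - fq) / ((fp + fq) / 2)) ** 2
    return dist


def attribute(document, author_texts, n=3, top_k=2000):
    """Return the author whose profile is closest to the document's profile."""
    doc = char_ngram_profile(document, n, top_k)
    return min(
        author_texts,
        key=lambda name: profile_distance(
            doc, char_ngram_profile(author_texts[name], n, top_k)),
    )
```

Here `author_texts` is a hypothetical dict mapping an author name to a concatenation of that author's known writing. Character n-grams need no tokenizer or language-specific feature list, which is why this approach carries over between languages more easily than stylistic markers.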

Robert Layton has a tutorial and some code on N-gram attribution on Github:

* https://github.com/robertlayton/authorship_tutorials

* https://github.com/robertlayton/author-detection

And here's a list of papers I've reviewed while doing a similar project.

[1] Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni. Gender, genre, and writing style in formal written texts. 23(3):321–346, 2003.

[2] John F Burrows. ‘an ocean where each kind...’: Statistical analysis and some major determinants of literary style. Computers and the Humanities, 23(4-5):309–321, 1989.

[3] Georgia Frantzeskou, Efstathios Stamatatos, Stefanos Gritzalis, and Sokratis Katsikas. Source code author identification based on n-gram author profiles. In Artificial Intelligence Applications and Innovations, pages 508–515. Springer, 2006.

[4] Sheena Gardner and Hilary Nesi. A classification of genre families in university student writing. Applied linguistics, 34(1):25–52, 2013.

[6] John Houvardas and Efstathios Stamatatos. N-gram feature selection for authorship identification. In Artificial Intelligence: Methodology, Systems, and Applications, pages 77–86. Springer, 2006.

[7] Patrick Juola. Authorship attribution. Foundations and Trends in information Retrieval, 1(3):233–334, 2006.

[8] Vlado Kešelj, Fuchun Peng, Nick Cercone, and Calvin Thomas. N-gram-based author profiles for authorship attribution. In Proceedings of the conference pacific association for computational linguistics, PACLING, volume 3, pages 255–264, 2003.

[9] Maarten Lambers and Cor J Veenman. Forensic authorship attribution using compression distances to prototypes. In Computational Forensics, pages 13–24. Springer, 2009.

[11] Fiona J Tweedie and R Harald Baayen. How variable may a constant be? measures of lexical richness in perspective. Computers and the Humanities, 32(5):323–352, 1998.

[12] Cor J Veenman and Zhenshi Li. Authorship verification with compression features.

[13] Rong Zheng, Jiexun Li, Hsinchun Chen, and Zan Huang. A framework for authorship identification of online messages: Writing-style features and classification techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393, 2006.
