This is interesting to me because it is extremely difficult to change the vocabulary I use in writing and speaking. Being able to estimate the amount of similarity between two pieces of text would be useful.
The closest I can think of right now would be the proprietary algorithms used to check for plagiarism (for schools and universities, for instance).
Are there any publicly available algorithms for this? Where can I go to learn more? (Academic journals?) Am I just DDGing the wrong search terms?
http://www.cs.berkeley.edu/~dawnsong/papers/2012%20On%20the%...
There's a lot of public work on the topic, but it looks like right now the best place to look is still in academic papers (I don't know of any open source libraries, for example).
I was a close friend of Emil for a long time. Before he passed, I had gotten busy with work and hadn't spoken to him in a while. I saw some of our mutual friends the weekend before it happened and had planned to call him. Really wish I had made that call sooner.
"JGAAP is intended to tackle two different problems, firstly to allow people unfamiliar with machine learning and quantitative analysis the ability to use cutting edge techniques on their text based stylometry / textometry problems, and secondly to act as a framework for testing and comparing the effectiveness of different analytic techniques' performance on text analysis quickly and easily."
[1]: http://evllabs.com/jgaap/w/
Looks like there are some other recommendations at http://evllabs.com/jgaap/w/index.php/FAQ#What_other_tools_ar...
I'm surprised there isn't any open source effort in this area yet (I couldn't find any either). It's just as important as Tor and other anonymity services, since it affects not just passively consuming information, but actively creating it.
UPDATE: fix typos.
From the conclusions: "Translation with widely available machine translation services does not appear to be a viable mode of circumvention. Our evaluation did not demonstrate sufficient anonymization and the translated document has, at best, questionable grammar and quality."
English -> Filipino (Tagalog) -> Chinese-simplified (Mandarin?) -> English
I remember reading an article about a year ago (NSA) to identify the
user, based on how they are written, vocabulary, spelling errors,
grammar, language, and so on.
It is interesting to me, because it is difficult to change the written
and spoken word in use. It can be estimated that there are between two
characters similar amount of help.
Recently I can think of now is to check plagiarism (used in schools and
universities, for example) is proprietary algorithm.
Are there any public this algorithm? I can find out more information?
(Academic journals?) I just DDGing wrong search terms?
I have to say, this is a great idea. There was some information lost in
transit, but most of my thoughts came through (albeit broken). It's probably
worse since I used multiple intermediaries and Mandarin doesn't
map onto English (or vice-versa) in grammar or vocabulary.
edit: A site for this exists. http://ackuna.com/badtranslator
Original:
In the beginning God created the heaven and the earth. And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
[a half dozen languages later...]
Result:
God created heaven and earth in the beginning. And the earth was formless and empty, and darkness was on the face of the abyss, man. Answer the Spirit of God on the water surface.
If you chain through European languages you barely lose anything. Doing English->Dutch->German->English is a good set to use.
Basically the first step would be shingling the text (choosing a sampling domain) and generating a MinHash struct (computationally cheap) which can then be used to find the "similarity" between sets, or, the "Jaccard Index."
If you're clever about this, you can use HyperLogLogs to encode these MinHash structs gaining a great deal of speed with a marginal error rate, all while allowing for arbitrary N-levels of intersection.
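To make the shingling + MinHash step concrete, here is a rough sketch in Python. Everything in it is my own illustration rather than anything from a particular library: the function names, the 3-word shingle size, the 128 seeded hash functions, and the use of salted BLAKE2b as the hash family are all assumptions. It leaves out the HyperLogLog encoding entirely.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(item, seed):
    """Deterministic 64-bit hash of a string; the seed picks the hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the smallest hash over the set.

    Assumes items is non-empty (min() of an empty set raises ValueError).
    """
    return [min(_seeded_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard index."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

You would call `minhash_signature(shingles(text))` once per document and then compare signatures with `estimate_jaccard`. More hash functions tighten the estimate at the cost of more hashing per document.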
If you're looking to build a model to analyze two (or N) text bodies for stylometric similarities, I'd approach the problem in two steps:
1) Minimize the relevant input text.
- Use a Bernoulli/categorical distribution to weight words according to uniqueness--NLP and sentiment extraction techniques may also help
- Design a Markov process to represent more complex phrasing patterns for the text as a whole
- Filter by a variable threshold to minimize the resulting set of shingles/bins/"interesting nodes" into a computationally-manageable #
2) Use an efficient MinHash intersection to compute a similarity vector (0-1) for the two texts.
I think given the prevalence of training data (I mean, what's more ubiquitous than the written word...) you could probably tune this to a reasonable accuracy and efficient complexity.
Just a 5m thought exercise, but if anyone else has ideas I'd be curious as well :)
> "... use HyperLogLogs to encode these MinHash structs gaining a great deal of speed with a marginal error rate, all while allowing for arbitrary N-levels of intersection"
Thanks
Let's say we have two sets, and we're trying to find out how similar they are:
setOne = ['the','brown','fox','jumped','a','log']
setTwo = ['the','quick','brown','log','jumped','over','the','fox']
You could use an array intersection when it's small, but if you want to do this efficiently at scale you need to take advantage of probabilistic data structures.
Let's say you created a HyperLogLog for each set:
setOne = (bitfield representing setOne)
setTwo = (bitfield representing setTwo)
Now HyperLogLogs are cool, because you can merge them together w/o losing anything: the merged sketch is the sketch of the union. You can't retrieve the data, only an estimate of how many distinct values went in. (Checking whether a value exists inside, with possible false positives but never a false negative, is what a Bloom filter gives you, not a HyperLogLog.)
You might first try simple combinatorics (a la, similarity ~= |A ∩ B| / |A ∪ B|, with |A ∩ B| estimated as |A| + |B| - |A ∪ B|), however this can get hairy depending upon the representation of the HyperLogLog (sparse/dense) and its respective cardinality.
Eventually you realize you can't accept an exponentially-compounding error rate, but still need the raw efficiency, and thus you sacrifice by doubling your storage cost.
Now instead of just the initial sets, you'd also build a parallel MinHash struct:
setOne_minhashes = [<int>, <int>, <int>]
setTwo_minhashes = [<int>, <int>, <int>]
setOne = (bitfield representing setOne)
setTwo = (bitfield representing setTwo)
setOneMinHash = (bitfield representation of setOne minhashes)
setTwoMinHash = (bitfield representation of setTwo minhashes)
I won't go into excessive detail about the minhash algorithm itself, but essentially it provides a way to sample a large set of values by selecting/retaining the smallest output hashes. What that means is, you can then intersect the minhash bitfields as many times over as you like, and extract a predictably accurate similarity index in linear time with definable confidence bounds.
I see the term I was looking for was LSM (language style matching). That should help me do more research.
Thanks for the link.
Brownie points for HN admins if they add text analysis as feedback when we press submit.
I'm tired of signing up with a brand new HN account every few months to cover my tracks after embarrassing myself with less than noble posts.
There is something about the stop word use pattern that makes them harder to forge.
I've never tried this and I don't know much more about it than that, so I strongly suggest you also find papers that treat authorship attribution by stop words.
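For what it's worth, a stop word usage pattern could be captured with something as small as the sketch below. The ten-word list, the function names, and the Manhattan distance are all assumptions of mine for illustration; published work uses much longer function-word lists and more careful distance measures.

```python
import re
from collections import Counter

# Tiny illustrative list (an assumption); real studies use hundreds of function words.
STOP_WORDS = ("the", "of", "and", "a", "to", "in", "that", "it", "is", "but")


def stop_word_profile(text):
    """Relative frequency of each stop word, in the fixed order of STOP_WORDS."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w in STOP_WORDS)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in STOP_WORDS]


def profile_distance(p, q):
    """Manhattan distance between two profiles; smaller means more alike."""
    return sum(abs(a - b) for a, b in zip(p, q))
```

Comparing `stop_word_profile(known_text)` against `stop_word_profile(unknown_text)` with `profile_distance` gives a crude score; it needs fairly long texts before the frequencies settle down.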
There's an interesting research paper about their algorithms here: https://www.cs.auckland.ac.nz/courses/compsci725s2c/archive/...
And if you search for "Turnitin Plagiarism Algorithm" I'm sure you'll find a few more resources.
http://evllabs.com/jgaap/w/index.php/Main_Page
[1] http://blogs.wsj.com/speakeasy/2013/07/16/the-science-that-u...
Edit: https://en.wikipedia.org/wiki/The_Cuckoo%27s_Calling#Authors...
> However, it was later reported that Rowling's authorship was leaked to a Times reporter via Twitter by the friend of the wife of a lawyer at Russells Solicitors, who had worked for Rowling. The firm has since apologised[29] and made a "substantial charitable donation" to the Soldiers' Charity as a result of legal action brought by Rowling.[30]
1) tokenize each text into a different bag(set) of words.
2) Compute the Jaccard index[1] using the two sets.
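A minimal sketch of those two steps, assuming a simple lowercase regex tokenizer (the tokenizer and the function names are my own choices, not a standard):

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of distinct word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


score = jaccard(
    tokenize("the brown fox jumped a log"),
    tokenize("the quick brown log jumped over the fox"),
)
print(score)  # 5 shared tokens out of 8 distinct tokens -> 0.625
```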
Here's another
1) tokenize each text into a multi-bag(set) of words, keeping track of token frequency
2) keeping the token frequency, order the sets into lists
3) map the lists of words onto an n-dimensional space (where n is say...all of the words into the two documents) as vectors
4) compute the cosine similarity [2]
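Roughly, in Python, assuming the same kind of simple regex tokenizer (names are illustrative). A `Counter` keyed by word stands in for the ordered vector, since only the words present in both texts contribute to the dot product:

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Multiset (bag) of lowercase word tokens with their counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

For example, `cosine_similarity(term_frequencies("a a b"), term_frequencies("a b b"))` has a dot product of 4 and norms of sqrt(5) each, so it comes out at about 0.8.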
Here's another:
1) tokenize the texts into two bags of words
2) compute the set difference going both ways.
3) does either difference contain discriminator tokens that rule it out as being from that person?
4) (optional) extend to 2-grams, 3-grams, ... n-grams
Here's another (a variant of the one above):
1) compute 1-2-3-n-grams from one of the texts
2) insert the n-grams into a set
3) compute the same for the second document and test for set membership
4) compute the number of total n-grams from your second document
5) compute (not-in-set / total-n-grams) * 100 to yield a "uniqueness" measure
6) determine if the second document is "unique" enough
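Steps 1 through 5 might look like this as a sketch. It assumes word n-grams up to length 3 and a simple regex tokenizer; the function names are made up for the example, and the threshold for "unique enough" in step 6 is left to you.

```python
import re


def word_ngrams(text, max_n=3):
    """All word n-grams of length 1..max_n, as a list of tuples."""
    words = re.findall(r"[a-z']+", text.lower())
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(tuple(words[i:i + n]))
    return grams


def uniqueness(reference_text, candidate_text, max_n=3):
    """Percentage of the candidate's n-grams that never occur in the reference."""
    known = set(word_ngrams(reference_text, max_n))
    candidate = word_ngrams(candidate_text, max_n)
    if not candidate:
        return 0.0
    unseen = sum(1 for gram in candidate if gram not in known)
    return unseen / len(candidate) * 100
```

To test a new text against a whole corpus, pass the concatenated corpus as `reference_text`.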
And another:
1) assuming you have a sample corpus from a writer and want to know if a new text belongs in that corpus
2) follow the method above but for step #1 and 2 do it with the entire reference corpus
And another:
1) produce an ontology of discriminator terms and categories unique to the writer
2) use a named entity recognition (NER) tool of some kind to find those terms in each document
3) use the set of found terms as an alternative to a bag of words for the Jaccard or Vector models above
You may need to play with stopword list removal, tokenization schemes and n-gram windows (for example, omitting 1-grams might focus the analysis on phrase usage vs. vocabulary usage)
The same group has also created a text obfuscation tool called Anonymouth that helps you obfuscate your word choices, but it has yet to be released. https://psal.cs.drexel.edu/index.php/JStylo-Anonymouth
I started out with the code provided on https://github.com/mac389/ToxTweet/blob/master/textanalyzer.... I use it in a private project, but the results are promising!
I am sure the Truecrypt authors contributed to more than one project.
JStylo, which was already mentioned, is based on JGAAP. You have some more here: http://evllabs.com/jgaap/w/index.php/FAQ#What_other_tools_ar...
http://theory.stanford.edu/~aiken/publications/papers/sigmod...
and this article by Schneier - Identifying People By Their Writing Style - https://www.schneier.com/blog/archives/2011/08/identifying_p...
There are 2 main ways of approaching authorship attribution. One is through stylistic markers, where you look for a set of predefined features. Examples are the average length per paragraph, or the number of times 'whenever' is used. This is highly language dependent.
The other way is through character n-gram analysis. You choose an N for which to harvest n-grams, and your author profile is the frequency of the top 2000 n-grams. You then compare each profile with a document's top 2000 n-grams, and the profile with the shortest distance is your match.
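A bare-bones sketch of the n-gram profile idea, to give a feel for it. The function names are hypothetical, and the relative-difference dissimilarity is just one choice found in the n-gram profile literature; I'm not claiming it is what any particular tool or tutorial uses.

```python
from collections import Counter


def char_ngram_profile(text, n=3, top=2000):
    """Relative frequencies of the `top` most common character n-grams."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(top)}


def profile_distance(p, q):
    """Sum of squared relative differences over n-grams in either profile."""
    distance = 0.0
    for g in p.keys() | q.keys():
        fp, fq = p.get(g, 0.0), q.get(g, 0.0)
        # fp + fq > 0 here because g occurs in at least one profile
        distance += ((fp - fq) / ((fp + fq) / 2)) ** 2
    return distance


def attribute(document, author_profiles, n=3, top=2000):
    """Return the author whose profile is closest to the document's profile."""
    doc_profile = char_ngram_profile(document, n, top)
    return min(
        author_profiles,
        key=lambda author: profile_distance(author_profiles[author], doc_profile),
    )
```

Here `author_profiles` is a dict mapping an author name to the profile built from that author's known writing with the same `n` and `top`.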
Robert Layton has a tutorial and some code on N-gram attribution on Github:
* https://github.com/robertlayton/authorship_tutorials
* https://github.com/robertlayton/author-detection
And here's a list of papers I've reviewed while doing a similar project.
[1] Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni. Gender, genre, and
writing style in formal written texts.
23(3):321–346, 2003.
[2] John F Burrows. ‘an ocean where each kind...’: Statistical analysis and some major determinants
of literary style. Computers and the Humanities, 23(4-5):309–321, 1989.
[3] Georgia Frantzeskou, Efstathios Stamatatos, Stefanos Gritzalis, and Sokratis Katsikas. Source
code author identification based on n-gram author profiles. In Artificial Intelligence Applications and Innovations, pages 508–515. Springer, 2006.
[4] Sheena Gardner and Hilary Nesi. A classification of genre families in university student writing.
Applied linguistics, 34(1):25–52, 2013.
[6] John Houvardas and Efstathios Stamatatos. N-gram feature selection for authorship identification. In Artificial Intelligence: Methodology, Systems, and Applications, pages 77–86. Springer,
2006.
[7] Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval,
1(3):233–334, 2006.
[8] Vlado Kešelj, Fuchun Peng, Nick Cercone, and Calvin Thomas. N-gram-based author profiles
for authorship attribution. In Proceedings of the conference pacific association for computational
linguistics, PACLING, volume 3, pages 255–264, 2003.
[9] Maarten Lambers and Cor J Veenman. Forensic authorship attribution using compression distances to prototypes. In Computational Forensics, pages 13–24. Springer, 2009.
[11] Fiona J Tweedie and R Harald Baayen. How variable may a constant be? measures of lexical
richness in perspective. Computers and the Humanities, 32(5):323–352, 1998.
[12] Cor J Veenman and Zhenshi Li. Authorship verification with compression features.
[13] Rong Zheng, Jiexun Li, Hsinchun Chen, and Zan Huang. A framework for authorship
identification of online messages: Writing-style features and classification techniques. Journal of the
American Society for Information Science and Technology, 57(3):378–393, 2006.