How to compare two articles?

2 points · bdouglas · 17y ago · 4 comments

Hi...

I'm trying to figure out what ways there are to determine whether two separate articles are the same...

I'm currently researching semantic analysis, but figured I'd ask here as well...

Thoughts/comments?

thanks

jfarmer · 17y ago

What do you mean "are the same?"

bdouglas (OP) · 17y ago

By "the same", I mean I'm trying to determine whether the articles are basically talking about the same or a similar topic...

This would of course take into account similar phrases/words, possibly similar titles, a similar timeframe of creation, possibly a priori knowledge about the author (past works), etc...

thanks

raffi · 17y ago

Hi Bd. Ironically, just yesterday I uploaded a tech demo of something I call kindling, which attempts to correlate articles against news feeds from social websites.

I read a book called Programming Collective Intelligence by Toby Segaran. It's basically machine learning for dummies, very example-heavy, all in Python.

He talks about clustering to group like things together in an unsupervised way. The way this works is to build a vector of word counts from each article and compare these vectors using something known as Pearson distance. The vector of words is known as a feature set. Early on you create this vector in a naive way (i.e. eliminate words that show up in too few articles and words that show up in too many). At the end of the book he talks about feature detection (which I assume is building this vector in a smarter way).

The book really helped me. Pearson correlation is pretty easy to grasp and implement as well.
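To give a rough idea of what that looks like, here is a minimal sketch in plain Python. It is not the code from the book: the helper names (word_counts, build_vocabulary, to_vector, pearson, pearson_distance) and the min_frac/max_frac cutoffs are made up for illustration, and I'm assuming "Pearson distance" means 1 minus the Pearson correlation, so 0 means the word counts move together perfectly.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count how often each word occurs."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def build_vocabulary(count_list, min_frac=0.0, max_frac=1.0):
    """Naive feature selection: keep words whose share of articles
    containing them lies within [min_frac, max_frac]."""
    n = len(count_list)
    doc_freq = Counter()
    for counts in count_list:
        doc_freq.update(counts.keys())  # each article counts once per word
    return sorted(w for w, df in doc_freq.items()
                  if min_frac <= df / n <= max_frac)


def to_vector(counts, vocab):
    """Turn one article's word counts into a vector over the vocabulary."""
    return [counts[w] for w in vocab]  # Counter gives 0 for missing words


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if undefined)."""
    n = len(x)
    if n == 0:
        return 0.0
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0


def pearson_distance(x, y):
    """Assumed definition: 1 - correlation, so smaller means more alike."""
    return 1.0 - pearson(x, y)


if __name__ == "__main__":
    # Made-up sample articles, only here to exercise the functions.
    articles = [
        "Python makes machine learning examples easy to follow",
        "Machine learning examples in Python are easy to write",
        "The football match ended with a late goal",
    ]
    counts = [word_counts(a) for a in articles]
    vocab = build_vocabulary(counts)
    vectors = [to_vector(c, vocab) for c in counts]
    print(pearson_distance(vectors[0], vectors[1]))
    print(pearson_distance(vectors[0], vectors[2]))
```

With a real corpus you would tighten min_frac and max_frac so that very rare words and words that appear almost everywhere get dropped before comparing; good cutoff values depend on your data.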

Good luck.

MaysonL · 17y ago

There's a great Google tech talk on this subject:

http://www.youtube.com/watch?v=AyzOUbkUf3M
