Sentiment Analysis on Web-Scraped Data

(blog.kimonolabs.com)

79 points · shrig94 · 11y ago · 8 comments

8 comments · 4 top-level

dingdingdang · 11y ago · 2 in thread

This is a very interesting and well-written article. I must admit that the fully online nature of the tools discourages rather than encourages me: why take the time to learn the complexities of something as ephemeral as what seems like a brand-new web service? Especially when even a large player like Google routinely retires whole platforms when they are not popular enough.

All the same, the tech itself seems solid and the article is, as mentioned, superb, so I'm really just beating the proverbial drum for proper distributed services here (or plain old offline-capable apps).

logn · 11y ago

There are lots of open source tools for these niches.

For sentiment analysis, I'd recommend: http://nlp.stanford.edu/sentiment/code.html

For web scraping, a popular option is Scrapy: http://scrapy.org/

And an unknown web scraping option (and shameless plug): https://github.com/MachinePublishers/ScreenSlicer

For browser automation, see PhantomJS or Selenium: http://phantomjs.org/ http://docs.seleniumhq.org/

For an open source IFTTT-inspired project: https://github.com/cantino/huginn/

smartpants · 11y ago

This is a great list. Thanks!

Profan · 11y ago · 2 in thread

If you haven't yet attempted to build some sort of sentiment analysis yourself, be it rule-based or based on statistical analysis, you should. Even a rudimentary rule-based one is a lot of fun to implement, and it works surprisingly well [0].

One of the harder parts of making a decent one based on statistical analysis, however, is the lack of good training data, other than the analyzed Twitter dataset [1] and a movie reviews one [2].

[0] http://fjavieralba.com/basic-sentiment-analysis-with-python....

[1] http://help.sentiment140.com/for-students/

[2] http://www.cs.cornell.edu/people/pabo/movie-review-data/
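For a rough idea of what a rudimentary rule-based analyzer can look like, here is a minimal sketch. The word lists, the function names, and the single-token negation rule are illustrative assumptions, not the method from the article linked in [0].

```python
import re

# Tiny illustrative lexicons (assumed for this sketch); a real system
# would use a much larger, curated word list.
POSITIVE = {"good", "great", "love", "excellent", "fun"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "boring"}
NEGATORS = {"not", "no", "never"}


def rule_based_score(text):
    """Return +1 per positive word and -1 per negative word.

    A sentiment word directly preceded by a negator has its polarity flipped.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score


def classify(text):
    """Map the signed score to a coarse label."""
    score = rule_based_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Even this much handles simple cases like "not good", though it will miss sarcasm, contractions such as "isn't", and negation that spans more than one word.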

jlees · 11y ago

Good training data's partly hard to come by because there's often reasonably poor inter-annotator agreement on sentiment datasets -- that is to say, humans disagree a lot in how we interpret a phrase. What reads like sarcasm to you might read like genuine enthusiasm to another.
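One common way to put a number on inter-annotator agreement is Cohen's kappa, which corrects the raw agreement rate for the agreement two annotators would reach by chance. The sketch below is a plain-Python illustration for two annotators; the function name and the sample labels are made up for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of four phrases by two annotators.
annotator_1 = ["pos", "pos", "pos", "neg"]
annotator_2 = ["pos", "pos", "neg", "neg"]
kappa = cohens_kappa(annotator_1, annotator_2)  # 0.5
```

Here the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and so kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.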

It's pretty easy to load up a set of data into a crowdsourcing tool and use microtasks to rate it, but my experiences doing so weren't superb (even restricting to native English speakers alone).

A better source of data is starred reviews, where you have the star rating and the review itself -- these come free with a sentiment rating, although there are plenty of caveats around normalization. There are lots of places with review systems like this and some (like Yelp) even make the data available: https://www.yelp.com/academic_dataset
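As a sketch of how star ratings can stand in for sentiment labels: the thresholds below, the choice to drop middling ratings, and the `(stars, text)` tuple layout are all assumptions for illustration, not the actual schema of the Yelp dataset.

```python
def stars_to_label(stars, low=2, high=4):
    """Map a 1-5 star rating to a label; middling ratings return None."""
    if stars <= low:
        return "negative"
    if stars >= high:
        return "positive"
    return None  # ambiguous rating, left out of the training set


def label_reviews(reviews):
    """Turn (stars, text) pairs into (text, label) training examples."""
    labeled = []
    for stars, text in reviews:
        label = stars_to_label(stars)
        if label is not None:
            labeled.append((text, label))
    return labeled


# Hypothetical reviews, for illustration only.
sample = [(5, "Loved it"), (3, "It was fine"), (1, "Never again")]
training_data = label_reviews(sample)
```

Dropping the 3-star reviews is one crude way to sidestep part of the normalization problem, since different reviewers use the middle of the scale very differently.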

Profan · 11y ago

I wasn't aware that Yelp provided a dataset; that's very interesting!

I had this very problem while working on using the output of sentiment analysis to modify sentences so as to invert their sentiment polarity (positive to negative, negative to positive). The datasets I found were either not general enough (movie reviews have many domain-specific terms, which made the text generation step hard) or had a lot of noise (the Twitter dataset).

Evaluating the system was also very hard for the reasons you stated: inter-annotator agreement was beyond terrible.

I'll have to look into whether other review services expose their data as well; that seems appropriate.

hnriot · 11y ago

this is cool, but you can do the same with BeautifulSoup and TextBlob in far fewer lines of code, and you wouldn't need any web services. if TextBlob isn't your thing, there are plenty of SVM implementations out there.

for more interesting sentiment analysis approaches check out sentence vectors, that's the current bleeding edge of research in this area.

most sentiment analysis systems need to use an ensemble classifier because the domain of the text is very important: you have to identify the domain and then apply the appropriate domain-specific model.
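To make the domain point concrete, here is a toy sketch that picks a domain by keyword counts and then scores the text with that domain's lexicon layered over a general one. It is a simple routing scheme rather than a true ensemble, and every keyword, lexicon entry, and function name in it is an assumption made up for the example.

```python
import re

# Hypothetical keywords used to guess the domain of a text.
DOMAIN_KEYWORDS = {
    "movies": {"film", "movie", "actor", "plot"},
    "restaurants": {"food", "waiter", "menu", "dish"},
}

# Hypothetical per-domain lexicons; "cold" only counts against a restaurant.
DOMAIN_LEXICONS = {
    "general": {"good": 1, "bad": -1},
    "movies": {"gripping": 1, "predictable": -1},
    "restaurants": {"fresh": 1, "cold": -1},
}


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def detect_domain(tokens):
    """Return the domain with the most keyword hits, or "general" if none."""
    best, best_hits = "general", 0
    for domain, keywords in DOMAIN_KEYWORDS.items():
        hits = sum(token in keywords for token in tokens)
        if hits > best_hits:
            best, best_hits = domain, hits
    return best


def score(text):
    """Return (domain, score) using the general plus domain-specific lexicon."""
    tokens = tokenize(text)
    domain = detect_domain(tokens)
    lexicon = {**DOMAIN_LEXICONS["general"], **DOMAIN_LEXICONS[domain]}
    return domain, sum(lexicon.get(token, 0) for token in tokens)
```

With these toy lexicons, "cold" lowers the score of a restaurant review but is ignored in a text routed to any other domain.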

silentrob · 11y ago

Very cool. MonkeyLearn looks promising. It would be nice if their docs were a little clearer about uploading CSV files and the expected data structure.

It would also be cool if it did unsupervised learning.
