All the same, the tech itself seems solid and the article is, as mentioned, superb, so I'm really just beating the proverbial drum for proper distributed services here (or plain old offline-capable apps).
For sentiment analysis, I'd recommend: http://nlp.stanford.edu/sentiment/code.html
For web scraping, a popular option is Scrapy: http://scrapy.org/
And a lesser-known web scraping option (and shameless plug): https://github.com/MachinePublishers/ScreenSlicer
For browser automation, see PhantomJS or Selenium: http://phantomjs.org/ http://docs.seleniumhq.org/
For an open source IFTTT-inspired project: https://github.com/cantino/huginn/
One of the harder parts of making a decent one based on statistical analysis, however, is the lack of good training data, other than the analyzed Twitter dataset [1] and a movie review dataset [2].
[0] http://fjavieralba.com/basic-sentiment-analysis-with-python....
[1] http://help.sentiment140.com/for-students/
[2] http://www.cs.cornell.edu/people/pabo/movie-review-data/
It's pretty easy to load a set of data into a crowdsourcing tool and use microtasks to rate it, but my experiences doing so weren't great (even when restricting raters to native English speakers).
A better source of data is starred reviews, where you have the star rating and the review itself -- these come free with a sentiment rating, although there are plenty of caveats around normalization. There are lots of places with review systems like this, and some (like Yelp) even make the data available: https://www.yelp.com/academic_dataset
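To make the normalization caveat concrete, here is a minimal sketch of turning star ratings into sentiment labels relative to each reviewer's own average, so a harsh grader's 3 stars and a generous grader's 3 stars aren't treated the same. The record layout, the sample reviews, and the `margin` threshold are all made up for illustration; a real dataset (Yelp's included) will have its own schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (reviewer_id, stars, text). Not a real dataset schema.
REVIEWS = [
    ("u1", 5, "Loved it, would come back."),
    ("u1", 4, "Pretty good overall."),
    ("u1", 5, "Fantastic service."),
    ("u2", 3, "Best meal I've had this year."),
    ("u2", 1, "Cold food and rude staff."),
    ("u2", 2, "Nothing special."),
]

def label_reviews(reviews, margin=0.5):
    """Label each review relative to its reviewer's own average rating.

    `margin` is an assumed threshold: ratings within `margin` stars of the
    reviewer's mean are treated as neutral.
    """
    stars_by_user = defaultdict(list)
    for user, stars, _ in reviews:
        stars_by_user[user].append(stars)
    baselines = {user: mean(vals) for user, vals in stars_by_user.items()}

    labeled = []
    for user, stars, text in reviews:
        delta = stars - baselines[user]
        if delta > margin:
            label = "positive"
        elif delta < -margin:
            label = "negative"
        else:
            label = "neutral"
        labeled.append((text, label))
    return labeled

if __name__ == "__main__":
    for text, label in label_reviews(REVIEWS):
        print(f"{label:8s} {text}")
```

Note that per-reviewer centering needs several reviews per reviewer to mean anything; with only one or two, a plain fixed cutoff is probably the safer fallback.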
I had this very problem while working on using the output of sentiment analysis to modify sentences so as to invert their sentiment polarity (positive to negative, negative to positive). The datasets I found were either never general enough (movie reviews have many domain-specific terms, which made the text generation step hard) or had a lot of noise (the Twitter dataset).
Evaluating the system was also very hard: for the reasons you stated, inter-annotator agreement was beyond terrible.
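For anyone who wants to put a number on "beyond terrible", one common measure for two annotators is Cohen's kappa, which corrects raw agreement for the agreement you'd expect by chance. This is a minimal sketch, not what I actually used; the function name and sample labels are made up for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

if __name__ == "__main__":
    annotator_1 = ["pos", "pos", "neg", "neg"]
    annotator_2 = ["pos", "neg", "neg", "pos"]
    print(cohens_kappa(annotator_1, annotator_2))
```

A kappa near 0 means the annotators agree about as often as chance would predict, even if the raw agreement percentage looks respectable.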
I'll have to look into whether other review services expose their data as well; that seems appropriate.
For more interesting sentiment analysis approaches, check out sentence vectors; that's the current bleeding edge of research in this area.
Most sentiment analysis systems need to use an ensemble classifier because the domain of the text matters a great deal: you have to identify the domain and then apply the appropriate domain-specific model.
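A toy sketch of the "identify the domain, then use the domain-specific model" idea, assuming a keyword-based domain detector and tiny hand-made lexicons standing in for real trained models. Every name, keyword, and weight below is invented for illustration. The point is only that the same word ("unpredictable") can flip polarity depending on the domain.

```python
# Hypothetical keyword sets used to guess the domain of a text.
DOMAIN_KEYWORDS = {
    "movies": {"plot", "film", "actor", "scene"},
    "cars": {"steering", "engine", "brakes", "mileage"},
}

# Hypothetical per-domain lexicons standing in for trained domain models.
DOMAIN_LEXICONS = {
    "movies": {"unpredictable": 1, "gripping": 1, "boring": -1},
    "cars": {"unpredictable": -1, "reliable": 1, "noisy": -1},
}

def tokenize(text):
    """Lowercase whitespace tokenizer that strips basic punctuation."""
    return [word.strip(".,!?;:").lower() for word in text.split()]

def detect_domain(tokens):
    """Pick the domain whose keyword set overlaps the text the most."""
    scores = {
        domain: sum(token in keywords for token in tokens)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    return max(scores, key=scores.get)

def classify(text):
    """Route the text to its domain's lexicon and sum the word scores."""
    tokens = tokenize(text)
    domain = detect_domain(tokens)
    lexicon = DOMAIN_LEXICONS[domain]
    score = sum(lexicon.get(token, 0) for token in tokens)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return domain, label

if __name__ == "__main__":
    print(classify("The plot was unpredictable."))
    print(classify("The steering felt unpredictable."))
```

In a real system the detector would itself be a classifier, and you might weight several domain models by the detector's confidence rather than picking a single one.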
It would also be cool if it did unsupervised learning.