What sorts of scraping do you find yourself doing?
What are your biggest frustrations?
What's the coolest hack you've encountered while scraping?
My cofounder and I have been working on a domain-specific language to make scraping quick and easy, so that you can write, say, 100 different website scrapers in less time -- http://dartbanks.com/simplescrape . We'd love feedback on this approach.
I use it to scrape television listing data (http://ktyp.com/rss/tv/ was my old site, and http://code.google.com/p/listocracy/) and more recently to scrape resume data from job posting websites for a (YC-rejected :P ) side project I'm working on.
The hardest part I've encountered with scraping is odd login and form setups. For example, Monster.com uses an external script to try to fool scrapers. A couple of other sites use bizarre redirects across pages. AJAX has also certainly changed the way a lot of screen scraping is done.
Finally, the most useful tool I've used is LiveHTTPHeaders (http://livehttpheaders.mozdev.org/) which is great for following how a site operates.
Edit: For PHP, another interesting tool for scraping is htmlSQL (http://www.jonasjohn.de/lab/htmlsql.htm), which allows HTML to be searched using SQL-like syntax.
I'd be interested in how you tackle this one. I've always used something like Perl/Curl/wget etc for scraping, but (like you say) JavaScript messes that up. I've had moderate success using GreaseMonkey and regexps in JavaScript code, but it's a bit fragile. I'm thinking of using GreaseMonkey + jQuery, since that should allow me to select DOM elements very easily. But if you have a better way, please share :)
My biggest frustrations, right now, are really around getting data from lots of different websites in subtly varied forms. This is a tough problem to automate. I certainly haven't found any tools that make it simple.
I'd be happy with a 50% correctness rate, looking for very loose patterns. I just haven't found a tool and, while I have some ideas for how to do it, it's a major project in itself to produce something that can do this.
For example, imagine writing a scraper that would parse out every food recipe online, whether in forums, blogs, or anywhere else. That's the sort of scraping I'm looking for, and the best I could do is put together a neural network or other system that I can train against human-provided data. Unfortunately, getting such a system to partition the text down to just the recipe would be difficult.
After that it just becomes an issue of removing the cruft around the recipe. I would start with common stuff: splitting things up by <br> or inner <p> since if someone is gonna have something before / after their recipe (say, on a forum) it'll be split up with blank lines somehow (well, usually). This will be another time to use things like close matching and teaching the algorithm what it gets right/wrong so it can weigh things as recipe/not better in the future.
If you do all this and add more specific edge cases as time goes on, I think you'd be able to maintain a 50% correctness rate pretty easily.
Edit: And it'd be much cheaper than a neural network ;)
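A rough Python sketch of the split-then-score idea above. The cue-word list (`RECIPE_CUES`), the 0.15 threshold, and all function names are made up for illustration; a real version would tune these against posts a human has labelled as recipe or not.

```python
import re

# Hypothetical cue words; tune these against labelled examples.
RECIPE_CUES = ("cup", "tbsp", "tsp", "teaspoon", "tablespoon", "oven",
               "bake", "stir", "minutes", "ingredients")

def split_blocks(html):
    """Split markup into text blocks on <br> and <p> boundaries."""
    parts = re.split(r"(?i)<\s*/?\s*(?:br|p)\b[^>]*>", html)
    # Drop any remaining tags and collapse whitespace inside each block.
    blocks = (re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", p)).strip()
              for p in parts)
    return [b for b in blocks if b]

def recipe_score(block):
    """Fraction of the words in a block that look recipe-related."""
    words = re.findall(r"[a-z]+", block.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words
               if w in RECIPE_CUES or w.rstrip("s") in RECIPE_CUES)
    return hits / len(words)

def likely_recipe_blocks(html, threshold=0.15):
    """Keep only the blocks whose score clears the (assumed) threshold."""
    return [b for b in split_blocks(html) if recipe_score(b) >= threshold]
```

Feeding the right/wrong feedback back in could be as simple as adding or removing cue words, which is a lot less machinery than training a network.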
That said, it's likely do-able, as long as you don't need perfect results. There are plenty of sites around that seem to be doing things along these lines - but AFAIK none of them have open-sourced their code.
Meanwhile, I've been a coward and stuck to Beautiful Soup for my scraping projects. In the short term, it works out faster than trying to be too clever.
The login procedure is gothic and took a lot of wiresharking to figure out. .NET has pretty good scraping support in the WebClient and HttpWebRequest classes found in the System.Net namespace.
Will publish results soon... :-)
Quote: "Reproduction is authorised provided the source is acknowledged. However, to prevent disruptions in service to our normal users from bulk downloads of TED data, we reserve the right to check for, and block, attempts to download excessive quantities of documents, particularly using automated or robot-like tools."
... they apparently chose not to exercise that right in this case, the scrape completed last night (all 18 GB of it).
I took a quick glimpse at Beautiful Soup and it seems to be doing something similar - someone let me know if this is correct.
BeautifulSoup is nothing unique, but it can handle malformed data, which saves you a ton of hassle.
http://nostarch.com/frameset.php?startat=webbots
http://www.oreilly.com/catalog/spiderhks/
That probably covers the topic of scraping pretty exhaustively.
http://pyprocessing.berlios.de/
http://nltk.org/index.php/Main_Page
The biggest hurdle is understanding how to navigate through a complex site, such as a forum or a real estate site. We have created a visual tool for this; however, there are other methods. Look at dapper.net, as it is useful.
I am wondering if there could be some collaborative effort from the minds on this site to create something unique and groundbreaking.
1. Take the time to get very familiar with regular expressions. If you think you know your regex pretty well, go to the docs or get a book and find three things you don't understand and understand them fully. Then find three more.
2. The data doesn't have to be perfect. In most cases you can clean it after you've stored it. It's generally better to get more than you think you might need (in terms of data or html/formatting around the data) and then go back and clean it later.
3. Generally, my most successful data mining algorithms involve a lot of hacks. There are very few clean formulas... usually I have to play with the data for a while and fix a lot of one-offs and special cases, and then it ends up coming out ok.
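To make the three points above concrete, here is a small Python sketch, assuming the data of interest is prices (the pattern and function names are invented for this example). It grabs loosely with a regex, keeps the raw match, and leaves normalising to a second pass over the stored data.

```python
import re

# Deliberately loose: grab anything that looks like a price, formatting and all.
PRICE_RE = re.compile(r"(?:USD|\$)\s*[\d.,]+", re.IGNORECASE)

def extract_raw(html):
    """Pass 1: capture generously and store the raw matches untouched."""
    return PRICE_RE.findall(html)

def clean_price(raw):
    """Pass 2, run later over stored data: normalise one raw match to a float."""
    digits = re.sub(r"[^\d.]", "", raw).strip(".")
    try:
        return float(digits)
    except ValueError:
        return None  # odd one-offs (e.g. "1.2.3") get fixed as they turn up
```

Because the raw strings are kept, a better `clean_price` can be rerun over old data without scraping the site again.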
Yahoo Pipes is also fun to play with; and Firebug is the scraper's best friend.
Right now I'm working on scraping public LinkedIn data. In the past I've done Craigslist and Twitter. I haven't done anything really hard, though -- mostly things that can be read as XML.
Here's a few cool links if you're interested in scraping with Ruby: http://del.icio.us/jeremyraines/scraping
Now that I have used it to extract data out of many different types of pages, I'm looking to turn it into a DSL so that the code looks natural. Currently it's just functions which search for tags in HTML. You can then easily filter for some or others. Here is an example:
(extract-all page [(and (tagp _ :a) (classp _ "jdtd4"))])
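For readers who don't use Lisp, roughly the same predicate style can be sketched in Python with the standard `html.parser` module. This is not the commenter's library; `tagp`, `classp`, `all_of`, and `extract_all` are stand-ins written here to mirror the names in the example above.

```python
from html.parser import HTMLParser

def tagp(name):
    """Predicate: the element's tag equals `name`."""
    return lambda tag, attrs: tag == name

def classp(cls):
    """Predicate: `cls` is one of the element's space-separated classes."""
    return lambda tag, attrs: cls in (attrs.get("class") or "").split()

def all_of(*preds):
    """Combine predicates, like `and` in the Lisp example."""
    return lambda tag, attrs: all(p(tag, attrs) for p in preds)

class _Extractor(HTMLParser):
    def __init__(self, pred):
        super().__init__()
        self.pred = pred
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.pred(tag, attrs):
            self.matches.append((tag, attrs))

def extract_all(page, pred):
    """Return (tag, attributes) for every start tag that satisfies `pred`."""
    parser = _Extractor(pred)
    parser.feed(page)
    parser.close()
    return parser.matches
```

With this, `extract_all(page, all_of(tagp("a"), classp("jdtd4")))` would return every link carrying that class.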
It lets you specify which items on an initial / prototype page you want to scrape, and then it builds up a set of rules that then work on future similar instances of that page. Good for scraping eBay, Google, stuff like that.
I created a mashup of AIM + Flickr.
If you use AIM 6 or AIM lite send a message to MyPictureBuddy
then send a message and enjoy.
Basically, you type a keyword and it goes to Flickr and retrieves image information to display pictures right inside your AIM chat session.
I also have another bot that parses HackerNews XML and then displays it in the chat session. The bot name is
HackerNewsYC
In order to keep website users from having 2 accounts I created an interface that scrapes the sign in, sign up, lost password, change password, and couple other screens of the internal system. So when users come to the website and "login" they're actually logging in to the internal system and I just record their session from the internal system so I can masquerade as them as they go about their business.
It's not going to support 100s of connections per second but it gets the job done for their traffic levels (36,000 views the first day of launch).
We scrape public webpages (with an option for content owners to restrict access), and we use the .NET Framework's built-in socket implementation (System.Net namespace) for fetching remote content.
Our biggest frustration was dealing with invalid charset/content encoding in the source webpages. But we resolved it using a custom module. Now everything we parse is Unicode (UTF-8)!
The coolest hack we've encountered while scraping is using Conditional GET via the HTTP If-Modified-Since header.
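The custom module isn't described, so the following Python sketch is only one plausible way to handle bad charset declarations: try the declared charset, then a fixed list of fallbacks (the `utf-8` then `cp1252` order is an assumption), and never raise.

```python
def to_unicode(raw, declared=None, fallbacks=("utf-8", "cp1252")):
    """Decode fetched bytes without trusting the declared charset."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong charset, or a charset name Python doesn't know.
            continue
    # Last resort: keep going and mark the undecodable bytes.
    return raw.decode("utf-8", errors="replace")
```

Calling `.encode("utf-8")` on the result then gives a uniform encoding to store, whatever the source page claimed.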
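A minimal Python sketch of a conditional GET using only the standard library (the function names are invented here, and the original poster used .NET rather than Python). The client sends the time of its last fetch; a server that supports it answers 304 Not Modified with no body, so unchanged pages cost almost nothing.

```python
import urllib.error
import urllib.request
from email.utils import formatdate

def build_conditional_request(url, last_fetch_epoch):
    """Build a GET that lets the server skip the body if nothing changed."""
    stamp = formatdate(last_fetch_epoch, usegmt=True)  # HTTP-date format
    return urllib.request.Request(url, headers={"If-Modified-Since": stamp})

def fetch_if_modified(url, last_fetch_epoch, timeout=10):
    """Return the new body as bytes, or None when the server answers 304."""
    req = build_conditional_request(url, last_fetch_epoch)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: the stored copy is still current
            return None
        raise
```

Servers that ignore the header simply return the full page with a 200, so the fallback behaviour is the same as an ordinary fetch.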
I've used it a lot - it's really great.
Plan on scraping past Billboard charts to let people listen to the radio back in time.
I use Mechanize, both in its Ruby and Python forms (I prefer Ruby) and plain old regular expressions to get the information that I want. Often times I will use a divide and conquer strategy by removing part of the web page (for example, the <head>) and successively paring it down to what I really want.
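The divide-and-conquer approach might look something like this in plain Python with regular expressions. The helper names, the sample page, and its `id="tv"` table are all hypothetical; the point is only the order of operations: throw away big irrelevant regions first, narrow to the region of interest, then run a simple pattern on what is left.

```python
import re

def drop(html, tag):
    """Remove every <tag>...</tag> region, e.g. the whole <head>."""
    return re.sub(rf"(?is)<{tag}\b.*?</{tag}\s*>", "", html)

def between(html, start_marker, end_marker):
    """Keep only the text between two literal markers ('' if not found)."""
    start = html.find(start_marker)
    if start == -1:
        return ""
    start += len(start_marker)
    end = html.find(end_marker, start)
    return html[start:end] if end != -1 else ""

# Made-up page: the <head> contains text that would confuse a naive pattern.
page = ('<html><head><title>Listings</title>'
        '<script>var a = "<td>9:99</td>";</script></head>'
        '<body><div id="nav">...</div>'
        '<table id="tv"><tr><td>8:00</td><td>News</td></tr></table>'
        '</body></html>')

body = drop(drop(page, "head"), "script")           # pare away the noise
table = between(body, '<table id="tv">', "</table>")  # narrow to one region
cells = re.findall(r"<td>(.*?)</td>", table)          # ['8:00', 'News']
```

Running the final pattern on the full page would also have picked up the `9:99` inside the script, which is the kind of false match the paring-down avoids.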
JavaScript can be a problem. What I normally do is actually read the JavaScript on the page, and then recreate that behavior in my Ruby code. Often times this means simply setting some form values (usually hidden) and then submitting the form.
To get the most bang for my buck (developer time wise) I would visit each site with Firebug in inspect mode and hover over the data I want to extract. From there I figure out how I would style that element, and because Hpricot supports CSS selectors I've straight away got a method for pulling that data out of the page.
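The commenter does this in Ruby with Mechanize; the same idea can be sketched in Python with the standard library. The class and function names are invented here, and which values the page's JavaScript would have set is something you still have to work out by reading the script yourself.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

def form_payload(html, overrides=None):
    """Build a POST body from the page's hidden fields plus our own values.

    `overrides` holds whatever the page's JavaScript would have filled in.
    """
    parser = HiddenFieldParser()
    parser.feed(html)
    parser.close()
    fields = dict(parser.fields)
    fields.update(overrides or {})
    return urlencode(fields).encode("ascii")
```

The returned bytes can be passed as the `data` argument of `urllib.request.urlopen` to submit the form without running any JavaScript.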
Maybe have it so you can edit the sample text and language and see the results all on a web page?
I can't say I'm crazy about the syntax, but I'll give this a try when I get home.