Python Headless Web Browser Scraping on Amazon Linux

fruchterco.com

102 points | steven5158 | 12y ago | 39 comments

fauigerzigerk 12y ago

PhantomJS is brilliant, but Selenium is a questionable choice for this task. For some reason, the creators of Selenium have decided that passing HTTP status codes back through the API is and always will be outside the scope of their project. So if you request a page and it returns 404 you have no way to find out (other than using crude heuristics). This makes Selenium completely unusable for anything I would have used it for.

Fortunately you can do it by using phantomjs directly instead of going through the Selenium WebDriver API. Maybe one day the phantomjs WebDriver API implementation (ghostdriver) will extend the API to pass HTTP status information back to the caller. Until then, this API is unusable (at least for me).
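A common workaround, sketched here in Python with only the standard library: make a separate request for the same URL and read its status alongside the browser session. This is not something the WebDriver API provides, and `http_status` is a hypothetical helper name. It costs a second request, and the server may answer it differently than it answers the browser.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code of a GET for url (redirects are followed)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is carried on the exception.
        return err.code
```

With this, a missing page yields 404 as a return value instead of an exception, so the caller can decide whether to start the browser at all.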

nirvdrum 12y ago

Well, I think the matter is a bit more complicated than that. When dealing with a full browser, you fetch a lot of resources. The status code for the first page fetch may be easily obtained, but your API gets very wonky as soon as you want to get status codes for all linked resources. Even if you managed that, any Ajax requests would complicate things, especially if they have deferred loading. And then you have WebSockets.

There are tools, such as BrowserMob Proxy, far better suited for monitoring HTTP traffic. And they'll get you all the headers. You can even capture to HAR so you can measure performance.
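For a sense of what a HAR capture gives you, here is a minimal Python sketch that reads the status code and timing of every request from a HAR document. The field names (`log.entries[]`, `request.url`, `response.status`, `time`) come from the HAR 1.2 format; the sample data and the helper names are made up for illustration.

```python
# Made-up sample in HAR 1.2 shape; a real capture carries many more fields.
SAMPLE_HAR = {
    "log": {
        "entries": [
            {"request": {"url": "http://example.com/"},
             "response": {"status": 200}, "time": 120.5},
            {"request": {"url": "http://example.com/app.js"},
             "response": {"status": 200}, "time": 40.0},
            {"request": {"url": "http://example.com/api/items"},
             "response": {"status": 404}, "time": 15.5},
        ]
    }
}


def failed_entries(har, threshold=400):
    """Return (status, url) pairs for entries whose status is >= threshold."""
    return [
        (entry["response"]["status"], entry["request"]["url"])
        for entry in har["log"]["entries"]
        if entry["response"]["status"] >= threshold
    ]


def total_time_ms(har):
    """Sum the per-entry elapsed times (milliseconds) across the capture."""
    return sum(entry["time"] for entry in har["log"]["entries"])
```

A HAR file saved to disk can be loaded with `json.load` and passed to the same functions.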

fauigerzigerk 12y ago

Difficult edge cases are never a good reason not to support the 99.9% case.

Also, phantomjs has access to all the information you want and the WebDriver API already has a capabilities negotiation facility.

[Edit] Don't forget that the original URL is the only one supplied by the client of the API. It may be incorrect for very different reasons than all the other resources included by the page itself. That's why it is justified to treat it as a special case.

nirvdrum 12y ago

These aren't edge cases. They're asked about constantly. Most people are using Selenium because they care about everything on the page. Otherwise, your stdlib HTTP client would be sufficient.

That aside, if PhantomJS already has the info, you can always fetch it with executeScript.

If you do feel that strongly about the status code part though, I'd urge you to comment on the public draft of the W3C spec: http://www.w3.org/TR/webdriver/

nirvdrum 12y ago

To follow up to your edit, that may be true in one case. But it's perfectly reasonable to navigate via clicking, anything in the navigate API, JS actions, meta refreshes, and so on. Even in that one case, most people would expect redirects to be followed and basic auth protected pages to submit. Again, all tractable problems, but ones that are likely better handled by an interstitial layer where you can see the entire chain of requests & responses.

spikels 12y ago

PhantomJS handles everything you mention (status codes on large numbers of resources, Ajax, deferred-loading monitoring and HAR output) with the possible exception of WebSockets: I have not tried them and there is very little documentation today, but it should work. The big limitation is that this is WebKit-only right now.

For example: here's the wiki on network monitoring including HAR: https://github.com/ariya/phantomjs/wiki/Network-Monitoring

The API seems pretty clean to me but I guess that is a matter of opinion.

ejk314 12y ago

You could always write a simple proxy in python and simply route all of your traffic through that.

See: http://voorloopnul.com/blog/a-python-proxy-in-less-than-100-...
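To make the proxy idea concrete, here is a deliberately small Python sketch, not the code from the linked post. It handles only plain-HTTP GET requests, does not forward request or response headers, has no HTTPS (CONNECT) support, and records only the final status after redirects. The names `LoggingProxy` and `start_proxy` are hypothetical.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Opener that ignores proxy settings from the environment, so the proxy
# can never be configured to forward requests back to itself.
_direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class LoggingProxy(BaseHTTPRequestHandler):
    seen = []  # (url, status) pairs recorded for every proxied GET

    def do_GET(self):
        # A client talking to an HTTP proxy sends the absolute URL as the path.
        try:
            with _direct.open(self.path, timeout=10) as upstream:
                status, body = upstream.status, upstream.read()
        except urllib.error.HTTPError as err:
            status, body = err.code, err.read()
        except urllib.error.URLError:
            status, body = 502, b""  # upstream unreachable
        except ValueError:
            status, body = 400, b""  # path was not an absolute URL
        LoggingProxy.seen.append((self.path, status))
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the console quiet


def start_proxy(port=0):
    """Start the proxy on localhost in a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), LoggingProxy)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Pointing a browser or HTTP client at `127.0.0.1` and `server.server_address[1]` as its HTTP proxy then fills `LoggingProxy.seen` with the status of each request.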

nirvdrum 12y ago

BrowserMob Proxy is the go-to tool for use with Selenium:

http://bmp.lightbody.net/

fauigerzigerk 12y ago

That would add quite a lot of complexity to achieve something rather trivial.

swinglock 12y ago

Aren't you stuck with JavaScript then? Sure, PhantomJS is awesome, but Python is even in the title, so it's not just a side note.

spikels 12y ago

Yes - Unless you are parsing static HTML you will need the rest of the browser's functionality which is implemented as a JavaScript engine. You will also need the original content from the website which will be in JavaScript.

In theory you could recreate this in another language such as Python but you would have to both parse the JavaScript from the website and implement a full browser.

fauigerzigerk 12y ago

No, phantomjs includes a webserver module. That's what ghostdriver uses to implement the WebDriver API and you can use it to implement a custom API that you call from Python. So you have to use JavaScript to implement the API, but you can use Python to implement your tests or web data extraction or whatever your actual task is.

slaxo 12y ago

For anyone using PhantomJS I'd recommend checking out CasperJS (http://casperjs.org/). It adds some nice features to PhantomJS and takes out a lot of the pain points.

diminoten 12y ago

I find it preferable to determine the requests that jQuery is making and perform them myself to extract the necessary data, rather than load up a whole browser just to do the same thing.

Selenium is terrible performance-wise, and requires a significant investment in environment in order to work reliably. I try to avoid it except when I absolutely cannot.
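A sketch of the direct-request approach in Python: once a traffic inspector shows which URL the page's jQuery code calls, request it yourself and parse the JSON. The endpoint is whatever you discover for the site in question; `fetch_json` is a hypothetical helper, and the `X-Requested-With` header is included on the assumption that the server expects what jQuery normally sends for same-origin Ajax calls.

```python
import json
import urllib.request


def fetch_json(url, timeout=10):
    """GET a JSON endpoint the way the page's own Ajax call would, and decode it."""
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            # Assumption: some servers only answer requests that look like Ajax.
            "X-Requested-With": "XMLHttpRequest",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

The trade-off is that this breaks whenever the site changes its internal endpoints, whereas a full browser only depends on the rendered page.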

ArbitraryCrow 12y ago

I wound up doing this myself, after spending an undue amount of time struggling with a morass of insanely written JavaScript. Fiddler proved indispensable for observing the actual interaction with the web server.

brechin 12y ago

If you're writing Python and need to do something like this, you could try using Phantompy, a Python port of PhantomJS: https://github.com/niwibe/phantompy

It's still "in an early stage of development" but it's on my list of libraries to keep an eye on for when I have time to tackle the JS-heavy sites of the world.

spikels 12y ago

For scraping, phantomjs or casperjs is the best way to go, but you will have to use some JavaScript [1]. Both give you access to everything a WebKit browser user does, with either a Node-style callback syntax (phantomjs) or a procedural/promises-style syntax (casperjs). Easy to set up, simple to use and fast enough for scraping, but only WebKit (for now).

For testing on browsers other than WebKit (or vendor-specific WebKit edge cases) use Selenium. Harder to set up, more complex, probably faster (still slow for testing) but not limited to WebKit.

[1] Sorry folks, but some JavaScript is required to programmatically interact with the web. You also need some HTML and CSS.

xfour 12y ago

One more thing: has anyone used BeautifulSoup in forever? Is the project still active? I mean, the website is cute and all, but I find pyquery (also based on lxml) so much easier for parsing the scraped data.

brechin 12y ago

I'd consider it still active, since it was updated on 2013-06-07: https://pypi.python.org/pypi/beautifulsoup4

I prefer using lxml myself, since I like using XPath queries, but bs4 sometimes parses broken HTML better than any of the provided lxml parsers do.

ianhawes 12y ago

Something to consider is that the trend over the past year has been to use headless browsers over BeautifulSoup, cURL, etc., because headless browsers are harder for anti-scraping systems to detect and can interpret JavaScript.

takluyver 12y ago

That's what the OP is about ;-). But BeautifulSoup isn't a way to retrieve a web page, it's a way to parse HTML. You can get the page with a headless browser, and then transfer the DOM into a BeautifulSoup tree to do your scraping.
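The second half of that pipeline can be sketched in Python. To keep the example dependency-free it uses the standard library's `html.parser` as a stand-in for BeautifulSoup; with bs4 installed, the same string would instead be passed to `BeautifulSoup(html, "html.parser")`. The input is assumed to be the serialized DOM obtained from the headless browser (with the Selenium Python bindings, `driver.page_source`), and `extract_links` is a hypothetical helper.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen in the document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(rendered_html):
    """Return all link targets found in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(rendered_html)
    collector.close()
    return collector.links
```

Because the parser works on the rendered DOM rather than the original response, links inserted by JavaScript are included.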

takluyver 12y ago

BS4, which is still actively developed, got out of the parser game - it can now use lxml (fast) or html5lib (highly tolerant) to parse the HTML. It's kept the convenient interface to dig into the DOM, and it's kept the UnicodeDammit encoding detection system.

616c 12y ago

I recently tried to get back into Selenium for a work-related project and, despite its frustrations, it is one of my favorite open source gems I have found in the last several years. When I showed it to uninitiated web devs, their heads almost exploded from joy and amazement. Your setup with Selenium intrigued me, since the pain point for me has become how difficult it is to maneuver some browsers with Selenium IDE to throw together ideas, if that is even encouraged anymore.

phaer 12y ago

You are installing some devel packages, but I don't see anything compiling? Does the Selenium installation build native extensions? Then the commands should probably be the other way round. Or is phantomjs compiling something on the first run?

Minor nitpick: I don't think it is a good idea to copy a binary directly to /usr/bin, without a package manager. You could just put it into /opt and symlink to /usr/(local/)bin.
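A sketch of that layout in Python, assuming a tarball that contains the binary at a known relative path. All paths and the `install_tarball` name are illustrative; writing to `/opt` and `/usr/local/bin` normally requires root, and `extractall` should only be run on an archive you trust.

```python
import os
import tarfile


def install_tarball(archive, opt_dir, bin_dir, binary_relpath, name):
    """Unpack archive under opt_dir and symlink one binary into bin_dir."""
    os.makedirs(opt_dir, exist_ok=True)
    with tarfile.open(archive) as tar:  # compression is auto-detected
        tar.extractall(opt_dir)
    target = os.path.join(opt_dir, binary_relpath)
    link = os.path.join(bin_dir, name)
    if os.path.lexists(link):
        os.remove(link)  # replace a stale link from a previous version
    os.symlink(target, link)
    return link
```

For the tarball discussed here the call might look like `install_tarball("phantomjs-1.9.1-linux-x86_64.tar.bz2", "/opt", "/usr/local/bin", "phantomjs-1.9.1-linux-x86_64/bin/phantomjs", "phantomjs")`, where the path inside the archive is an assumption. Upgrading then means unpacking a new version next to the old one and repointing the link.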

kawsper 12y ago

The file that he is fetching (phantomjs-1.9.1-linux-x86_64.tar.bz2) contains the executables for his platform, with some usage examples and a readme.

cinquemb 12y ago

That doesn't seem like a very safe thing to do... Don't they have source code for PhantomJS that one can checksum and then run ./configure > make > sudo make install?
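Verifying a download against a published checksum can be sketched in Python as follows. Whether the project publishes checksums for its tarballs is not stated in this thread; the expected digest is assumed to come from a source you trust, and the helper names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """True if the file's SHA-256 matches the expected hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

The same check applies equally to a source tarball and to a prebuilt binary; it only proves the file matches what the publisher listed, not that the contents are safe.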

Wilya 12y ago

PhantomJS is pretty big. IIRC, building it takes quite some time. I think they bundle webkit and the necessary parts of Qt, and you'd have to be out of your mind to build that from source if you can avoid it.

Using official distribution packages would be a better idea, but their freshness can vary, especially on RHEL.

j-kidd 12y ago

Off topic: it is perfectly fine to install things like PyQt / PySide on a headless server. I suppose the problem is because the distro doesn't provide these packages?

Also, PhantomJS works fine in this case because the binary in the tarball is statically compiled. You can find a whole lot of Qt stuff inside the PhantomJS source repository. There ain't no such thing as "truly headless".

techaddict009 12y ago

Wow, I was searching for something similar. I was actually trying to build an app which scrapes data from movie ticket booking sites and tells the user via SMS whether tickets are still available, because not everyone in India has internet access yet.

@Steven5158 thanks for the share.

If anyone here wants help in building SMS apps do contact me.

keypusher 12y ago

We do quite a bit of web scraping / parsing on headless servers with Selenium. What we did was just install some X packages and run VNC server on the headless clients with Firefox. Cool thing about that is you can then go watch the scripts executing if you connect to the VNC session and take a screenshot on failure, etc.

Shakahs 12y ago

Brilliant! I've been using Xvfb for headless operation, didn't even consider using VNC.

JimmaDaRustla 12y ago

I am under the assumption that python-requests would have the same issue: it does not render the page, it only retrieves the original page response.

Very, very good to know when diving into scraping.
