Web Scraping with Electron

(en.jeffprod.com)

57 points · tazeg95 · 7y ago · 52 comments

Dunedan · 7y ago · 12 in thread

> Is there a better way to surf the web, retrieve the source code of the pages and extract data from them ?

Yes, of course! To get the source code of a web site you don't need a browser and all its complexity. It makes me so sad how far we have come in terms of unnecessary complexity for simple tasks.

If you want to extract data from web pages without requiring hundreds of megabytes for something like Electron, there are lots of scraping libraries out there. There are for example at least two good Python implementations: Scrapy[1] and BeautifulSoup[2].

[1]: https://scrapy.org/

[2]: https://www.crummy.com/software/BeautifulSoup/
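
To make the "no browser needed" point concrete, here is a minimal sketch that fetches a page over plain HTTP and pulls out its links. It deliberately uses only Python's standard library rather than Scrapy or BeautifulSoup, so it runs without installing anything; the names `LinkCollector`, `extract_links`, and `fetch` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url):
    """Plain HTTP GET; no JavaScript is executed."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Something like `extract_links(fetch("https://example.com/"))` would then list the links in the served HTML, which only helps when the data is actually present in that HTML.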

fake-name · 7y ago

This sounds nice, but many modern web pages use extensive client-side rendering. Sure, you can work around that without needing a full JS environment, but doing so is ad hoc and you wind up having to write complex code on a per-site basis.

I do a bunch of web scraping for hobby shit, and I'd love to be able to not have to shell out to Chromium for some sites, but unfortunately the modern web basically means you're stuck with it.

harryf · 7y ago

Also sites with some kind of 2FA / OAuth happening. This _looks_ like it would make it possible to log in manually and then start scraping.

dstick · 7y ago

Correct me if I’m wrong, but neither one supports JavaScript-rendered pages?

You’re right about the overhead though; I’d stay miles away from Electron for scraping, but you’ll need more than a curl wrapper to properly fetch data in all shapes and sizes :) Headless Chromium does do the trick in that regard.

danpalmer · 7y ago

With web scraping you typically don’t want the visuals anyway. JS rendered applications are usually easier to scrape because they have data in a more raw or canonical format available somewhere to do that rendering.
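
One common shape of that "more raw" data is a JSON blob embedded in the page for the client-side code to render. The sketch below assumes the site ships its state in a `<script type="application/json">` tag, which is an assumption about the target page and not something every JS app does; `JsonScriptExtractor` and `embedded_json` are illustrative names.

```python
import json
from html.parser import HTMLParser


class JsonScriptExtractor(HTMLParser):
    """Collect and decode the contents of <script type="application/json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._chunks = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._chunks = []

    def handle_data(self, data):
        # Script text may arrive in several pieces, so buffer it.
        if self._in_json_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            self.payloads.append(json.loads("".join(self._chunks)))


def embedded_json(html):
    """Return every JSON payload embedded in the page, in document order."""
    parser = JsonScriptExtractor()
    parser.feed(html)
    parser.close()
    return parser.payloads
```

When the data is fetched from an API after page load instead, the same idea applies: request that endpoint directly and skip the rendering step.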

tazeg95 (OP) · 7y ago

Sure, but I meant to build a portable app, for end users who are not coders, with a GUI, and for a dedicated purpose, like for example navigating on Facebook.

So I will change the question to this: is there a better way to code a portable application with a graphical user interface to scrape a given site?

Thanks for your comment.

sansnomme · 7y ago

Look up robotic process automation and visual web scraping. Web scraping without having to write code is a well-established field. Just not very popular with the HN crowd for obvious reasons.

Some examples would be Scrapinghub's Portia system and the Kantu startup. There are also established players like UiPath and Visual Web Ripper.

rasengan · 7y ago

You can access the html of the website and use regular expressions.
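
A rough sketch of what that looks like in practice. The `data-price` attribute is a hypothetical piece of markup invented for the example, and patterns like these tend to break as soon as the page layout changes.

```python
import re

# Non-greedy match so only the first <title>...</title> pair is captured.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
# Hypothetical attribute; replace with whatever the target site really uses.
PRICE_RE = re.compile(r'data-price="(\d+(?:\.\d+)?)"')


def page_title(html):
    """Return the page title, or None when there is no <title> element."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def prices(html):
    """Return every data-price value as a float."""
    return [float(p) for p in PRICE_RE.findall(html)]
```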

chinathrow · 7y ago

> like for example navigating on Facebook.

What would you want to scrape there which is not against their ToS and a violation of user privacy in general?

dig1 · 7y ago

Good luck with that :) Any modern website requires a JavaScript interpreter on the client side, so unless you provide some sort of JavaScript interpretation (which can be messy), you'll be able to scrape only simple content with Scrapy/BS.

bootloop · 7y ago

I mean, I guess the point is that it allows you to scrape data after it was rendered by JS.

patrickm1 · 7y ago

You can always use something like ProxyCrawl to scrape JavaScript-rendered pages without using Electron. And it's compatible with Scrapy.

IloveHN84 · 7y ago

You forgot curl in C/C++ which is the most advanced tool out there

TicklishTiger · 7y ago · 9 in thread

I wish there was an easy way to send commands to the console of a browser.

That would be all I need to satisfy all my browser automation tasks.

Without installing and learning any frameworks.

Say there was a Linux command 'SendToChromium' that would do that for Chromium. Then to navigate to some page one could simply do this:

    SendToChromium location.href="/somepage.html"

SendToChromium should return the output of the command. So to get the HTML of the current page, one would simply do:

    SendToChromium document.body.innerHTML > theBody.html

Ideally the browser would listen for this type of command on a local port. So instead of needing a binary 'SendToChromium' one could simply start Chromium in listening mode:

    chromium --listen 12345

And then talk to it via HTTP:

    curl 127.0.0.1:12345/execute?command=location.href="/somepage.html"

SSchick · 7y ago

What you are describing is exactly what Puppeteer uses internally; you might want to explore their code base.

dlkinney · 7y ago

While not currently "easy", there exists the Chrome DevTools Protocol.[0] I'm not aware of a CLI utility that communicates with it, but it wouldn't be impossible to make one that fulfills what you're looking for. A second tool could then act as a REST proxy, if calling the commands via curl is really your jam.

I think you've given my weekend some purpose. Lemme see what I can pull together...

[0] https://chromedevtools.github.io/devtools-protocol/
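
For a sense of what such a tool would have to send, here is a minimal sketch. It assumes Chromium was started with `--remote-debugging-port=9222`; in that mode the browser lists its tabs at `/json`, and commands such as `Runtime.evaluate` go as JSON messages over the WebSocket URL given for each tab. Python's standard library has no WebSocket client, so the sketch only lists targets and builds the message; actually sending it would need a WebSocket library. The function names are made up for the example.

```python
import json
from urllib.request import urlopen


def list_targets(port=9222):
    """Return the browser's open targets (tabs), as reported by /json.

    Each entry includes a "webSocketDebuggerUrl" to connect to.
    """
    with urlopen(f"http://127.0.0.1:{port}/json", timeout=5) as resp:
        return json.load(resp)


def evaluate_message(expression, msg_id=1):
    """Build the JSON text of a Runtime.evaluate command.

    msg_id is echoed back in the reply so responses can be matched
    to requests.
    """
    return json.dumps({
        "id": msg_id,
        "method": "Runtime.evaluate",
        "params": {"expression": expression, "returnByValue": True},
    })
```

The hypothetical `SendToChromium document.body.innerHTML` would then boil down to sending `evaluate_message("document.body.innerHTML")` to the tab's WebSocket and printing the result from the reply.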

TicklishTiger · 7y ago

Yes, that might work. Maybe an even better approach is to use chromium-chromedriver.

I just got it working like this:

    apt install chromium-chromedriver
    chromedriver

This seems to create a service that listens on port 9515 for standardized commands to remote control a chromium instance. The commands seem to be specified by the W3C:

https://www.w3.org/TR/webdriver/

I got it to open a browser with this curl command:

    curl  -d '{ "desiredCapabilities": { "caps": { "nativeEvents": false, "browserName": "chrome", "version": "", "platform": "ANY" } } }'  http://localhost:9515/session

I have not yet figured out how to send JavaScript commands though.
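
For what it's worth, the W3C spec linked above defines an "Execute Script" command at `POST /session/{session id}/execute/sync`, and "Navigate To" at `POST /session/{session id}/url`. The sketch below builds those requests against the port 9515 mentioned in the comment. It is untested against any particular chromedriver version, older builds may expect the legacy endpoint paths and capability format instead, and the helper names are invented for the example.

```python
import json
from urllib.request import Request, urlopen

BASE = "http://localhost:9515"  # chromedriver port from the comment above


def navigate_request(session_id, target):
    """Return (url, body) for the W3C 'Navigate To' command."""
    return f"{BASE}/session/{session_id}/url", {"url": target}


def execute_script_request(session_id, script, args=()):
    """Return (url, body) for the W3C 'Execute Script' command."""
    url = f"{BASE}/session/{session_id}/execute/sync"
    return url, {"script": script, "args": list(args)}


def post(url, body):
    """POST a JSON body and return the 'value' field of the JSON reply."""
    data = json.dumps(body).encode("utf-8")
    req = Request(url, data=data, headers={"Content-Type": "application/json"})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)["value"]


def demo():
    """Open a browser, load a page, and return its body HTML."""
    caps = {"capabilities": {"alwaysMatch": {"browserName": "chrome"}}}
    session_id = post(f"{BASE}/session", caps)["sessionId"]
    post(*navigate_request(session_id, "https://example.com/"))
    return post(*execute_script_request(session_id, "return document.body.innerHTML"))
```

Note that the script body needs an explicit `return` for a value to come back.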

androidgirl · 7y ago

You can do this with Selenium pretty easily, and low-level WebDriver implementations support it too.

TicklishTiger · 7y ago

I would give low-level webdrivers a go. But so far I have not even figured out how to install one for Chromium on Debian.

    apt install chromedriver

gives me:

    Package 'chromedriver' has no installation candidate

There is something called "chromium-chromedriver".

Let me try that ... one moment ...

Ok. So I install and start it via:

    apt install chromium-chromedriver
    chromedriver

Now according to the docs, this should create a browser:

    curl  -d '{ "desiredCapabilities": { "caps": { "nativeEvents": false, "browserName": "chrome", "version": "", "platform": "ANY" } } }'  http://localhost:9515/session

Ha! It works!

So chromedriver might be a solution!

fake-name · 7y ago

I've written something[1] that can basically do this, though it's non-interactive.

The CLI is somewhat incomplete as of now, but it'd be fairly easy to add more comprehensive tools.

[1] https://github.com/fake-name/ChromeController

dstick · 7y ago

What would you do with this solution that can’t already be done with headless browsers, apart from looking at it?

It’s very much like that already: write a script, send it to the browser, let it do its thing, run JavaScript code if you want, and get the final rendered HTML and console output.

TicklishTiger · 7y ago

    What would you do with this solution

Automate the browser

    that can’t already be done

I did not say it can't be done now. But I don't want the overhead of Puppeteer or Selenium, plus the client libraries to control them. These things are fricking monsters. And they will go out of fashion at some point. Simple JavaScript commands will not.

    with headless browsers

I don't want headless. I just want to automate.

aasasd · 7y ago

With ‘native messaging,’ you can have your program communicate with your extension, so the extension could then do everything that's available to it via the API.

Won't be surprised if an extension like this already exists.
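
On the program side, native messaging is a small framing protocol: each message is UTF-8 JSON preceded by its length as a 32-bit integer in native byte order, exchanged over the host's stdin and stdout. A minimal sketch of that framing follows; the function names are made up, and a real host would pass `sys.stdin.buffer` and `sys.stdout.buffer` as the streams and also needs a host manifest registered with the browser, which is not shown.

```python
import io
import json
import struct


def write_message(stream, obj):
    """Frame obj as length-prefixed UTF-8 JSON and write it to a binary stream."""
    payload = json.dumps(obj).encode("utf-8")
    stream.write(struct.pack("=I", len(payload)))  # 4-byte length, native order
    stream.write(payload)
    stream.flush()


def read_message(stream):
    """Read one framed message, or return None when the stream has ended."""
    header = stream.read(4)
    if len(header) < 4:
        return None
    (length,) = struct.unpack("=I", header)
    return json.loads(stream.read(length).decode("utf-8"))
```

The extension would receive whatever object the host writes, for example `{"eval": "document.title"}`, and decide how to act on it with the extension APIs.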

aboutruby · 7y ago · 3 in thread

Interesting but seems less powerful than my current setup:

- I have mitmproxy to capture the traffic / manipulate the traffic

- I have Chrome opened with Selenium/Capybara/chromedriver and using mitmproxy

- I then browse to the target pages, it records the selected requests and the selected responses

- It then replays the requests until they fail (with a delay)

I highly recommend mitmproxy; it's extremely powerful: capture traffic, send responses without hitting the server, block/hang requests, modify responses, modify request/response headers.

Higher-level interfaces can then be built on top. Selenium, for instance, allows you to load Chrome extensions and execute JavaScript on any page. You can also manage many tabs at the same time.

I could make a blog post/demo if people are interested
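
Until that post exists, here is a guess at what the recording step could look like as a mitmproxy addon. This is not the commenter's code: the `/api/` filter and the `RecordSelected` name are placeholders, and the replay step is left out. mitmproxy calls an addon's `response` hook once a full response has arrived.

```python
class RecordSelected:
    """mitmproxy addon that keeps request/response pairs whose URL matches."""

    def __init__(self, needle="/api/"):
        self.needle = needle  # placeholder filter; adjust per target site
        self.records = []

    def response(self, flow):
        # Called by mitmproxy for every completed response.
        if self.needle in flow.request.pretty_url:
            self.records.append({
                "method": flow.request.method,
                "url": flow.request.pretty_url,
                "status": flow.response.status_code,
                "body": flow.response.text,
            })


# mitmproxy looks for this module-level list when loading a script.
addons = [RecordSelected()]
```

Saved as, say, `record.py`, it would be loaded with `mitmdump -s record.py`, with the browser configured to use mitmproxy as its proxy.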

kuhhk · 7y ago

A blog article would be nice. Sounds interesting, but I’m having a hard time understanding. If it’s replaying requests, how do you get it to do things like go to the next page and click on all of the next paginated results?

aboutruby · 7y ago

In my case I can't do the pagination automatically so I have to fetch the pages myself to then have them replayed.

In most cases you would capture the request and change the "page=" parameter (either for an HTML page or an API).
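
As a small illustration of that parameter rewrite, assuming the captured request really does carry a `page` query parameter (the helper names are made up):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page, param="page"):
    """Return url with its page parameter set to the given page number."""
    parts = urlsplit(url)
    # Drop any existing page parameter, keep everything else in order.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    query.append((param, str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def page_urls(url, first, last):
    """Return the URLs for pages first..last inclusive."""
    return [with_page(url, n) for n in range(first, last + 1)]
```

Each generated URL can then be replayed with the headers and cookies of the captured request.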

You could also use Selenium to click on each "next page" link. This could be parallelized with multiple tabs / windows.

The only website that blocks me is Bloomberg because they detect mitmproxy (I didn't care enough to make mitmproxy harder to detect).

Another detail is that regular Chrome doesn't let you load insecure certificates while chromedriver allows that.

Anyway, I will write about all that, I already posted some code on my Twitter: https://twitter.com/localhostdotdev (that I will turn into a blog).

marvel_boy · 7y ago

>I could make a blog post/demo if people are interested

yes, please !

SSchick · 7y ago · 3 in thread

Are there any other advantages over things like webdriver or puppeteer?

nkozyra · 7y ago

Not really.

I also have no idea what cheerio brings to the table here.

Seems like a hefty solution to web scraping

sanxiyn · 7y ago

No, there aren't. Just use WebDriver.

galacticdessert · 7y ago

One thing about Selenium and Puppeteer is that they trigger CAPTCHAs on some websites, making scraping impossible. Perhaps this would fix it?

CGamesPlay · 7y ago · 2 in thread

I'm going to plug my app that does scraping with Electron: https://github.com/CGamesPlay/chronicler

To the commenters who don't understand why this is necessary:

- It reliably loads linked resources in a WYSIWYG fashion, including embedded media and other things that have to be handled in an ad-hoc fashion when using something like BeautifulSoup.

- It handles resources loaded through JavaScript, including HTML5 History API changes.

aloer · 7y ago

Could you explain what Electron offers here over, for example, a browser plugin? I'm not that familiar with the limitations of the WebExtension APIs.

It looks like an interesting project, but only for a few selected sites. For more random browsing I believe I would be too security conscious (https://electronjs.org/docs/tutorial/security) to allow it.

It might be unlikely that some random script on a random site will target Electron, but you never know.

CGamesPlay · 7y ago

Well, in Chrome you can't hook into the network layer to record/replay requests like I've done here (you could possibly fake it by overriding the ServiceWorker, but this feels brittle and I'm not sure it's possible either). I'm not familiar with Firefox extensions, but I'm given to understand you could likely implement this project as a Firefox extension.

Locking down Electron apps to be safe on the larger web is certainly one area where I think Electron could do a lot better. I think my project has followed all of the recommendations and should be safe, but I agree with you that it feels like a bigger attack surface. I personally wanted it to support offline browsing of documentation sites, which are generally pretty "safe" from that perspective.
