Yes, of course! To get the source code of a web site you don't need a browser and all its complexity. It makes me so sad how far we have come in terms of unnecessary complexity for simple tasks.
If you want to extract data from web pages without requiring hundreds of megabytes for something like Electron, there are lots of scraping libraries out there. There are for example at least two good Python implementations: Scrapy[1] and BeautifulSoup[2].
[1]: https://scrapy.org/
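For a rough idea of how little is needed on static pages, here is a minimal sketch that pulls links out of HTML without a browser. It deliberately uses Python's standard-library `html.parser` rather than Scrapy or BeautifulSoup so it runs without installing anything; `LinkCollector`, `extract_links` and `fetch` are names made up for this example, and it will not see anything a page builds with JavaScript.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url):
    """Download a page as text. No JavaScript is executed."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Something like `extract_links(fetch("https://example.org/"))` would then list the links on a static page.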
I do a bunch of web-scraping for hobby shit, and I'd love to be able to not have to shell out to Chromium for some sites, but unfortunately the modern web basically means you're stuck with it.
You’re right about the overhead though; I’d stay miles away from Electron for scraping, but you’ll need more than a curl wrapper to properly fetch data in all shapes and sizes :) Headless Chromium does do the trick in that regard.
So I will edit this question to this: Is there a better way to code a portable application with a graphical user interface to scrape a given site?
Thanks for your comment.
Some examples would be Scrapinghub's Portia system and the Kantu startup. There are also established players like UiPath and Visual Web Ripper.
What would you want to scrape there which is not against their ToS and a violation of user privacy in general?
That would be all I need to satisfy all my browser automation tasks.
Without installing and learning any frameworks.
Say there was a linux command 'SendToChromium' that would do that for Chromium. Then to navigate to some page one could simply do this:
SendToChromium location.href="/somepage.html"
SendToChromium should return the output of the command. So to get the html of the current page, one would simply do:
SendToChromium document.body.innerHTML > theBody.html
Ideally the browser would listen for this type of command on a local port. So instead of needing a binary 'SendToChromium' one could simply start Chromium in listening mode:
chromium --listen 12345
And then talk to it via http:
curl '127.0.0.1:12345/execute?command=location.href="/somepage.html"'
I think you've given my weekend some purpose. Lemme see what I can pull together...
I just got it working like this:
apt install chromium-chromedriver
chromedriver
This seems to create a service that listens on port 9515 for standardized commands to remote control a chromium instance. The commands seem to be specified by the W3C: https://www.w3.org/TR/webdriver/
I got it to open a browser with this curl command:
curl -d '{ "desiredCapabilities": { "caps": { "nativeEvents": false, "browserName": "chrome", "version": "", "platform": "ANY" } } }' http://localhost:9515/session
I have not yet figured out how to send JavaScript commands though.
apt install chromedriver
gives me: Package 'chromedriver' has no installation candidate
There is something called "chromium-chromedriver". Let me try that ... one moment ...
Ok. So I start it via:
apt install chromium-chromedriver
Now according to the docs, this should create a browser:
curl -d '{ "desiredCapabilities": { "caps": { "nativeEvents": false, "browserName": "chrome", "version": "", "platform": "ANY" } } }' http://localhost:9515/session
Ha! It works! So chromedriver might be a solution!
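On the open question of sending JavaScript: the W3C WebDriver spec linked above defines a "navigate to" endpoint (`POST /session/{id}/url`) and an "execute script" endpoint (`POST /session/{id}/execute/sync`). Below is a minimal sketch of driving them with nothing but the Python standard library, assuming chromedriver is listening on localhost:9515 as in the thread. The helper names (`webdriver_request`, `send`, `dump_body_html`, and so on) are made up for this example, and it is untested against any particular chromedriver version.

```python
import json
from urllib.request import Request, urlopen

BASE = "http://localhost:9515"  # chromedriver's default port, as seen in the thread


def webdriver_request(method, path, payload=None, base=BASE):
    """Build (but do not send) a JSON request for a WebDriver endpoint."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    return Request(
        base + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json; charset=utf-8"},
    )


def send(request):
    """Send a request and return the "value" field of the JSON reply."""
    with urlopen(request, timeout=30) as response:
        return json.load(response)["value"]


def new_session():
    """Start a browser and return its session id (W3C-style capabilities)."""
    caps = {"capabilities": {"alwaysMatch": {"browserName": "chrome"}}}
    return send(webdriver_request("POST", "/session", caps))["sessionId"]


def navigate(session_id, url):
    send(webdriver_request("POST", f"/session/{session_id}/url", {"url": url}))


def execute_script(session_id, script, args=()):
    """Run `script` as a function body in the page and return its result."""
    body = {"script": script, "args": list(args)}
    path = f"/session/{session_id}/execute/sync"
    return send(webdriver_request("POST", path, body))


def quit_session(session_id):
    send(webdriver_request("DELETE", f"/session/{session_id}"))


def dump_body_html(url):
    """Open `url` in a fresh session and return document.body.innerHTML."""
    session_id = new_session()
    try:
        navigate(session_id, url)
        return execute_script(session_id, "return document.body.innerHTML")
    finally:
        quit_session(session_id)
```

With chromedriver running, `dump_body_html("https://example.org/")` would be roughly the `SendToChromium document.body.innerHTML` idea from earlier in the thread. Note that the script is treated as a function body, so it needs an explicit `return`.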
The CLI interface is somewhat incomplete as of now, but it'd be fairly easy to add more comprehensive tools.
It’s very much like that already: write script, send to browser, let it do its thing, run JavaScript code if you want and get the final rendered HTML and console output.
What would you do with this solution to automate the browser that can’t already be done with headless browsers?
I did not say it can't be done now. But I don't want the overhead of Puppeteer, Selenium, plus the client libraries to control them. These things are fricking monsters. And they will go out of fashion at some point. Simple JavaScript commands will not.
I don't want headless. I just want to automate. Won't be surprised if an extension like this already exists.
- I have mitmproxy to capture the traffic / manipulate the traffic
- I have Chrome opened with Selenium/Capybara/chromedriver and using mitmproxy
- I then browse to the target pages, it records the selected requests and the selected responses
- It then replays the requests until they fail (with a delay)
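The last step (replay with a delay until something fails) can be sketched roughly as follows. This is not the poster's code: `replay_until_failure` and the `send` callable are hypothetical, and `send` is assumed to return an `(ok, response)` pair so that the loop stays independent of whatever HTTP client or proxy actually re-sends the captured request.

```python
import time


def replay_until_failure(send, request, delay_seconds=1.0, max_attempts=100,
                         sleep=time.sleep):
    """Re-send `request` until `send` reports a failure.

    `send(request)` is a caller-supplied (hypothetical) function returning
    (ok, response). Successful responses are collected and returned.
    `max_attempts` is a safety cap so the loop cannot run forever.
    """
    responses = []
    for _ in range(max_attempts):
        ok, response = send(request)
        if not ok:
            break  # first failure ends the replay
        responses.append(response)
        sleep(delay_seconds)  # be polite between replays
    return responses
```

Passing `sleep` in as a parameter is only there to make the loop easy to test without waiting.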
I highly recommend mitmproxy, it's extremely powerful: capture traffic, send responses without hitting the server, block/hang requests, modify responses, modify requests/responses headers.
Then higher-level interfaces can be built on top: Selenium, for instance, allows you to load Chrome extensions and execute JavaScript on any page. You can also manage many tabs at the same time.
I could make a blog post/demo if people are interested
In most cases you would capture the request and change the "page=" parameter (either for an HTML page or an API).
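Rewriting the "page=" parameter of a captured URL needs only the standard library. A small sketch, where `with_page` and `page_urls` are names invented for this example and the parameter is assumed to be a plain query-string field:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page, param="page"):
    """Return `url` with its `param` query field set to `page`.

    Any existing value is dropped and the new one is appended at the end
    of the query string; other fields keep their order.
    """
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    query.append((param, str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def page_urls(url, first, last, param="page"):
    """Build the URLs for pages first..last (inclusive)."""
    return [with_page(url, n, param) for n in range(first, last + 1)]
```

Each generated URL can then be fetched or replayed in turn, stopping at the first empty or failing page.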
You could also use Selenium to click on each "next page" link. This could be parallelized with multiple tabs / windows.
The only website that blocks me is Bloomberg because they detect mitmproxy (I didn't care enough to make mitmproxy harder to detect).
Another detail is that regular Chrome doesn't let you load insecure certificates while chromedriver allows that.
Anyway, I will write about all that, I already posted some code on my Twitter: https://twitter.com/localhostdotdev (that I will turn into a blog).
Yes, please!
I also have no idea what cheerio brings to the table here.
Seems like a hefty solution to web scraping
To the commenters who don't understand why this is necessary:
- It reliably loads linked resources in a WYSIWYG fashion, including embedded media and other things that have to be handled in an ad-hoc fashion when using something like BeautifulSoup.
- It handles resources loaded through JavaScript, including HTML5 History API changes.
It looks like an interesting project, but only for a few selected sites. For more random browsing I believe I would be too security-conscious (https://electronjs.org/docs/tutorial/security) to allow it.
It might be unlikely that some random script on a random site will target Electron, but you never know.
Locking down Electron apps to be safe on the larger web is certainly one area where I think Electron could do a lot better. I think my project has followed all of the recommendations and should be safe, but I agree with you that it feels like a bigger attack surface. I personally wanted it to support offline browsing of documentation sites, which are generally pretty "safe" from that perspective.