Web scraping via JavaScript runtime heap snapshots

(adriancooney.ie)

354 points · adriancooney · 4y ago · 65 comments

52 comments · 25 top-level

anyfactor · 4y ago · 4 in thread

Very interesting. Can't wait to give it a shot.

I personally use a combination of xpath, basic math and regex, so this class/id security solution isn't a major deterrent. A couple of times I did find it a hassle to scrape data embedded in iframes, and I can see the heap snapshots treat iframes differently.

Also, if a website takes the extra steps to block web scrapers, identification of elements is never the main problem. It is always IP bans and other security measures.

After all that, I do look forward to using something like this and switching to a nodejs based solution soon. But if you are trying web scraping at scale, reverse engineering should always be your first choice. Not only does it give you a faster solution, it is more ethical (IMO) as you are minimizing your impact on the site's resources. Rendering full website resources is always my last choice.

timtom39 · 4y ago

> But if you are trying web scraping at scale, reverse engineering should always be your first choice. Not only does it give you a faster solution, it is more ethical (IMO) as you are minimizing your impact on the site's resources. Rendering full website resources is always my last choice.

I find my time is by far the most limited resource. I am usually scraping huge corporations at scale and don't care/doubt I will impact their resources. If they would open their APIs I would use those.

That being said, I often end up reverse engineering to preserve my own resources. I can and do run thousands of instances of chrome but it isn't cheap.

Also, related to IPs, carrier grade NAT has been a blessing ;)

anyfactor · 4y ago

> carrier grade NAT

Are you using something you have built or a service?

woodpanel · 4y ago

> is more ethical

How do you deal with pages that use JS to load their content (e.g. SERPs) and restrict those endpoints to be called from within that page?

I'm lucky if I can use cheerio to just traverse the DOM on a given page, but increasingly I have to render the page and that "scales" as well, at least in terms of maintainability since I can more or less use the same API to traverse the (then JS-modified) DOM

10000truths · 4y ago

You can observe the network requests that the JS makes under the Network tab of the Developer Tools console. The restrictions you mention can be bypassed by setting the Origin and Referer HTTP headers to whatever satisfies the server.
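As a rough sketch of the header approach (Python standard library only; the URLs are placeholders, not taken from any real site), a request that presents itself as coming from the page might be built like this:

```python
import urllib.request
from urllib.parse import urlsplit

# Hypothetical values: substitute the endpoint seen in the Network tab
# and the page that normally issues the call.
API_URL = "https://example.com/api/search?q=test"
PAGE_URL = "https://example.com/search"


def build_request(api_url: str, page_url: str) -> urllib.request.Request:
    """Build a GET request carrying the Origin and Referer of the page."""
    parts = urlsplit(page_url)
    origin = f"{parts.scheme}://{parts.netloc}"  # scheme + host only, no path
    return urllib.request.Request(
        api_url,
        headers={
            "Origin": origin,
            "Referer": page_url,  # full URL of the calling page
            "Accept": "application/json",
        },
    )


request = build_request(API_URL, PAGE_URL)
# Sending it is one more call: urllib.request.urlopen(request).read()
```

Whether these two headers are enough depends on the server; some endpoints also check cookies or tokens issued to the page.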

BbzzbB · 4y ago · 4 in thread

This is great, thanks a lot.

It's my understanding that Playwright is the "new Puppeteer" (even with core devs migrating). I presume this sort of technique would be feasible on Playwright too? Do you think there's any advantage or disadvantage of using one over the other for this use case, or it's basically the same (or I'm off base and they're not so interchangeable)?

I'm basing my personal "scraping toolbox" off Scrapy which I think has decent Playwright integration, hence the question if I try to reproduce this strategy in Playwright.

mdaniel · 4y ago

My understanding of Playwright is that it's trying to be the new Selenium, in that it's a programming language orchestrating the WebDriver protocol

That means that if you are running against Chromium, this will likely work, but unless Firefox has a similar heapdump function, it is unlikely to work[1]. And almost certainly not Safari, based on my experience. All of that is also qualified by whether Playwright exposes that behavior, or otherwise allows one to "get into the weeds" to invoke the function under the hood

1 = as an update, I checked and Firefox does have a memory snapshot feature, but the file it saved is some kind of binary encoded thing without any obvious strings in it

I didn't see any such thing in Safari

asabla · 4y ago

Well, kind of for Firefox: there is this profiling tool which you could use (semi-built in)

https://github.com/firefox-devtools/profiler (it lets you save a report in json.gz format)

nmstoker · 4y ago

I had understood that Playwright actually used the DevTools protocol rather than the WebDriver protocol, as mentioned here:

https://github.com/microsoft/playwright/issues/4862

And there's a bit of detail about how they're different here:

https://stackoverflow.com/q/50939116/142780

However that's more a detail and doesn't really undermine your point about Firefox / Safari being handled differently, it's just that Playwright implemented their own versions of the protocol for those two non-Chromium based browsers

nousermane · 4y ago

> Firefox does have a memory snapshot feature, but the file it saved is some kind of binary encoded thing without any obvious strings in it

Those .fxsnapshot files are gzipped binary heaps. There is a 3rd-party decoder for it:

https://github.com/jimblandy/fxsnapshot

superasn · 4y ago · 3 in thread

Awesome, I wonder if there is a possibility to create a chrome extension that works like 'Vue devtools' and shows the heap and changes in real-time and maybe allows editing. That would be amazing for learning / debugging.

> We use the --no-headless argument to boot a windowed Chrome instance (i.e. not headless) because Google can detect and thwart headless Chrome - but that's a story for another time.

Use `puppeteer-extra-plugin-stealth`(1) for such sites. It defeats a lot of bot identification including recaptcha v3.

(1) https://www.npmjs.com/package/puppeteer-extra-plugin-stealth

acemarke · 4y ago

Not _quite_ what you're describing, but Replay [0], the company I work for, _is_ building a true "time-traveling debugger" for JS. It works by recording the OS-level interactions with the browser process, then re-running those in the cloud. From the user's perspective in our debugging client UI, they can jump to any point in a timeline and do typical step debugging. However, you can also see how many times any line of code ran, and also add print statements to any line that will print out the results from _every time that line got executed_.

So, no heap analysis per se, but you can definitely inspect the variables and stack from anywhere in the recording.

Right now our debugging client is just scratching the surface of the info we have available from our backend. We recently put together a couple small examples [1] that use the Replay backend API to extract data from recordings and do other analysis, like generating code coverage reports and introspecting React's internals to determine whether a given component was mounting or re-rendering.

Given that capability, we hope to add the ability to do "React component stack" debugging in the not-too-distant future, such as a button that would let you "Step Back to Parent Component". We're also working on adding Redux DevTools integration now (like, I filed an initial PR for this today! [2]), and hope to add integration with other frameworks down the road.

[0] https://replay.io

[1] https://github.com/RecordReplay/replay-protocol-examples

[2] https://github.com/RecordReplay/devtools/pull/6601

mdaniel · 4y ago

Wowzers, that must make you an impressive attack target for all the session data that gets uploaded to your site. How do you deal with user consent in those cases?

I was curious to see what that experience was like from a client side, but it seems https://newaer.com/ is bombing the .min.js include, which of course doesn't turn on said session capture

BuiltWith alleges you use replay on replay.io but I didn't see any references to it on the main page, and app.replay.io is a white screen due to getInitialTabsState blowing up in src/ui/setup/index.ts

oldmanhorton · 4y ago

Before the whole project was killed, Node-ChakraCore had a time travel debugger that worked pretty well. I don't know how easy it would be to port the methods it used to a chrome extension (my guess is somewhere between difficult and impossible), but browser vendors could implement this natively.

chrismeller · 4y ago · 3 in thread

A neat idea for sure, I just wanted to point out that this is why I prefer XPath over CSS selectors.

We all know the display of the page and the structure of the page should be mutually exclusive, so why would you base your selectors on display? Particularly if you’re looking for something on a semantically designed page, why would I look for an .article, a class that may disappear with the next redesign, when they’re unlikely to stop using the article HTML tag?

goldenkey · 4y ago

CSS selectors don't have to select purely by classes. They can be something like:

div > div > * > *:nth-child(7)

XPath doesn't have any additional abilities, it's just verbose and difficult to write. It's a lemon.

tommica · 4y ago

I might be wrong, but xpath has contains, where you can look for a text content inside an element, which I don't think CSS can do

chrismeller · 4y ago

Well that is 100% originally an XPath selector (:nth-child), so kudos if CSS selectors support it now.

Still, using // instead of multiple *’s (and the two divs) still seems better for longer-term scraping.

scriptsmith · 4y ago · 2 in thread

In a similar vein, I have found success using request interception [1] for some websites where the HTML and API authentication scheme is unstable, but the API responses themselves are stable.

If you can drive the browser using simple operations like keyboard commands, you can get the underlying data reliably by listening for matching 'response' events and handling the data as it comes in.

[1] https://github.com/puppeteer/puppeteer/blob/main/docs/api.md...

RockRobotRock · 4y ago

For this use-case, selenium-wire for Python could be really useful.

tylergetsay · 4y ago

You can also inspect the application storage, monitor for cookie changes, etc using the dev tools protocol

mwcampbell · 4y ago · 2 in thread

> Developers no longer need to label their data with class-names or ids - it's only a courtesy to screen readers now.

In general, screen readers don't use class names or IDs. In principle they can, to enable site-specific workarounds for accessibility problems. But of course, that's as fragile as scraping. Perhaps you were thinking of semantic HTML tag names and ARIA roles.

ComputerGuru · 4y ago

Anything relying on id/class names has been broken since the advent of machine-generated names that come part and parcel with the most popular SPA frameworks. They’re all gobbledygook now, which makes writing custom ad block cosmetic filters a real PITA.

jchw · 4y ago

React doesn’t do that. You may still find gibberish on hostile sites like Twitter which intentionally obfuscate class names, using something like React Armor.

rvnx · 4y ago · 2 in thread

Nice, this won't work anymore then

benbristow · 4y ago

Exactly my thoughts - the author is using it 'in production' - speaking out loud to a forum where Facebook/Meta employees (and other Silicon Valley folk) are definitely observing is a rookie mistake

pabs3 · 4y ago

How would you prevent it from being possible?

marwis · 4y ago · 2 in thread

This sadly does not help if js code is minified/obfuscated and data is exchanged using some binary/binary-like protocol like grpc. Unfortunately this is increasingly common.

The only long term way is to parse visible text.

mdaniel · 4y ago

I've never seen grpc from a browser on a consumer-facing site; do you have an example I could see?

That said, for this approach something like grpc would be a benefit since AIUI grpc is designed to be versioned so one could identify structural changes in the payload fairly quickly versus the json-y way of "I dunno, are there suddenly new fields?"

marwis · 4y ago

Not aware of any actual grpc websites but given grpc-web has 6.5k stars on github something must be out there.

Google's websites frequently use binary-like formats where json is just an array of values with no properties, and most of these values are numbers. See for example Gmail.
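To illustrate why such payloads are harder to work with (the payload and field order below are invented for the example, not Gmail's actual format), the scraper has to supply every name from position alone:

```python
import json

# Hypothetical positional payload: no property names, mostly numbers.
RAW = '[[17, 1650000000, "Hello", 0, [3, 5]], [18, 1650000600, "Re: Hello", 1, []]]'

# Assumed field order, worked out by hand. It breaks silently if the
# site reorders or inserts fields.
FIELDS = ("id", "timestamp", "subject", "is_read", "label_ids")


def decode_rows(raw: str) -> list[dict]:
    """Attach the assumed names to each positional row."""
    return [dict(zip(FIELDS, row)) for row in json.loads(raw)]


rows = decode_rows(RAW)
```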

1vuio0pswjnm7 · 4y ago · 2 in thread

Why doesn't the example chosen, YouTube, use something like Cloudflare "anti-bot" protection or Google reCAPTCHA?

When I request a video page, I can see the JSON in the page, without the need for examining a heap snapshot.

quickthrower2 · 4y ago

Because cloudflare, recaptcha etc. mean this is not possible in general. You need to quack like a normal user for it to work. If a site is really against scraping they could probably make it completely uneconomical by tracking user footprints and detecting unexpected patterns of usage.

datalopers · 4y ago

They detect and block headless browsers just as easily.

mdaniel · 4y ago · 1 in thread

That's an exceedingly clever idea, thanks for sharing it!

Please consider adding an actual license text file to your repo, since (a) I don't think GitHub's licensee looks inside package.json (b) I bet most of the "license" properties of package.json files are "yeah, yeah, whatever" versus an intentional choice: https://github.com/adriancooney/puppeteer-heap-snapshot/blob... I'm not saying that applies to you, but an explicit license file in the repo would make your wishes clearer

adriancooney (OP) · 4y ago

Ah thank you for the reminder. Added it now!

dymk · 4y ago · 1 in thread

Would this method work if the website obfuscated its HTML as per the usual techniques, but also rendered everything server side?

adriancooney (OP) · 4y ago

If it’s rendered server-side - no. The data likely won’t be loaded into the JS heap (the DOM isn’t included in the heap snapshots) when you visit the page. You might be in luck if the website executes JavaScript to augment the server-side rendered page however. If it does, your data may be loaded into memory in a way you can extract it.

EastSmith · 4y ago · 1 in thread

Does anyone know if a Chrome browser extension has access to heap snapshots?

kevingadd · 4y ago

You'd want to use the debugging API, which is available either via websockets or via the chrome.debugger extension API. The latter will require a specific permission though, I think.
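For reference, a sketch of what that looks like at the protocol level, assuming the Chrome DevTools Protocol's `HeapProfiler.takeHeapSnapshot` command and `HeapProfiler.addHeapSnapshotChunk` event. The transport (websocket or `chrome.debugger`) is left out, and the incoming messages below are simulated rather than captured from a browser:

```python
import json


def snapshot_command(message_id: int) -> str:
    """JSON command asking the browser to stream a heap snapshot."""
    return json.dumps({
        "id": message_id,
        "method": "HeapProfiler.takeHeapSnapshot",
        "params": {"reportProgress": False},
    })


def assemble_snapshot(messages: list[str]) -> dict:
    """Join the chunk events into one document and parse it as JSON."""
    chunks = []
    for raw in messages:
        msg = json.loads(raw)
        if msg.get("method") == "HeapProfiler.addHeapSnapshotChunk":
            chunks.append(msg["params"]["chunk"])
    return json.loads("".join(chunks))


# Simulated events; a real session delivers many more (and larger) chunks.
incoming = [
    json.dumps({"method": "HeapProfiler.addHeapSnapshotChunk",
                "params": {"chunk": '{"strings": ["a"'}}),
    json.dumps({"method": "HeapProfiler.reportHeapSnapshotProgress",
                "params": {"done": 1, "total": 1}}),
    json.dumps({"method": "HeapProfiler.addHeapSnapshotChunk",
                "params": {"chunk": ', "b"]}'}}),
]
snapshot = assemble_snapshot(incoming)
```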

trinovantes · 4y ago

If this catches on, web developers may start employing memory obfuscation techniques like game developers

https://technology.riotgames.com/news/riots-approach-anti-ch...

elbajo · 4y ago

Love this approach, thanks for sharing!

I am trying this on a website that Puppeteer has trouble loading, so I got a heap snapshot directly in Chrome. I was trying to search for relevant objects directly in the Chrome heap viewer but I don't think the search looks inside objects.

I think your tool would work: "puppeteer-heap-snapshot query -f /tmp/file.heapsnapshot -p property1" or really any JSON parser but it requires extra steps. Would you say this is the easiest way to view/debug a heap snapshot?
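A rough idea of the "any JSON parser" route, as a sketch only: it assumes the V8 `.heapsnapshot` layout (flat `nodes` and `edges` arrays described by `snapshot.meta`, plus a `strings` table) and is not how puppeteer-heap-snapshot itself is implemented. The function name and the tiny hand-built snapshot are made up for the example.

```python
def find_objects_with_properties(snapshot: dict, wanted: set) -> list:
    """Return {property: value-node name} for nodes owning all wanted properties."""
    meta = snapshot["snapshot"]["meta"]
    node_fields, edge_fields = meta["node_fields"], meta["edge_fields"]
    strings, nodes, edges = snapshot["strings"], snapshot["nodes"], snapshot["edges"]

    n_size, e_size = len(node_fields), len(edge_fields)
    n_name = node_fields.index("name")
    n_edge_count = node_fields.index("edge_count")
    e_type = edge_fields.index("type")
    e_name = edge_fields.index("name_or_index")
    e_to = edge_fields.index("to_node")
    prop_type = meta["edge_types"][0].index("property")

    matches, edge_pos = [], 0
    for node_pos in range(0, len(nodes), n_size):
        props = {}
        # A node's edges are stored consecutively, in node order.
        for _ in range(nodes[node_pos + n_edge_count]):
            if edges[edge_pos + e_type] == prop_type:
                key = strings[edges[edge_pos + e_name]]
                target = edges[edge_pos + e_to]  # offset into the nodes array
                # For string nodes the node name is the string value itself.
                props[key] = strings[nodes[target + n_name]]
            edge_pos += e_size
        if set(wanted).issubset(props):
            matches.append(props)
    return matches


# Tiny hand-built snapshot: one object with two string-valued properties.
# For a real file: snapshot = json.load(open("/tmp/file.heapsnapshot"))
example = {
    "snapshot": {"meta": {
        "node_fields": ["type", "name", "id", "self_size", "edge_count"],
        "edge_fields": ["type", "name_or_index", "to_node"],
        "edge_types": [["element", "property"]],
    }},
    "strings": ["Object", "title", "Hello", "views", "42"],
    "nodes": [0, 0, 1, 0, 2,   1, 2, 2, 0, 0,   1, 4, 3, 0, 0],
    "edges": [1, 1, 5,   1, 3, 10],
}
found = find_objects_with_properties(example, {"title", "views"})
```

Real snapshots are large, and property values that are objects or numbers need more handling than this; the sketch only reads string-valued properties.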

marmada · 4y ago

Wow this is brilliant. I've sometimes tried to reverse engineer APIs in the past, but this is definitely the next level.

I used to think ML models could be good for scraping too, but this seems better.

I think this + a network request interception tool (to get data that is embedded into HTML) could be the future.

kvathupo · 4y ago

The article brings up two interesting points for web preservation:

1. The reliance on externally hosted APIs

2. Source code obfuscation

For 1, in order to fully preserve a webpage, you'd have to go down the rabbit hole of externally hosted APIs, and preserve those as well. For example, sometimes a webpage won't render latex notation since a MathJax endpoint can't be connected to. Were we to save this webpage, we would need a copy of MathJax JS too.

For 2, I think WASM makes things more interesting. With Web Assembly, I'd imagine it's much easier to obfuscate source code: a preservationist would need a WASM decompiler for whatever source language was used.

invalidname · 4y ago

Scraping is inherently fragile due to all the small changes that can happen to the data model as a website evolves. The important thing is to fix these things quickly. This article discusses a related approach of debugging such failures directly on the server: https://talktotheduck.dev/debugging-jsoup-java-code-in-produ...

It's in Java (using JSoup) but the approach will work for Node, Python, Kotlin etc. The core concept is to discover the cause of the regression instantly on the server and deploy a fix fast. There are also user specific regressions in scraping that are again very hard to debug.

leloctai · 4y ago

This isn't future proof at all. Game devs have been using automatic memory obfuscation since forever. If this becomes popular, it will take no more than adding a webpack plugin to defeat, no data structure changes required.

kccqzy · 4y ago

Very interesting! I have a feeling that this will break if people use the advanced mode of the Closure compiler. It's able to optimize away object attribute names. Is this not something commonly done anymore?

flockonus · 4y ago

Awesome experimentation! I'd be curious how you navigate the heap dump in some real website examples.

lemax · 4y ago

I've used a similar technique on some web pages that get returned from the server with an intact redux state object just sitting in a <script> tag. Instead of parsing the HTML, I just pull out the state object. Super
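A minimal sketch of that extraction, assuming the state is assigned as a valid JSON literal; the variable name `__INITIAL_STATE__` and the sample page are assumptions for illustration:

```python
import json
import re

# Hypothetical page; real sites use their own variable names.
HTML = """
<html><body>
<script>window.__INITIAL_STATE__ = {"video": {"title": "Demo", "views": 42}};</script>
</body></html>
"""


def extract_state(html: str, variable: str = "__INITIAL_STATE__") -> dict:
    """Return the object assigned to `variable`, parsed as JSON."""
    match = re.search(re.escape(variable) + r"\s*=\s*", html)
    if match is None:
        raise ValueError(f"{variable} not found")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the trailing ';' and closing tags).
    state, _ = json.JSONDecoder().raw_decode(html, match.end())
    return state


state = extract_state(HTML)
```

This fails if the literal is not strict JSON (single quotes, trailing commas, `undefined`), in which case a JavaScript parser is needed instead.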

BenGosub · 4y ago

Is he scraping the heap because the data wasn't present in the HTML, or is he doing it because the API response, present in the heap, changes less often than the HTML?

pabs3 · 4y ago

Seems easy to defeat by deleting objects after generating the HTML or DOM nodes? Although I suppose taking heap snapshots before the deletions would avoid that.

radicality · 4y ago

Depending on how exactly the page is loading data, it might be easier to use something like mitmproxy and observe the data flow and intercept there.

Jiger104 · 4y ago

Really cool approach, great work
