I personally use a combination of XPath, basic math and regex, so this class/id security solution isn't a major deterrent. A couple of times I did find it a hassle to scrape data embedded in iframes, and I can see that heap snapshots treat iframes differently.
Also, if a website takes the extra steps to block web scrapers, identification of elements is never the main problem. It is always IP bans and other security measures.
After all that, I do look forward to using something like this and switching to a Node.js-based solution soon. But if you are trying web scraping at scale, reverse engineering should always be your first choice. Not only does it give you a faster solution, it is also more ethical (IMO), as you are minimizing your impact on the site's resources. Rendering full website resources is always my last choice.
I find my time is by far the most limited resource. I am usually scraping huge corporations at scale, and I either don't care or doubt that I will impact their resources. If they would open their APIs I would use those.
That being said, I often end up reverse engineering to preserve my own resources. I can and do run thousands of instances of Chrome, but it isn't cheap.
Also, related to IPs, carrier grade NAT has been a blessing ;)
Are you using something you have built or a service?
How do you deal with pages that use JS to load their content (e.g. SERPs) and restrict those endpoints to be called from within that page?
I'm lucky if I can use cheerio to just traverse the DOM on a given page, but increasingly I have to render the page, and that "scales" as well, at least in terms of maintainability, since I can more or less use the same API to traverse the (then JS-modified) DOM.
It's my understanding that Playwright is the "new Puppeteer" (even with core devs migrating). I presume this sort of technique would be feasible on Playwright too? Do you think there's any advantage or disadvantage of using one over the other for this use case, or it's basically the same (or I'm off base and they're not so interchangeable)?
I'm basing my personal "scraping toolbox" on Scrapy, which I think has decent Playwright integration, hence the question, in case I try to reproduce this strategy in Playwright.
That means that if you are running against Chromium, this will likely work, but unless Firefox has a similar heapdump function, it is unlikely to work[1]. And almost certainly not Safari, based on my experience. All of that is also qualified by whether Playwright exposes that behavior, or otherwise allows one to "get into the weeds" to invoke the function under the hood.
[1] As an update: I checked, and Firefox does have a memory snapshot feature, but the file it saves is some kind of binary-encoded thing without any obvious strings in it.
I didn't see any such thing in Safari.
https://github.com/firefox-devtools/profiler, which lets you save a report in json.gz format.
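For anyone who wants to poke at a json.gz report outside the browser, here is a minimal Python sketch. It only assumes the file is gzip-compressed JSON, as the extension suggests; the function name and the keys in the sample payload are made up for illustration and say nothing about the real report layout.

```python
import gzip
import json

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def load_json_gz(data: bytes):
    """Decompress gzip bytes and parse the result as JSON."""
    if data[:2] != GZIP_MAGIC:
        raise ValueError("not gzip data")
    return json.loads(gzip.decompress(data).decode("utf-8"))


# Build a stand-in report in memory; the keys are hypothetical.
sample = gzip.compress(json.dumps({"meta": {"version": 1}, "threads": []}).encode("utf-8"))
report = load_json_gz(sample)
print(sorted(report))
```

For a file on disk, `load_json_gz(open(path, "rb").read())` would be the equivalent call. The magic-byte check is also a quick way to tell whether an opaque-looking dump is simply gzipped.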
https://github.com/microsoft/playwright/issues/4862
And there's a bit of detail about how they're different here:
https://stackoverflow.com/q/50939116/142780
However, that's more of a detail and doesn't really undermine your point about Firefox / Safari being handled differently; it's just that Playwright implemented its own versions of the protocol for those two non-Chromium-based browsers.
Those .fxsnapshot files are gzipped binary heaps. There is a 3rd-party decoder for them:
> We use the --no-headless argument to boot a windowed Chrome instance (i.e. not headless) because Google can detect and thwart headless Chrome - but that's a story for another time.
Use `puppeteer-extra-plugin-stealth` (1) for such sites. It defeats a lot of bot identification, including reCAPTCHA v3.
(1) https://www.npmjs.com/package/puppeteer-extra-plugin-stealth
So, no heap analysis per se, but you can definitely inspect the variables and stack from anywhere in the recording.
Right now our debugging client is just scratching the surface of the info we have available from our backend. We recently put together a couple small examples that use the Replay backend API to extract data from recordings and do other analysis, like generating code coverage reports and introspecting React's internals to determine whether a given component was mounting or re-rendering.
Given that capability, we hope to add the ability to do "React component stack" debugging in the not-too-distant future, such as a button that would let you "Step Back to Parent Component". We're also working on adding Redux DevTools integration now (like, I filed an initial PR for this today! [2]), and hope to add integration with other frameworks down the road.
[1] https://github.com/RecordReplay/replay-protocol-examples
I was curious to see what that experience was like from a client side, but it seems https://newaer.com/ is bombing the .min.js include, which of course doesn't turn on said session capture
BuiltWith alleges you use replay on replay.io but I didn't see any references to it on the main page, and app.replay.io is a white screen due to getInitialTabsState blowing up in src/ui/setup/index.ts
We all know the display of a page and its structure should be independent of each other, so why would you base your selectors on display? Particularly if you’re looking for something on a semantically designed page, why would you look for .article, a class that may disappear with the next redesign, when they’re unlikely to stop using the article HTML tag?
div > div > * > *:nth-child(7)
XPath doesn't have any additional abilities, it's just verbose and difficult to write. It's a lemon.
Still, using // instead of multiple *’s (and the two divs) seems better for longer-term scraping.
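To make the `//` point concrete, here is a rough Python sketch using only the standard library. The two HTML snippets are invented, and `xml.etree.ElementTree` only supports a small XPath subset and needs well-formed markup, so treat it as a stand-in for whatever parser you actually use on real pages.

```python
import xml.etree.ElementTree as ET

# Invented "before" and "after" versions of the same page.
OLD_PAGE = (
    "<html><body><div><div>menu</div>"
    "<div><article>Story</article></div></div></body></html>"
)
NEW_PAGE = "<html><body><main><article>Story</article></main></body></html>"

# Tied to the exact nesting of the old layout.
POSITIONAL = "./body/div/div[2]/article"
# Only cares that an <article> exists somewhere in the tree.
SEMANTIC = ".//article"


def first_text(document: str, path: str):
    """Return the text of the first element matching `path`, or None."""
    node = ET.fromstring(document).find(path)
    return node.text if node is not None else None


for name, page in (("old", OLD_PAGE), ("new", NEW_PAGE)):
    print(name, first_text(page, POSITIONAL), first_text(page, SEMANTIC))
```

The positional path finds the story only in the old layout, while the descendant search keeps working after the redesign, as long as the site keeps the article tag.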
If you can drive the browser using simple operations like keyboard commands, you can get the underlying data reliably by listening for matching 'response' events and handling the data as it comes in.
[1] https://github.com/puppeteer/puppeteer/blob/main/docs/api.md...
In general, screen readers don't use class names or IDs. In principle they can, to enable site-specific workarounds for accessibility problems. But of course, that's as fragile as scraping. Perhaps you were thinking of semantic HTML tag names and ARIA roles.
The only long term way is to parse visible text.
That said, for this approach something like gRPC would be a benefit, since AIUI gRPC is designed to be versioned, so one could identify structural changes in the payload fairly quickly versus the JSON-y way of "I dunno, are there suddenly new fields?"
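Without a versioned schema, the "are there suddenly new fields?" check has to be done by hand. A minimal Python sketch of one way to do it, assuming you keep a known-good payload around to compare against; the function names and sample payloads are hypothetical.

```python
def key_paths(value, prefix=""):
    """Collect a dotted path for every dict key in a JSON-like value."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= key_paths(child, path)
    elif isinstance(value, list):
        # All list items share one path segment, so item order is ignored.
        for child in value:
            paths |= key_paths(child, prefix + "[]")
    return paths


def diff_structure(known_paths, payload):
    """Report key paths that appeared or disappeared relative to `known_paths`."""
    seen = key_paths(payload)
    return {
        "added": sorted(seen - known_paths),
        "removed": sorted(known_paths - seen),
    }


baseline = key_paths({"items": [{"id": 1, "name": "a"}]})
latest = {"items": [{"id": 2, "title": "b"}], "cursor": "x"}
print(diff_structure(baseline, latest))
```

This only compares key names, not value types, so a field that silently changes from a number to a string would still slip through.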
Google's websites frequently use binary-like formats where the JSON is just an array of values with no property names, and most of these values are numbers. See for example Gmail.
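With payloads like that, the scraper has to carry the field names itself. A small Python sketch of the idea; the field order and the sample rows are entirely made up and do not reflect any real Google payload.

```python
import json

# Hypothetical field order, worked out by comparing payloads with what the page shows.
MESSAGE_FIELDS = ("id", "timestamp", "unread", "subject")


def label_row(row, fields=MESSAGE_FIELDS):
    """Turn one positional array into a dict keyed by field name."""
    if len(row) < len(fields):
        raise ValueError(f"expected at least {len(fields)} values, got {len(row)}")
    return dict(zip(fields, row))


raw = '[[101, 1650000000, 1, "Hello"], [102, 1650000100, 0, "Re: Hello"]]'
messages = [label_row(row) for row in json.loads(raw)]
print(messages[0]["subject"])
```

The length check is the only guard here, so a reordered payload of the same length would be mislabeled without any error, which is a large part of why these formats are painful to scrape.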
When I request a video page, I can see the JSON in the page, without the need for examining a heap snapshot.
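When the JSON is sitting in the page source like that, something along these lines is usually enough. This is a Python sketch under the assumption that the data is assigned to a variable in an inline script; the variable name `pageData` and the sample HTML are hypothetical.

```python
import json
import re


def extract_assigned_json(html: str, var_name: str):
    """Return the JSON value assigned to `var_name` in the page source, or None."""
    match = re.search(r"\b" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever follows it (the trailing ";" and the rest of the page).
        value, _end = json.JSONDecoder().raw_decode(html, match.end())
    except ValueError:
        return None
    return value


page = '<html><script>var pageData = {"title": "Example", "views": 42};</script></html>'
print(extract_assigned_json(page, "pageData"))
```

Using `raw_decode` avoids writing a regex that has to guess where the object ends, which tends to break on nested braces.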
Please consider adding an actual license text file to your repo, since (a) I don't think GitHub's licensee looks inside package.json, and (b) I bet most of the "license" properties of package.json files are "yeah, yeah, whatever" versus an intentional choice: https://github.com/adriancooney/puppeteer-heap-snapshot/blob... I'm not saying that applies to you, but an explicit license file in the repo would make your wishes clearer.
https://technology.riotgames.com/news/riots-approach-anti-ch...
I am trying this on a website that Puppeteer has trouble loading, so I took a heap snapshot directly in Chrome. I was trying to search for relevant objects directly in the Chrome heap viewer, but I don't think the search looks inside objects.
I think your tool would work ("puppeteer-heap-snapshot query -f /tmp/file.heapsnapshot -p property1"), or really any JSON parser, but that requires extra steps. Would you say this is the easiest way to view/debug a heap snapshot?
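On the "any JSON parser" route: a saved .heapsnapshot file is one large JSON document with a top-level `strings` array, so a crude search is possible with a few lines of Python. This sketch only scans that string table; it does not reconstruct objects from the node and edge graph, and the function names and sample data are made up.

```python
import json


def search_string_table(snapshot: dict, needle: str) -> list:
    """Return every string-table entry that contains `needle`."""
    return [s for s in snapshot.get("strings", []) if needle in s]


def json_objects_with_property(snapshot: dict, prop: str) -> list:
    """Return string-table entries that parse as JSON objects having key `prop`."""
    found = []
    for entry in snapshot.get("strings", []):
        if prop not in entry:
            continue
        try:
            value = json.loads(entry)
        except ValueError:
            continue
        if isinstance(value, dict) and prop in value:
            found.append(value)
    return found


# Tiny stand-in for a parsed snapshot; a real one also has "nodes", "edges", etc.
sample = {
    "snapshot": {"meta": {}},
    "strings": ["property1", '{"property1": "value", "other": 1}', "unrelated"],
}
print(search_string_table(sample, "property1"))
print(json_objects_with_property(sample, "property1"))
```

For a real file you would first do `snapshot = json.load(open("/tmp/file.heapsnapshot"))`. This only finds data that happens to exist as a serialized JSON string in memory; objects that live only as heap nodes need a tool that walks the graph.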
I used to think ML models could be good for scraping too, but this seems better.
I think this + a network request interception tool (to get data that is embedded into HTML) could be the future.
1. The reliance on externally hosted APIs
2. Source code obfuscation
For 1, in order to fully preserve a webpage, you'd have to go down the rabbit hole of externally hosted APIs, and preserve those as well. For example, sometimes a webpage won't render LaTeX notation because a MathJax endpoint can't be reached. Were we to save this webpage, we would need a copy of the MathJax JS too.
For 2, I think WASM makes things more interesting. With WebAssembly, I'd imagine it's much easier to obfuscate source code: a preservationist would need a WASM decompiler for whatever source language was used.
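For point 1, a first step is simply listing which scripts a page pulls from other hosts, so you know what else needs saving. A minimal Python sketch with the standard library; the URLs and HTML are invented, and it only looks at script tags, not stylesheets, fonts or API calls made at runtime.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ScriptCollector(HTMLParser):
    """Collect the src attribute of every <script> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def external_scripts(html: str, page_url: str) -> list:
    """Return absolute script URLs served from a different host than the page."""
    collector = ScriptCollector()
    collector.feed(html)
    page_host = urlparse(page_url).netloc
    absolute = [urljoin(page_url, src) for src in collector.sources]
    return [url for url in absolute if urlparse(url).netloc != page_host]


# Hypothetical page that loads one local script and one from a CDN.
page = (
    '<html><head><script src="/static/app.js"></script>'
    '<script src="https://cdn.example.net/mathjax/tex-mml-chtml.js"></script>'
    "</head><body><p>$E = mc^2$</p></body></html>"
)
print(external_scripts(page, "https://example.org/post"))
```

Anything this returns would have to be archived alongside the page, and each of those scripts may in turn load more resources, which is the rabbit hole mentioned above.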
It's in Java (using JSoup) but the approach will work for Node, Python, Kotlin etc. The core concept is to discover the cause of the regression instantly on the server and deploy a fix fast. There are also user specific regressions in scraping that are again very hard to debug.