0 points · samtc · 8y ago · 0 comments

We monitor exceptions with Sentry. We store the raw data, so we don't have to hurry to fix the ETL when something breaks: we only have to fix the navigation logic, and we keep crawling.

2 comments · 1 top-level

Launchr · 8y ago · 1 in thread

Sorry if it's a stupid question/example/comparison, just trying to understand better: you're storing the full HTML instead of reaching into the specific divs for the data you might need, and that way separating the fetching from the parsing?

I'm a scraping rookie, and I usually fetch + parse in the same call, so this might resolve some issues for me :) Thanks!

jimsmart · 8y ago

When I've done scraping, I've always taken this approach also: I decouple my process into paired fetch-to-local-cache-folder and process-cached-files stages.

I find this useful for several reasons, but particularly if you want to recrawl the same site for new/updated content, or if you decide to grab extra data from the pages (or, indeed, if your original parsing goes wrong or meets pages it wasn't designed for).
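For anyone who wants to see the shape of that split, here is a minimal sketch in Python using only the standard library. All of the names (`cache_path`, `fetch_to_cache`, `process_cached`, the hashed file naming) are hypothetical and chosen for illustration; this is not the code anyone in this thread actually runs.

```python
from __future__ import annotations

import hashlib
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def cache_path(cache_dir: Path, url: str) -> Path:
    """Map a URL to a stable local file name (hypothetical naming scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.html"


def http_get(url: str) -> bytes:
    """Default fetcher: a plain GET with the standard library."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def fetch_to_cache(
    url: str,
    cache_dir: Path,
    fetcher: Callable[[str], bytes] = http_get,
) -> Path:
    """Stage 1: download the raw page once and store it untouched."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_path(cache_dir, url)
    if not path.exists():  # already cached, so no second request is made
        path.write_bytes(fetcher(url))
    return path


def process_cached(
    cache_dir: Path,
    parse: Callable[[bytes], dict],
) -> list[dict]:
    """Stage 2: run a parser over every cached file, with no network access."""
    return [parse(p.read_bytes()) for p in sorted(cache_dir.glob("*.html"))]
```

Because stage 2 only reads from disk, a parser that turns out to be wrong, or that needs to pull an extra field, can be rerun over the cache without requesting a single page again.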

Related: As well as any pages I cache, I generally also have each stage output a CSV (requested url, local file name, status, any other relevant data or metadata), which can be used to drive later stages, or may contain the final output data.
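A per-stage CSV like that could be sketched as follows with Python's `csv` module. The column names and the helper functions `write_manifest` and `files_to_process` are assumptions for illustration, as is the choice to treat only status 200 as worth parsing.

```python
from __future__ import annotations

import csv
from pathlib import Path

# Illustrative columns: requested URL, local file name, HTTP status.
FIELDS = ["url", "local_file", "status"]


def write_manifest(manifest: Path, rows: list[dict]) -> None:
    """Record what a stage did, one row per requested URL."""
    with manifest.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def files_to_process(manifest: Path) -> list[str]:
    """Drive the next stage: keep only pages that were fetched successfully."""
    with manifest.open(newline="", encoding="utf-8") as handle:
        # csv reads every value back as a string, hence the "200" comparison.
        return [
            row["local_file"]
            for row in csv.DictReader(handle)
            if row["status"] == "200"
        ]
```

The same file doubles as a log of failed requests, so a later run can retry just the rows whose status was not 200.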

Requesting all of the pages is the biggest time sink when scraping — it's good to avoid having to do any portion of that again, if possible.
