Ask HN: Best way to keep the raw HTML of scraped pages? | Better HN

13 comments · 6 top-level

PaulHoule · 3y ago · 4 in thread

Content addressable storage. Generate names with SHA-3, split off bits of the names into directories like

   name[0:2]/name[0:4]/name[0:6]/name

to keep any of the directories from getting too big (even if the filesystem can handle huge directories, various tools you use with it might not). Keep a list of where the files came from and other metadata in a database so you can find things.
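A minimal sketch of the layout described above, assuming SHA3-256 hex digests as names and a local directory as the store. The function names `cas_path` and `store` are made up for illustration, and the metadata database is left out.

```python
import hashlib
from pathlib import Path


def cas_path(root: Path, digest: str) -> Path:
    """Shard by successively longer prefixes: name[0:2]/name[0:4]/name[0:6]/name."""
    return root / digest[0:2] / digest[0:4] / digest[0:6] / digest


def store(root: Path, html: bytes) -> str:
    """Write html under its SHA-3 digest and return the digest (hypothetical helper)."""
    digest = hashlib.sha3_256(html).hexdigest()
    path = cas_path(root, digest)
    if not path.exists():  # identical content is only written once
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(html)
    return digest
```

The returned digest is what you would record next to the URL and fetch time in the database, so a page can be found again without walking the directory tree.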

marginalia_nu · 3y ago

You're going to incur insane overhead doing this, since each file's actual size on disk is a multiple of the filesystem's block size, which is of a similar order of magnitude to a compressed HTML file.
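To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a 4096-byte block size and no sub-allocation or inline storage, neither of which holds on every filesystem; `allocated_size` and `overhead` are illustrative names.

```python
def allocated_size(file_size: int, block_size: int = 4096) -> int:
    """Bytes occupied on disk if every file is rounded up to whole blocks."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


def overhead(file_size: int, block_size: int = 4096) -> float:
    """Wasted space as a fraction of the file's real size."""
    return allocated_size(file_size, block_size) / file_size - 1
```

Under those assumptions a 5,000-byte compressed page occupies 8,192 bytes (about 64% extra), while a 40,000-byte page occupies 40,960 bytes (about 2% extra), so how much this matters depends heavily on typical page size.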

That depends on the file system. It may do block sub allocation (https://en.wikipedia.org/wiki/Block_suballocation)

(Blocks also need not be large, but in this context that’s a theoretical issue)

PaulHoule · 3y ago

Granted. It's fair to stuff them into SQLite or some other kind of composite file, but you trade one problem for another. The same content-addressable indexing still makes sense.

toast0 · 3y ago

It would be nice if compressed HTML was on a similar order to filesystem block sizes, but most pages are much larger than 4k bytes even compressed. May depend on the specific site that's being scraped.

compressedgas · 3y ago · 2 in thread

WARC.

hbcondo714 · 3y ago

The WARC (Web ARChive) format specifies a method for combining multiple digital resources into an aggregate archival file together with related information:

https://www.loc.gov/preservation/digital/formats/fdd/fdd0002...

I thought that mitmproxy did this, but cursory searches didn't show anything; that said, their actual format[1] has even more fidelity (I'd guess it's comparable to wireshark)

One should be aware that WARC is great for preservation, but getting content back out of it requires specialized tooling à la https://github.com/alard/warc-proxy

1: https://github.com/mitmproxy/mitmproxy/blob/9.0.1/mitmproxy/...

placidpanda · 3y ago · 1 in thread

When doing this in the past, I settled on an sqlite database with one table that stores the compressed html (gzip or lzma) along with other columns (id/date/url/domain/status/etc.)

Also made it easy to alert on when something broke (query the table for count(*) where status=error) and rerun the parser for failures.
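A rough sketch of what such a table could look like, assuming gzip compression and Python's built-in `sqlite3` module. The table name, column names, and helper functions are guesses for illustration, not the schema actually used.

```python
import gzip
import sqlite3
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical schema: one row per fetch, HTML stored as a gzip-compressed BLOB.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    id      INTEGER PRIMARY KEY,
    fetched TEXT NOT NULL,
    url     TEXT NOT NULL,
    domain  TEXT NOT NULL,
    status  TEXT NOT NULL,
    html    BLOB
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_page(conn: sqlite3.Connection, url: str, status: str,
              html: Optional[bytes]) -> int:
    """Insert one fetch result; html may be None when the fetch failed."""
    blob = gzip.compress(html) if html is not None else None
    cur = conn.execute(
        "INSERT INTO pages (fetched, url, domain, status, html) VALUES (?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), url,
         urlsplit(url).hostname or "", status, blob),
    )
    conn.commit()
    return cur.lastrowid


def load_page(conn: sqlite3.Connection, page_id: int) -> bytes:
    (blob,) = conn.execute(
        "SELECT html FROM pages WHERE id = ?", (page_id,)).fetchone()
    return gzip.decompress(blob)


def error_count(conn: sqlite3.Connection) -> int:
    """The alerting query mentioned above: count(*) where status = 'error'."""
    return conn.execute(
        "SELECT count(*) FROM pages WHERE status = 'error'").fetchone()[0]
```

Rerunning the parser for failures then amounts to selecting the rows with `status = 'error'` and feeding their URLs back into the fetcher.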

crazygringo · 3y ago

Yup. A database gives you all the performance AND flexibility you need. MySQL or PostgreSQL will work well too.

Storing pages as files is a no-go because it wastes way too much disk space due to block sizes, while more customized cache tools will never be as flexible or have as much tooling as a widely supported relational database.

For even better compression, use a preset dictionary as well, tuned to a wide sample of HTML, but it doesn't sound like you need to go that far.
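For reference, a small sketch of the preset-dictionary idea using zlib's `zdict` parameter. This assumes DEFLATE is the compressor; the comment does not say which one is meant. The dictionary below is a tiny hand-written placeholder, whereas a real one would be built from strings that recur across a wide sample of pages.

```python
import zlib

# Placeholder dictionary (an assumption for illustration): byte strings likely
# to appear in many pages. zlib favors matches near the end of the dictionary.
HTML_DICT = (
    b'<!DOCTYPE html><html><head><meta charset="utf-8">'
    b'<title></title></head><body><div class="'
)


def compress(html: bytes, zdict: bytes = HTML_DICT) -> bytes:
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(html) + comp.flush()


def decompress(blob: bytes, zdict: bytes = HTML_DICT) -> bytes:
    # The exact same dictionary must be supplied to decompress.
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()
```

The catch is that the dictionary becomes part of the storage format: lose or change it and the stored blobs can no longer be decompressed, so it should be versioned alongside the data.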

If you weren't already aware, Scrapy has strong support for this via their HTTPCache middleware; you can choose whether to have it actually behave like a cache, returning already-scraped content if matched, or merely act as a pass-through cache: https://docs.scrapy.org/en/2.7/topics/downloader-middleware....

Their OOtB storage does what the sibling comment says about sha1-ing the request and then sharding the output filename by the first two characters: https://github.com/scrapy/scrapy/blob/2.7.1/scrapy/extension...
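As a hedged illustration, these are the kinds of entries that would go in a Scrapy project's `settings.py` to turn the cache on with the filesystem backend. The values are example choices, not recommendations, and the linked docs remain the authority on what each setting does in a given version.

```python
# settings.py fragment (example values, adjust for your project)
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"          # relative paths live under the project's .scrapy dir
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"  # cache every response
HTTPCACHE_EXPIRATION_SECS = 0        # 0 means cached responses never expire
HTTPCACHE_GZIP = True                # gzip the stored data to save disk space
```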

I'd just apply an intelligent file naming strategy based on timestamps and URLs. Keep in mind that a folder should not contain more than 1000 files or other folders, otherwise it's slow to list.

nf-x · 3y ago

Did you try using some of the cheap cloud storage, like AWS S3?
