Extracting Subset of Common Crawl Data on Laptop
(avilpage.com)
1 points
chillaranand
3y ago
1 comments
chillaranand
OP
3y ago
Each monthly Common Crawl dataset is ~100 TB. For some use cases, we don't need the entire dataset; a subset of the data is enough.
In this post, let's see how we can extract a subset of the data right from a laptop.
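
To make the idea concrete, here is a minimal sketch of one way to pull a subset without downloading a whole crawl. It is not necessarily the method used in the linked post. It assumes the public Common Crawl URL index at index.commoncrawl.org and the data host data.commoncrawl.org; the crawl ID `CC-MAIN-2023-14`, the URL pattern, and all function names are illustrative choices. Each index record gives a WARC filename plus a byte offset and length, so an HTTP Range request fetches just that one record.

```python
import gzip
import json
import urllib.parse
import urllib.request

INDEX_SERVER = "https://index.commoncrawl.org"
DATA_SERVER = "https://data.commoncrawl.org"


def index_query_url(crawl_id, url_pattern):
    """Build a URL index query for one crawl (crawl_id is an example value)."""
    query = urllib.parse.urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_SERVER}/{crawl_id}-index?{query}"


def byte_range_header(record):
    """Turn an index record's offset and length into an HTTP Range value."""
    offset = int(record["offset"])
    length = int(record["length"])
    # Range end positions are inclusive, hence the -1.
    return f"bytes={offset}-{offset + length - 1}"


def fetch_record(record):
    """Download and decompress a single WARC record, not the whole file."""
    request = urllib.request.Request(
        f"{DATA_SERVER}/{record['filename']}",
        headers={"Range": byte_range_header(record)},
    )
    with urllib.request.urlopen(request) as response:
        # Each record is stored as its own gzip member.
        return gzip.decompress(response.read())


def fetch_subset(crawl_id, url_pattern, limit=5):
    """Look up matching captures in the index and fetch the first few."""
    with urllib.request.urlopen(index_query_url(crawl_id, url_pattern)) as response:
        lines = response.read().decode("utf-8").splitlines()
    records = [json.loads(line) for line in lines[:limit]]
    return [fetch_record(record) for record in records]
```

Calling something like `fetch_subset("CC-MAIN-2023-14", "example.com/*")` would transfer only the bytes of the matching records, which is what keeps this workable on a laptop connection.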