
0 points · showerst · 14y ago · 0 comments

Perhaps two sets: one of just a few hundred kilobytes, containing a few sample .arc files to test against the format, and one larger 'training' set that's small enough to test against offline (maybe around 100MB?) but large enough to contain a good sample of the possible content.
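
One way such a split could be put together is sketched below. This is only an illustration under stated assumptions, not anything the crawl project provides: it assumes the .arc files already sit in a local directory, and the names `list_arc_files` and `pick_sample` are hypothetical helpers made up for this example. It picks a random subset of files whose total size stays under a byte budget.

```python
import random
from pathlib import Path


def list_arc_files(directory):
    """Return (path, size_in_bytes) for each .arc file under a local directory.

    Assumption: files are stored uncompressed with a plain .arc suffix.
    """
    return [(str(p), p.stat().st_size) for p in Path(directory).rglob("*.arc")]


def pick_sample(files, budget_bytes, seed=0):
    """Pick a random subset of (name, size) pairs that fits within budget_bytes."""
    rng = random.Random(seed)  # fixed seed so the same sample can be rebuilt
    shuffled = list(files)
    rng.shuffle(shuffled)

    chosen, total = [], 0
    for name, size in shuffled:
        # Skip any file that would push the sample over the budget.
        if total + size <= budget_bytes:
            chosen.append((name, size))
            total += size
    return chosen, total
```

With that in place, the small format-test set might be `pick_sample(files, 300 * 1024)` and the larger offline set `pick_sample(files, 100 * 1024 * 1024)`, where `files` comes from `list_arc_files` on whatever directory holds the data. Random selection is the simplest choice here; it does not guarantee that every kind of content is represented.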


1 comment · 1 top-level

dcnstrct · 14y ago

Concur with this comment. It might also help the community give feedback on the structure of the data and on ways to segment it, so that there are more directed efforts to consume small parts of the crawl for processing.
