Perhaps two sets: one of just a few hundred kilobytes, containing a few sample .arc files to test against the format, and a larger 'training' set that's small enough to work with offline (maybe around 100MB?) but large enough to contain a good sample of the possible content.
Concur with this comment. It might also help the community provide feedback on the structure and on ways to segment the data, so that there are more directed efforts to consume small parts of the crawl for processing.