0 points · alpaca128 · 1y ago · 0 comments

No need for a data dump, just list the URLs (or whatever else identifies the sources) of their training data. AFAIK that's how the LAION training dataset was published.

3 comments · 1 top-level

anonymoushn · 1y ago · 2 in thread

Providing a large list of bitrotted URLs, plus titles of books that the user would have to OCR themselves before attempting to reproduce the model, doesn't seem very useful.

echoangle · 1y ago

Aren't the datasets mostly shared in torrents? They probably won't bitrot for some time.

Wowfunhappy · 1y ago

...no? They also use web crawlers.
