Better HN

0 points · phainopepla2 · 11d ago · 0 comments

Have any major open weight models been "open data"? Wouldn't that entail distributing vast amounts of copyrighted data?

10 comments · 3 top-level

jubilanti · 11d ago · 6 in thread

Olmo from AllenAI has been releasing their full pipelines, including data [1]. A lot of it is just repackaged and resampled versions of copyrighted data that has long been publicly available as dumps: Common Crawl, arXiv, Wikipedia, StackExchange, Reddit, all of which are presumably copyrighted under different licenses. Go on Hugging Face and you can find massive multi-TB data dumps used for pre-training.

It is just as legal as when Uber and Airbnb were running illegal taxis and hotels during their growth phase. I'm just waiting for some corporate IP law firm to learn about Hugging Face.

[1] https://huggingface.co/datasets/allenai/dolma3_pool

__float · 11d ago

It's rather off-topic at this point, but I've never understood how HF can afford to be a CDN for such huge files. It seems like enterprise customers must be subsidizing a lot, but...at that point, is there not a cheaper alternative that doesn't subsidize every hobbyist and startup around?

tw1984 · 11d ago

> how HF can afford to be a CDN for such huge files

Bandwidth and storage are practically free compared to the cost of GPU clusters. HF gets rewarded heavily by the capital markets for being in AI without actually doing much AI itself, which is a huge win compared to what they pay for bandwidth and storage.

re-thc · 11d ago

> how HF can afford to be a CDN for such huge files

To be precise, Amazon CloudFront is the CDN. Maybe they got some startup deal?

Amazon does now also have flat rate plans that are a lot cheaper.

hnfong · 11d ago

> I'm just waiting for some corporate IP law firm to learn about Huggingface.

Presumably they already know. The issue is that IP law firms are tiny compared to the trillions in capital pouring into "AI". And if you believe the USA is a capitalist country where the side with deeper pockets wins, you know you're not going to win against the trillionaires.

alchemist1e9 · 11d ago

Why is the text field in the dataset preview table populated with pornographic labels?

yencabulator · 8d ago

Because it's a random sample of the Internet?

my123 · 11d ago · 1 in thread

NVIDIA's recent Nemotrons tend to be open training data and code.

Probably as a base for people buying NVIDIA hardware to train their own models.

lambda · 11d ago

Nemotron is mostly open data. They release only a portion of their pre-training data. From https://docs.nvidia.com/nemotron/latest/nemotron/super3/pret...

  Open-source data coverage: The released datasets cover an estimated 8–10T tokens 
  (~40–50% of the internal 25T blend). Missing categories include code (~14% of blend),
  nemotron-cc-code (~2%), crawl++ (~2%), and academic text (~2%). Users should 
  supplement with their own data for these categories and adjust train_iters 
  accordingly.

Nemotron is the strongest model (on most benchmarks) that has its full training pipeline and most of the data open. Olmo 3 from AllenAI, and K2 Think V2 from Mohamed bin Zayed University of Artificial Intelligence are both fully open, but not as capable as the Nemotron family. Granite has much of the training pipeline and data open, but is missing some of each.

tuananh · 11d ago

IBM Granite has been open data from the beginning, IIRC.
