Cerebras-GPT: A Family of Open, Compute-Efficient, Large Language Models

(cerebras.net)

567 points · asb · 3y ago · 232 comments

2bitencryption3y ago

This type of article (or press release, or whatever you want to call it) is exactly what makes the future so interesting.

The cat is out of the bag, the genie is out of the bottle, the confetti has left the cannon[0].

It's tempting to see a world dominated by Google Bard, ChatGPT, Bing Search, etc. And no doubt, they will be huge players, with services that are far more powerful than anything that can be run on the edge.

But. BUT. The things that we can do on the edge are incredible now. Just imagine a year from now, or two. These earth-shattering models, which seem to be upending a whole industry, will soon have equivalents that run on the edge. Without services spying on your data. Without censorship on what the model can/cannot say. Because it's all local.

When was the last time this happened? There will be players who publish weights for models that are free to use. The moment that torrent magnet link is published, it's out in the wild. And smart people will package them as "one click installers" for people who aren't tech-savvy. This is already happening.

So every time you're amazed by something chat-gpt4 says, remember that soon this will be in your pocket.

[0] the "confetti" idiom brought to you by chat-gpt4.

simon833y ago

Google: "confetti has left the cannon"

> No results found for "confetti has left the cannon".

I'm amazed that a "stochastic parrot" can come up with such a beautiful idiom.

2bitencryption3y ago

For the curious, here was the conversation I had:

ME: What's another saying similar to: "The cat is out of the bag" or "The genie is out of the bottle"?

chatgpt: Another similar saying is "the toothpaste is out of the tube." Like the other two expressions, this phrase conveys the idea that something has been revealed or unleashed, and it cannot be undone or put back into its original state.

ME: Can you invent a new phrase that has the same meaning, which is completely new?

chatgpt: Certainly! Here's a new phrase with a similar meaning: "The confetti has left the cannon." This phrase suggests that once something is out in the open or has happened, it's impossible to reverse or contain it, much like confetti that has been shot out of a cannon during a celebration.

ME: What's the name for this kind of expression? Aphorism? Fable?

chatgpt: This kind of expression is known as an idiom. Idioms are phrases or expressions with a figurative meaning that is different from their literal meaning. They are commonly used in everyday language to convey a specific message or sentiment in a more colorful or memorable way.

tjr3y ago

Not exactly the same, but from:

https://www.instagram.com/p/CQdBiVyh5C2/?hl=en

Now that the cat is out of the bag, or, should I say the confetti is out of the… can?

mcbuilder3y ago

Is it strange that I think it's more likely that LLMs generalize and synthesize new things out of old patterns instead of just copying a singular example?

flir3y ago

Could be lifted from another language. Is nice though.

malshe3y ago

Now Google has one search result pointing to your comment :)

barking_biscuit3y ago

Out of distribution generations are a thing.

simon833y ago

I understand that all of this is based on some fundamental mathematics, a couple of algorithms here, and some statistical analysis there. And I find it absolutely amazing that we can use all of that knowledge and encode it into something that resembles intelligence. This makes me wonder whether our brains and the concept of intelligence are really as special and mysterious as we assume.

athom3y ago

Try looking up "pinkie pie party cannon"

visarga3y ago

That name aged like milk. First of all, what you said. And second - a parrot can make more parrots without human help, language models can't make GPU chips. Insulting for both LLMs and parrots.

jazzkingrt3y ago

Serious question: is it typical to describe client-side computing as "on the edge"?

I thought running something on the edge referred to running it in close network proximity to the user, rather than users having control and running things themselves.

wsgeorge3y ago

I believe this has been extended to mean "on device", which is interesting. See Gerganov's article on Github [0]. I wrote about this here [1] where I made a contrast between the core and the edge. I think the term maps well to this meaning.

What I find more interesting is that in the classic "close network proximity", some parts of the world may not have benefited as much from that trend since the closest nodes of a global delivery network could be several countries away.

[0] https://github.com/ggerganov/llama.cpp/discussions/205

[1] https://medium.com/sort-of-like-a-tech-diary/consumer-ai-is-...

capableweb3y ago

Yes, "edge computing" can refer to both computing done as close to the user as possible geographically, or even on the device itself. If someone says "I wanna do edge computing" it's not clear enough to know if they just want to have servers they control as close to the user as possible, or do the computing on the device itself. I think Apple would say "edge computing" is on the actual device while CloudFlare would say "edge computing" is on their infrastructure, but distributed to be physically closer to the end user.

dragonwriter3y ago

> Serious question: is it typical to describe client-side computing as “on the edge”?

Somewhat; it’s consistent with, e.g., Google’s “Edge TPU” designation for its client-side neural processors.

> I thought running something on the edge referred to running it in close network proximity to the user

Typically, but on the client device is the limit-case of “close network proximity to the user”, so the use is consistent.

aargh_aargh3y ago

Because of the ambiguity of the term "on the edge", which is used to refer to both close network proximity and the device closest to the user, as evidenced by this thread, I would suggest using a new term, at least in the context of A.I. The AI running on the device closest to the user should be called a "terminator".

layer83y ago

“Edge computing” arguably implies there’s a network you are connected to, that you’re on the edge of, so I wouldn’t apply the term to applications that can function completely offline. With edge computing there’s usually still a notion of having some sort of internet integration, like IoT devices.

dannyobrien3y ago

I've used "edge" in this context for around 15 years[1], and I've always intended it to mean "at the edge of the network", which can include being on the other side of the world to a user.

[1] from https://www.oblomovka.com/wp/2007/08/ at least

slowmovintarget3y ago

> Without services spying on your data. Without censorship on what the model can/cannot say. Because it's all local...

Wouldn't that be nice? It would also be contrary to all experience of the outcomes and pulls of corporations in modern society. The "local" LLMs will be on the fringe more than at the edge, because the ones that work the best and attract the most money will be the ones controlled by walled-garden "ecosystems."

I really hope it's different. I really hope there are local models. Actual personal assistants actually designed to assist their users and not the people that provide the access.

hiAndrewQuinn3y ago

I for one dream of a future without maps. I want to walk through a distant forest to find an ancient, unconnected ESP-32 in the bark of a tree containing a tiny specialized AI that can only tell me about things relevant to the area, how far to walk upstream to the nearest town. And only if I can find it and scan an RFID tag to wake it up.

vinc3y ago

A beautiful dream!

> I like to think (right now please!) of a cybernetic forest filled with pines and electronics where deer stroll peacefully past computers as if they were flowers with spinning blossoms.

hintymad3y ago

I'd go one step further if it is not happening yet: smaller companies should really pool their resources to train open LLMs. Say, form a consortium and work with the open source community to build a ChatGPT equivalent. Companies would be crazy to assume that they can hand their future to the APIs offered by a handful of companies during this monumental technological paradigm shift.

That is, a real OpenAI with an open governing body.

lioeters3y ago

Yes, yes, and yes. I'm waiting for an actually open AI that can run on the edge, purely on commodity hardware like our laptops and phones - it's inevitable.

I imagine this "cat out of the bag" situation, the democratization and commodification of powerful technology accessible and affordable to the public, is similar to what's happening with single-board computers and microcontrollers like Raspberry Pi, Arduino, ESP32.

It might be similar to what happened with mobile phones, but there the power was quite restricted. The (mostly) duopoly of iOS and Android, with devices and apps locked down in various ways. Sure we can "jail break" and "root" our phone, but that's not for the general public.

Maybe solar energy production is going through a similar process, with panels and batteries becoming more efficient and affordable every year.

Certainly, it reminds one of the history of personal computers, the way such a powerful general-purpose tool became ubiquitous and local.

akiselev3y ago

After using ChatGPT 4 extensively for a few days, I think we're probably only a few years away from the first generation of truly conversational assistants à la Jarvis in Iron Man. Between LangChain and existing voice recognition software, we're already 95% of the way there; it just needs to be packaged up into a UI/UX that makes sense.

These local models are absolutely critical for that to happen though. I'm hitting daily situations where I have to reconsider my use of ChatGPT because what I'm asking would leak very private personal information or somehow trip its morality filter.

Just swapped in a 2TB nvme for a separate /home mount and reinstalled Arch just so I could have enough disk space to test a dozen models locally. I'm so ready!

visarga3y ago

> the democratization and commodification of powerful technology accessible and affordable to the public

Empowering everyone equally, with a very low barrier to entry. The spread of AI will make some skills less important and other skills more important. There will be painful social changes. Memorising stuff, being exact, being careful, diligent: all these things can be done by an LLM. But so can creative recombination of ideas. For now we still hold supremacy in long form content, complex software projects and detecting inconsistencies, but those could be emergent qualities in GPT-5.

sbierwagen3y ago

If you thought crypto miners drove up the prices of GPUs, just wait until a gaming card can host a person who can get a job.

Integer-quantized LLaMA can fit on a 4090, and isn't quite human-level. If a fully aligned, reliably non-hallucinating, Turing-grade LLM can run on the next generation 5 series GPUs then expect their market price to quadruple or more.

cjf1013y ago

Yes, this is true. But, I worry about how long it will take for the utility of "GPT-4" on my phone to be close enough to whatever is only possible through models running on large cloud platforms to make that choice relatively drawback free.

Is the curve of what this class of algorithms can provide sigmoid? If so, then yeah, eventually researchers should be able to democratize it sufficiently that the choice to use versions that can run on private hardware is rational. But if the utility increases linearly or better over time/scale, the future will belong to whoever owns the biggest datacenters.

xnx3y ago

This is a shocking turn of events given there's no edge equivalent of the previous most powerful information tools (web-scale search). It does seem like it will still be a challenge to continuously collect, validate, and train on fresh information. Large orgs like Google/YouTube/TikTok/Microsoft still seem to have a huge advantage there.

yieldcrv3y ago

> And smart people will package them as "one click installers" for people who aren't tech-savvy. This is already happening.

Any projects I can follow? Because I haven't seen any one click installers yet that didn't begin with "first install a package manager on the command line"

2bitencryption3y ago

I was mostly referring to this project, which has some 1-click installers: https://github.com/oobabooga/text-generation-webui#alternati...

Though I have not tried those 1-click installers, instead I have been manually running it.

That project is based on the concept of this Stable Diffusion project: https://github.com/AUTOMATIC1111/stable-diffusion-webui

Which is a few months ahead (because the Stable Diffusion tech happened a few months earlier) and is definitely at a point where anyone can easily run it, locally or on a hosted environment.

I expect this "text-generation-webui" (or something like it) will be just as easy to use in the near, near future.

slickdork3y ago

Not an llm but this 1 click installer for stable diffusion is literally a 1 click installer. Impressively works.

https://github.com/cmdr2/stable-diffusion-ui

nathanasmith3y ago

Here's alpaca running in electron. Not exactly one click but close.

https://github.com/ItsPi3141/alpaca-electron

wing-_-nuts3y ago

>So every time you're amazed by something chat-gpt4 says, remember that soon this will be in your pocket.

I want to believe you, but I'm ignorant of the hardware requirements for these things. How soon do you think we'd be able to run something reasonably gpt4-like on, say, a 4090?

LightMachine3y ago

I feel like no less than 10 years if the singularity doesn't kick in before that. Hardware and energy aren't progressing as fast as we'd like, and that is the main bottleneck. As in, imagine a world where we actually had the computing power required to train (not run) GPT-4 in 1s in a phone. That kind of world is way beyond AGI and the cure for cancer IMO. Which is great, because it gives us a very objective goal to achieve these things. Sadly, I don't think we're anywhere near that. What was even the total energy consumption of GPT-4 training? Very hard to imagine computers will get that much better anytime soon. IIRC we have some kind of data that a smartphone today has the power of the best supercomputer... of 30 years ago, right? Don't remember the source, sadly.

bloaf3y ago

It's literally already here: https://twitter.com/thiteanish/status/1635678053853536256

t_minus_23y ago

The cat is out of the bag, the genie is out of the bottle, the confetti has left the cannon, the ship has sailed, the horse has bolted, the toothpaste is out of the tube, the beans have been spilled, the train has left the station, the die is cast, the bell has been rung.

pmoriarty3y ago

The cookie has crumbled.

The mirror has shattered.

The poop has hit the propeller.

Pandora's box has opened.

matchagaucho3y ago

An LLM running locally providing type-ahead completions seems inevitable.

brucethemoose23y ago

FYI: Cerebras's nodes are very different than your typical Nvidia training nodes:

https://www.anandtech.com/show/16626/cerebras-unveils-wafer-...

Each individual "chip" has 40GB of SRAM vs ~76MB for the Nvidia H100, and networked pools of external RAM, SSDs and such. That's why the training architecture is so different.

arbuge3y ago

https://www.cerebras.net/product-chip/

There's a comparison picture there of one of their chips alongside a regular GPU chip. Effectively they use up the entire wafer.

brucethemoose23y ago

Yeah, and that doesn't even do the nutty IO on these things justice.

A 16x CS2 cluster like they describe is like a huge Nvidia cluster in terms of throughput, but more like a single Nvidia node structurally.

shagie3y ago

Don't forget the power consumption on them.

Cerebras Second-Gen Wafer Scale Chip: 2.6 Trillion 7nm Transistors, 850,000 Cores, 15kW of Power - https://www.tomshardware.com/news/cerebras-wafer-scale-engin...

Trying to cool 15kW to 20kW of power is also rather impressive. https://www.cerebras.net/cs2virtualtour - the engine block and cooling manifold

> The challenge of extracting more than 20 kW of heat from the wafer was solved by having the wafer "float" on a cold plate. The wafer is allowed to expand and contract while remaining in contact with the polished front side of the cold plate, despite the different coefficients of thermal expansion of copper and silicon. The cold plate is much more than a slab of metal: advanced computational fluid dynamics modelling was used to design a labyrinth of coolant channels capable of maintaining a precise, stable, thermal environment even as 850,000 AI-optimized cores swing into action.

> The power density of the CS-2 is too high for direct air cooling, so liquid cooling is used instead. The internal manifold transfers heat between the CS-2 system's internal coolant and facilities water. Separating these two fluids ensures that the CS-2 system is not affected by changes in the quality of facilities water and that the very highest-quality coolant circulates through the cold plate.

> The two pump modules plug into the upper four dry-break connectors. The lower two are for the air-cooling or water-cooling heat exchanger.

ipsum23y ago

Cerebras makes impressive hardware, but Nvidia still performs better in every regard. The most telling factor is that Cerebras claims they're too busy to run common benchmarking (e.g. MLPerf) to compare against Nvidia.

alchemist1e93y ago

It’s unbelievable stuff. Does anyone know how much a single box costs? They are selling them it looks like.

IshKebab3y ago

It's a pretty mad architecture tbh. Compile times must be absolutely insane. Also, Tesla's Dojo uses a manufacturing technique that has basically obsoleted their WSI design already.

anon2913y ago

Compile times are not a whole lot different than any other large model build. It's a kernel based compilation pipeline and the kernels are simply tiled over a 'core' area in the weight streaming architecture.

eldenring3y ago

> Trained using the Chinchilla formula, these models provide the highest accuracy for a given compute budget.

I'm confused as to why 111 million parameter models are trained with the Chinchilla formula. Why not scale up the training data? If you're training smaller models, surely optimizing performance is better than optimizing total compute.

Seems like a silly misunderstanding of the Chinchilla paper, but I'm sure I'm missing something

gamegoblin3y ago

True. There was a good blog post published about this a few weeks ago: https://finbarr.ca/llms-not-trained-enough/

Money quote for those who don't want to read the whole thing:

'''

When people talk about training a Chinchilla-optimal model, this is what they mean: training a model that matches their estimates for optimality. They estimated the optimal model size for a given compute budget, and the optimal number of training tokens for a given compute budget.

However, when we talk about “optimal” here, what is meant is “what is the cheapest way to obtain a given loss level, in FLOPS.” In practice though, we don’t care about the answer! This is exactly the answer you care about if you’re a researcher at DeepMind/FAIR/AWS who is training a model with the goal of reaching the new SOTA so you can publish a paper and get promoted. If you’re training a model with the goal of actually deploying it, the training cost is going to be dominated by the inference cost. This has two implications:

1) there is a strong incentive to train smaller models which fit on single GPUs

2) we’re fine trading off training time efficiency for inference time efficiency (probably to a ridiculous extent).

Chinchilla implicitly assumes that the majority of the total cost of ownership (TCO) for a LLM is the training cost. In practice, this is only the case if you’re a researcher at a research lab who doesn’t support products (e.g. FAIR/Google Brain/DeepMind/MSR). For almost everyone else, the amount of resources spent on inference will dwarf the amount of resources spent during training.

'''
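
The trade-off in the quoted passage can be put into rough numbers. The sketch below is not from the blog post: it assumes the common rule-of-thumb estimates of about 6·N·D FLOPs for training and about 2·N FLOPs per token for inference (N parameters, D training tokens), and the function names are made up for illustration.

```python
# Rule-of-thumb FLOP estimates (assumptions, not taken from the quoted post):
#   training  ~ 6 * N * D  (forward + backward pass over D training tokens)
#   inference ~ 2 * N      (forward pass only, per served token)

def training_flops(n_params: float, n_train_tokens: float) -> float:
    """Approximate total training compute in FLOPs."""
    return 6.0 * n_params * n_train_tokens


def inference_flops(n_params: float, n_served_tokens: float) -> float:
    """Approximate total inference compute in FLOPs."""
    return 2.0 * n_params * n_served_tokens


def break_even_tokens(n_params: float, n_train_tokens: float) -> float:
    """Served tokens at which cumulative inference compute equals training compute."""
    return training_flops(n_params, n_train_tokens) / (2.0 * n_params)


if __name__ == "__main__":
    # Illustrative inputs: a 13B-parameter model trained on 260B tokens.
    n, d = 13e9, 260e9
    print(f"training FLOPs:    {training_flops(n, d):.3e}")
    print(f"break-even tokens: {break_even_tokens(n, d):.3e}")
```

Under these assumptions the break-even point is simply three times the number of training tokens, independent of model size, so a widely deployed model crosses it quickly while an internal, low-traffic model may never do so.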

sebzim45003y ago

>Chinchilla implicitly assumes that the majority of the total cost of ownership (TCO) for a LLM is the training cost. In practice, this is only the case if you’re a researcher at a research lab who doesn’t support products (e.g. FAIR/Google Brain/DeepMind/MSR). For almost everyone else, the amount of resources spent on inference will dwarf the amount of resources spent during training.

I'm not so convinced, especially if people are doing multiple training runs for hyperparameter tuning, cleaning data, fixing bugs, etc.

I would be very interested in knowing what portion of OpenAI's compute budget is training. I would not be surprised if it was a significant minority.

aiappreciator3y ago

"the training cost is going to be dominated by the inference cost."

That's only true for general-mass-consumer models.

Companies that want to fine-tune/train their own models, which don't have that many users for their narrow use cases (possibly only internal staff), will find that training cost is a substantial chunk of the TCO.

haldujai3y ago

While true I think this also misses that “for almost everyone else” you’re probably not (or at least should not) be trying to optimize zero-shot performance if you have an intended high inference use case so I don’t think Chinchilla would be all that relevant.

ftxbro3y ago

The point of those smaller models is for the "Cerebras Scaling Law for Compute-Optimal Training" which is the straight line plot in the image at the top of their webpage when you click the link.

They want you to think it's reasonable that because the line is so straight (on a flops log scale) for so long, it could be tempting to extrapolate the pile-loss consequences of continuing compute-optimal training for larger models beyond their largest 13B one, with the obvious caveat that the extrapolation can't continue linearly much further if for no other reason than the test loss isn't going to go below zero (it will flatten out sooner than that).

If you trained beyond compute-optimality on smaller models, it would mess up their straight line and make it look like we are sooner hitting diminishing returns on test loss.
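
For anyone who wants to play with the extrapolation argument, here is a minimal sketch. It assumes both axes of the plot are logarithmic (the comment only says the FLOPs axis is), in which case a straight line is a power law. The data points are synthetic and made up for illustration; they are not Cerebras measurements, and the helper names are hypothetical.

```python
import math


def fit_power_law(flops, losses):
    """Least-squares fit of log10(loss) = a + b * log10(flops); returns (a, b)."""
    xs = [math.log10(f) for f in flops]
    ys = [math.log10(l) for l in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = mean_y - b * mean_x
    return a, b


def predict_loss(a, b, flops):
    """Extrapolate the fitted power law to a new compute budget."""
    return 10 ** (a + b * math.log10(flops))


# Synthetic points generated from loss = 20 * C**-0.05 (NOT real measurements).
flops = [1e18, 1e19, 1e20, 1e21, 1e22]
losses = [20 * c ** -0.05 for c in flops]

a, b = fit_power_law(flops, losses)
print(f"fitted slope: {b:.4f}")
print(f"extrapolated loss at 1e30 FLOPs: {predict_loss(a, b, 1e30):.4f}")
```

On log-log axes the extrapolated loss shrinks toward zero without ever crossing it, even though its logarithm becomes negative once the loss drops below 1. Any flattening would have to come from an irreducible-loss floor, which this sketch does not model.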

bjornsing3y ago

> the extrapolation can't continue linearly much further if for no other reason than the test loss isn't going to go below zero

Isn’t the test loss logarithmic? If so it sure can go below zero.

haldujai3y ago

You’re not wrong, the Chinchilla rationale is that it may be more compute efficient to obtain a given loss using larger model sizes if the budget allows. As another commenter states, this ignores the inference part of the equation.

As an example the BERT/RoBERTa family were trained for much longer than Chinchilla, you do get diminishing returns though.

There is a point of overtraining where downstream performance is impacted but that’s pretty high.

I think part of the answer to this is also that xxx million parameter decoder-only models don’t seem to be that useful so it may not be worthwhile to optimize them for performance?

johnchristopher3y ago

OT: I don't know about their scaling strategy for LLM but their scaling strategy for displaying pictures is disappointing.

(it's all blurry)

thewataccount3y ago

They're dynamically scaled and something must be broken. If you inspect source you can find the raw images, here's a few:

https://www.cerebras.net/wp-content/uploads/2023/03/Downstre...

https://www.cerebras.net/wp-content/uploads/2023/03/Scaling-...

EDIT: Looks like it scores better with less training - up until it matches GPT-J/Pythia/OPT and doesn't appear to have much benefit. It maybe scores slightly better than GPT-J which is pretty "eh", I'm not sure if GPT-J level performance is really useful for anything? NeoX 20B outperforms it in everything if you don't care about the amount of training needed.

Does the better performance for less training matter if that benefit only applies when it's performing a lot worse than GPT-J? It appears to lose its scaling benefits before the performance is interesting enough to matter?

johnchristopher3y ago

Disabling ublock on Firefox fixes the issue for me.

edit: scratch that, it seems the AJAX endpoint returns 504 more often than not.

Kelamir3y ago

Last time I viewed it, I believe it wasn't blurry. Perhaps to scale the traffic the images are now displayed in lower quality?

But I'm not sure anymore that it wasn't initially blurry... Perhaps I'm hallucinating, like large language models.

Current image displayed is https://www.cerebras.net/wp-content/uploads/2023/03/Scaling-... , will see if it changes.

Kelamir3y ago

I can confirm, it does change. As of now, it displays one of higher quality: https://www.cerebras.net/wp-content/uploads/2023/03/Scaling-...

ricopags3y ago

Came here to point this out, though not as pithily :D

Really, really bad mark on whoever is in charge of their web marketing. Images should never look that bad, not even in support, but definitely not in marketing.

edit: so this post is more useful, 4k res using Edge browser

lanshanbob3y ago

I think it's fixed now

amilios3y ago

Comparing the 13B model here https://huggingface.co/cerebras/Cerebras-GPT-13B to LLaMA-13B https://github.com/facebookresearch/llama/blob/main/MODEL_CA... you can see that in all of the reasoning tasks Cerebras-GPT lags behind. Any reason to use Cerebras instead of LLaMA? Doesn't seem like it.

potatoman223y ago

Can the LLaMA weights be used for commercial products?

espadrine3y ago

There are two aspects to it.

The first one is whether they would actually sue. The optics would be terrible. A similar situation occurred in the 90s when the RC4 cipher’s code was leaked. Everyone used the leaked code pretending that it was a new cipher called ARC4 (or ARCFOUR), even though they had confirmation from people that licensed the cipher that its output was identical. Nobody was sued, and the RSA company never acknowledged it.

The second one is related to the terms. The LLaMA weights themselves are licensed under terms that exclude commercial use:[0]

> You will not […] use […] the Software Products (or any derivative works thereof, works incorporating the Software Products, or any data produced by the Software), […] for […] any commercial or production purposes.

But the definition of derivative works is gray. AFAIK, if LLaMA is distilled, there is an unsettled argument to be had that the end result is not a LLaMA derivative, and cannot be considered copyright or license infringement, similar to how models trained on blog articles and tweets are not infringing on those authors’ copyright or licensing. The people that make the new model may be in breach of the license if they agreed to it, but maybe not the people that use that new model. Otherwise, ad absurdum, a model trained on the Internet will have content that was generated by LLaMA in its training set, so all models trained on the Internet after Feb 2023 will break the license.

IANAL, but ultimately, Meta wins more by benefiting from what the community contributes on top of their work (similar to what happened with React), than by suing developers that use derivatives of their open models.

[0]: https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z...

gpm3y ago

Unclear, likely jurisdiction dependent, almost certainly not if you need to operate world wide.

mdagostino3y ago

LLaMA is non-commercial

option3y ago

it lags behind because according to their blogpost it was trained on <300B tokens. LLaMAs as far as I know were trained on more than a trillion

gpm3y ago

The LLaMa paper says 1 trillion for the smaller models (7B, 13B) and 1.4 trillion for the larger models (30B, 65B)

tombert3y ago

Has anyone tried this? I have 96GB of GPU memory; will that be enough to run the biggest model?

spi3y ago

I have not tried, but 96GB of GPU memory is plenty, for inference there should certainly be no issue. Their biggest model has 13B parameters, you should be able to run inference (float16) already with 32GB of memory.

With 96GB of memory you should also be able to fine-tune it (possibly some tricks like gradient accumulation and/or checkpointing might be needed), but you have to be ready for many days of computation...
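
The memory estimate above can be checked with back-of-the-envelope arithmetic. This is a minimal sketch that counts the weights only; activations, KV cache, and framework overhead are ignored, and the helper name is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory for the model weights alone, in GB (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"13B parameters @ {dtype}: {weight_memory_gb(13e9, dtype):.0f} GB")
```

At float16 the 13B weights come to about 26 GB, which is consistent with the 32GB figure once some headroom for activations is added. Fine-tuning needs considerably more, since gradients and optimizer state are stored as well.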

alchemist1e93y ago

> but you have to be ready for many days of computation...

I was thinking: we have API prices in tokens, and now it looks like self-hosted inference on high-end GPUs is possible for similar models. So, based on electricity prices, there will be a self-hosted price in tokens. How close are these already? What is OpenAI's markup today over its rough raw electricity cost?
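
A rough way to frame that question is to convert power draw and throughput into an electricity cost per million tokens. The sketch below is only that; every input (wattage, tokens per second, price per kWh) is a placeholder assumption, not a measurement, and hardware purchase cost is ignored.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules


def electricity_cost_per_million_tokens(
    power_watts: float, tokens_per_second: float, usd_per_kwh: float
) -> float:
    """Electricity cost in USD to generate one million tokens."""
    seconds = 1e6 / tokens_per_second           # time to produce 1M tokens
    kwh = power_watts * seconds / JOULES_PER_KWH  # watts * seconds = joules
    return kwh * usd_per_kwh


if __name__ == "__main__":
    # Placeholder inputs: 400 W under load, 20 tokens/s, $0.15 per kWh.
    cost = electricity_cost_per_million_tokens(400, 20, 0.15)
    print(f"~${cost:.2f} of electricity per million tokens")
```

Comparing the result with a published API price per million tokens gives an upper bound on the markup over electricity alone; a fair comparison would also have to amortize the GPU itself.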

cuuupid3y ago

13B fits nicely even in a 3090 (24gb vram)!

Filligree3y ago

Only if you can use int8, which may or may not be convenient.

grishka3y ago

Does that mean I would be able to run it on an M1 Max with 64 gb unified memory?

lxe3y ago

with load_in_8bit?

binarymax3y ago

Here are the zero-shot accuracy numbers posted in the Huggingface evaluations for Cerebras-GPT 13B vs. the results of LLaMa 13B in their paper:

    Model              BoolQ PIQA SIQA HellaSwag WinoGrande ARC-e ARC-c OBQA
    LLaMa 13B          78.1  80.1 50.4 79.2      73         74.8  52.7  56.4
    Cerebras-GPT 13B   -     76.6 -    51.3      64.6       71.4  36.7  28.6
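
To make the gaps easier to read, here is a small sketch that computes the per-benchmark difference from the numbers in the table above. Benchmarks with a missing Cerebras-GPT score are skipped; the variable and function names are just for illustration.

```python
BENCHMARKS = ["BoolQ", "PIQA", "SIQA", "HellaSwag", "WinoGrande", "ARC-e", "ARC-c", "OBQA"]
LLAMA_13B = [78.1, 80.1, 50.4, 79.2, 73.0, 74.8, 52.7, 56.4]
CEREBRAS_13B = [None, 76.6, None, 51.3, 64.6, 71.4, 36.7, 28.6]  # None = not reported


def score_gaps(names, scores_a, scores_b):
    """Return {benchmark: a - b}, rounded to one decimal, where both scores exist."""
    return {
        name: round(a - b, 1)
        for name, a, b in zip(names, scores_a, scores_b)
        if a is not None and b is not None
    }


gaps = score_gaps(BENCHMARKS, LLAMA_13B, CEREBRAS_13B)
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<11} LLaMA leads by {gap:.1f} points")
```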

smaddox3y ago

Cerebras is "training compute optimal". Llama appears to be trained far beyond "training compute optimal". The tradeoff is that inference is closer to optimal for Llama, i.e. better performance with a smaller model.

wsgeorge3y ago

I guess it's something. It still goes to show how far open models are behind the proprietary SOTA.

topynate3y ago

From their discord:

> It would be interesting to know why you chose those FLOPS targets, unfortunately it looks like the models are quite under pre-trained (260B tokens for 13B model)

> We chose to train these models to 20 tokens per param to fit a scaling law to the Pile data set. These models are optimal for a fixed compute budget, not necessarily "best for use". If you had a fixed parameter budget (e.g., because you wanted to fit models on certain hardware) you would train on more tokens. We do that for our customers that seek that performance and want to get LLaMA-like quality with a commercial license

Which is the point made elsewhere in these comments, e.g. https://news.ycombinator.com/item?id=35344192, and also usefully shows how open Cerebras are. They're pretty open, but not as much as they would be if they were optimising for filling in other companies' moats.
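
The "20 tokens per param" rule from the Discord reply is easy to sanity-check against the token counts mentioned in this thread. A minimal sketch, with hypothetical helper names:

```python
TOKENS_PER_PARAM = 20  # ratio quoted in the Discord reply


def compute_optimal_tokens(n_params: float, ratio: float = TOKENS_PER_PARAM) -> float:
    """Training tokens implied by a fixed tokens-per-parameter ratio."""
    return n_params * ratio


def tokens_per_param(n_train_tokens: float, n_params: float) -> float:
    """Ratio actually used for a given model size and token count."""
    return n_train_tokens / n_params


if __name__ == "__main__":
    # Cerebras-GPT 13B at 20 tokens per parameter.
    print(f"13B at 20 tokens/param: {compute_optimal_tokens(13e9) / 1e9:.0f}B tokens")
    # LLaMA 13B, trained on about 1 trillion tokens per the LLaMA paper.
    print(f"LLaMA 13B: {tokens_per_param(1e12, 13e9):.0f} tokens/param")
```

The first figure matches the 260B tokens quoted for the 13B model; the second shows LLaMA 13B was trained at nearly four times that ratio, which is the "fixed parameter budget" regime the reply describes.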

binarymax3y ago

Indeed but this is zero-shot performance. Fine-tuning for a task should get you pretty good results. I'm interested in seeing the results of an Alpaca method against this Cerebras 13B model.

riku_iki3y ago

Have these models been trained on the same dataset? Otherwise it is an apples-to-oranges comparison.

whalesalad3y ago

This “AI spring” is really snowballing with the crazy nouns and terminology. Alpaca, llama and now chinchilla??

murkt3y ago

Chinchilla actually came before alpaca and llama.

Every new variation of model gets some new name, just like every library gets a new name. There were all kinds of BERTs before - DistilBert, Roberta, SciBERT, Schmobert, Schmuber, etc. Many hundreds of them, I think.

ramesh19943y ago

The term "chinchilla" predates llama/alpaca. It doesn't directly map to a specific model, rather a family of compute-optimal models.

jhbadger3y ago

As mentioned, chinchilla is not part of this trend, and chinchillas are rodents. Alpacas and llamas are South American camelids (animals related to camels). So if additional names are needed, I would expect them to be vicuña and guanaco, as they are also in the group.

mometsi3y ago

I think the relevant category is "Adorable Fuzzy Critters of the Andes". See also https://en.wikipedia.org/wiki/Spectacled_bear

whoisnnamdi3y ago

Chinchilla actually came first!

chessgecko3y ago· 4 in thread

I wonder what led to such a gap between llama 7b and Cerebras 13b. I hope they discuss it in the paper.

ftxbro3y ago

This gap makes sense to me. The academic point of the Cerebras paper is to show their nice empirical scaling law for compute-optimal training, whereas the academic point of the LLaMA paper was to show that you can make small models punch above their weight by training them in a way that is deliberately not compute-optimal. Of course both of those publications had other academic and marketing purposes.

From the Cerebras blog post: "Trained using the Chinchilla formula, these models provide the highest accuracy for a given compute budget."

From the LLaMA paper: "The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than what is typically used."

gpm3y ago

Is there a benchmark comparing the two that I missed?

Edit: The huggingface page has 0-shot benchmarks which you can compare against the llama paper

https://huggingface.co/cerebras/Cerebras-GPT-13B

https://arxiv.org/pdf/2302.13971.pdf

freeqaz3y ago

I'm on mobile and struggled to compare these two tables properly. Would you mind posting a summary of your findings?

Here are some values, but I don't know what they mean. LLaMA 65B on the left, Cerebras 13B on the right.

PIQA: 82.8 / 76.6
WinoGrande: 77.0 / 64.6
ARC-e: 78.9 / 71.4


jumpCastle3y ago

Looks like llama 7b was trained on 4 times more tokens.

mometsi3y ago· 3 in thread

Summary: This is a company that makes AI accelerator ICs. They reimplemented Chinchilla and released the model weights under a permissive license.

bogwog3y ago

In other words, they’re actually incentivized to help make LLMs as accessible as possible, rather than try to keep them locked up to hide them from competitors.

Which makes me wonder if Nvidia is doing anything with LLMs too?

vintermann3y ago

NVidia has certainly been pushing the envelope on image generation. StyleGAN3 was really cool when it came out.

But it is an issue that their chips are hardly optimized for LLMs.


option3y ago

Obviously yes: there are larger (20B) models, trained on more tokens, that NVIDIA has released under a commercial-friendly license on HF

https://huggingface.co/nvidia/nemo-megatron-gpt-20B

Garcia983y ago· 3 in thread

I've been following open source LLMs for a while, and at first glance this doesn't seem too powerful compared to other open models. Flan-Alpaca[0] is licensed under Apache 2.0 and seems to perform much better, although I'm not sure about the legality of that licensing, since it's basically Flan-T5 fine-tuned on the Alpaca dataset (which is under a non-commercial license).

Nonetheless, it's exciting to see all these open models popping up, and I hope that an LLM equivalent to Stable Diffusion comes sooner rather than later.

[0]: https://github.com/declare-lab/flan-alpaca

alchemist1e93y ago

Sounds like you might be the right person to ask the “big” question.

For a small organization or individual who is technically competent and wants to try self-hosted inference: what open model is showing the most promise, and how do its results compare to the various OpenAI GPTs?

A simple example problem would be asking for a summary of code. I’ve found OpenAI’s GPT 3.5 and 4 to give pretty impressive English descriptions of code. Running that locally in batch would retain privacy, and even if slow it could just be kept running.

Garcia983y ago

Google's Flan-T5, Flan-UL2, and their derivatives are so far the most promising open (including for commercial use) models that I have tried. However, they are very "general purpose" and don't perform well on specific tasks like code understanding or generation. You could fine-tune Flan-T5 with a dataset that suits your specific task and get much better results, as shown by Flan-Alpaca.

Sadly, there's no open model yet that acts like a Swiss Army knife and gets good-enough results for multiple use cases.


ftxbro3y ago

Their goal isn't to make a powerful model. It's to show how well compute-optimal models do on test-loss as a function of increasing model size. This function can be used with some caveats to forecast the test-loss of larger models for which compute-optimality becomes more important.
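The forecasting idea in that comment can be sketched as fitting a straight line in log-log space and extrapolating it. This is a minimal illustration, not the method from the Cerebras paper. The data points are synthetic, generated from a made-up power law (they are not Cerebras results), and the helper name is hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b), done as a line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    slope = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly)) / sum(
        (u - mean_x) ** 2 for u in lx
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic points from a known power law, for illustration only.
TRUE_A, TRUE_B = 10.0, 0.05
scale = [1e18, 1e19, 1e20, 1e21]  # stand-in for compute or model size
loss = [TRUE_A * s ** (-TRUE_B) for s in scale]

a, b = fit_power_law(scale, loss)

# Extrapolate two orders of magnitude past the largest fitted point.
forecast = a * 1e23 ** (-b)
print(f"a={a:.3f}, b={b:.3f}, forecast loss at 1e23: {forecast:.3f}")
```

The caveat from the comment applies directly: the extrapolation is only trustworthy while the larger models stay on the same compute-optimal line as the fitted ones.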

antimatter153y ago· 3 in thread

Looking at their charts it seems like their 6.7B model is considerably worse than GPT-J which is an existing open 6B model from several years ago.

I wish rather than stopping training early they would have run more data through a small model so we could have something more competitive with LLaMA 7B.

cs-fan-1013y ago

Someone posted this quote from the Cerebras Discord earlier, but sharing for visibility:

"We chose to train these models to 20 tokens per param to fit a scaling law to the Pile data set. These models are optimal for a fixed compute budget, not necessarily "best for use". If you had a fixed parameter budget (e.g., because you wanted to fit models on certain hardware) you would train on more tokens. We do that for our customers that seek that performance and want to get LLaMA-like quality with a commercial license"

HanClinto3y ago

Sounds like we should crowd-fund the cost to train and open source one of these models with LLaMa-like quality.

I'd chip in!


brucethemoose23y ago

> Cerebras Discord

This is really weird to hear out loud.

I still think of Discord as a niche gaming chatroom, even though I know that (for instance) a wafer scale IC design company is hosting a Discord now.

JamesCoyne3y ago· 3 in thread

Slightly off-topic:

I remember seeing news about the enormous chip Cerebras was/is selling (pdf https://f.hubspotusercontent30.net/hubfs/8968533/WSE-2%20Dat...).

Has there been any indication that the LLMs released in the last few months use exotic hardware like this, or is it all "standard" hardware?

wmf3y ago

OpenAI uses Nvidia GPUs and Google uses their TPUs.

ipsum23y ago

Everyone except Google uses Nvidia for training. Cerebras, Gaudi, and other custom AI accelerators have been unable to surpass Nvidia in performance/$ and performance/watt so far.

brucethemoose23y ago

You might see more training on Intel XPUs when they come out, since they have such enormous RAM pools. Maybe AMD MI300s and Intel Ponte Vecchio (both 128GB) in the shorter term, though I think they will mostly be in HPC supercomputers instead of cloud instances.

rnosov3y ago· 2 in thread

I might be missing something but it looks to me that actually running this "open" model requires special hardware only accessible with a cloud subscription with 60 000 USD / week minimum spend[1]. Can anyone confirm if you can run it on your own hardware? If software is open but hardware is locked I don't see the point.

[1] https://www.hpcwire.com/2021/09/16/cerebras-wafer-scale-engi....

EDIT: Ok, looks like I've missed the hugging face repo. The language they use is a bit confusing.

simonw3y ago

The PyTorch model files are already available to download from Hugging Face - the largest one looks to be 52GB. They should run on any hardware that can run regular PyTorch models.
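The 52GB figure is consistent with 13B parameters stored at 4 bytes each. That the checkpoint is fp32 is an inference from the size, not something confirmed in the thread. A quick sketch of weight-only memory at a few precisions (activations and any inference cache are ignored, and the helper name is made up):

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_gb(n_params: float, dtype: str) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"13B params in {dtype}: ~{weights_gb(13e9, dtype):.0f} GB")
```

By this estimate the 13B model needs about 26 GB in fp16 and about 13 GB in int8, which is why 8-bit loading matters for a 24 GB consumer card.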

bubblethink3y ago

You can run inference on GPUs. These are just models and weights.

simonw3y ago· 2 in thread

Does the chinchilla recipe still hold today? I got the impression that the LLaMA paper proposed a different result where throwing far more tokens at the problem had a very meaningful impact, or did I misunderstand that?

evanmays3y ago

There’s discussion elsewhere in this thread about what chinchilla actually means. I’ll only compare it to llama.

TL;DR: Chinchilla isn’t wrong, it’s just useful for a different goal than the llama paper.

There are 3 hyperparameters to tweak here: model size (parameter count), number of tokens pre-trained on, and amount of compute available. End performance is in theory a function of these three hyperparameters.

You can think of this as an optimization function.

Chinchilla says, if you have a fixed amount of compute, here’s what size and number of tokens to train for maximum performance.

A lot of the time, though, we have a fixed model size, because size impacts inference cost and latency. Llama operates in this territory. They chose to fix the model size instead of the amount of compute.

This could explain gaps in performance between Cerebras models of size X and llama models of size X. Llama models of size X have way more compute behind them.
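The two regimes described above can be put side by side in a few lines. This is a sketch under stated assumptions: training compute is approximated as C ≈ 6·N·D (a common rule of thumb, not from this comment), "compute-optimal" is simplified to the 20-tokens-per-parameter ratio Cerebras quotes, and the function names are illustrative. The 1T-token figure for LLaMA 13B comes from the LLaMA paper.

```python
import math

TOKENS_PER_PARAM = 20  # simplified compute-optimal ratio


def training_flops(n_params: float, n_tokens: float) -> float:
    """Assumed approximation: C = 6 * N * D."""
    return 6.0 * n_params * n_tokens


def compute_optimal(flops_budget: float):
    """Fixed compute: solve C = 6 * N * (20 * N) for N, then D = 20 * N."""
    n_params = math.sqrt(flops_budget / (6.0 * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params


# Fixed-compute view: the budget of a 13B model on 260B tokens maps back to ~13B.
budget = training_flops(13e9, 260e9)
n_opt, d_opt = compute_optimal(budget)

# Fixed-size view: keep 13B parameters but train on 1T tokens instead.
ratio = training_flops(13e9, 1e12) / budget
print(f"optimal N ~{n_opt / 1e9:.1f}B, D ~{d_opt / 1e9:.0f}B tokens")
print(f"1T-token run uses ~{ratio:.2f}x the compute at the same size")
```

Under these assumptions, a 13B model trained on 1T tokens has a bit under four times the training compute of the Cerebras 13B model, which is the gap the comment points at.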

espadrine3y ago

I don’t think it holds for two reasons.

First, it only holds for a given architecture and implementation. Obviously, a different architecture will have a different training slope. This is clear when comparing LSTM with Transformers, but is also true between transformers that use prenorm/SwiGLU/rotary-positional, and those that follow Vaswani 2017.

In terms of implementation, some algorithms yield the same result with fewer operations (IO, like FlashAttention and other custom CUDA kernels, and parallelism, like PaLM, which both came after Chinchilla), which unambiguously affects the Tflops side of the Chinchilla equation. Also, faster algorithms and better parallelization will reach a given loss sooner, while less power-hungry setups will do that more cheaply.

Second, even in the original Chinchilla paper in figure 2, some lines are stopped early before reaching Pareto (likely because it ran out of tokens, but LLaMA makes it seem that >1 epoch training is fine).

skybrian3y ago· 1 in thread

What’s in the Pile training data they used? How much source code does it include?

sanxiyn3y ago

https://arxiv.org/abs/2101.00027 is the paper and it includes 95.16 GiB from GitHub.

simonw3y ago· 1 in thread

"Cerebras open sources seven GPT-3 models from 111 million to 13 billion parameters."

I don't understand why they describe them as GPT-3 models here as opposed to calling them GPT models. Or even LLMs - but I guess that acronym isn't as widely recognized.

wsgeorge3y ago

I think GPT-3 is used as a benchmark for performance, so saying a model is on par with GPT-3 should give you an idea of what you can get out of it.

IIRC most open source models to date - including the semi-open LLaMAs - have GPT-3-like performance. Nothing gets close to GPT-3.5 and beyond.

AlexanderTheGr83y ago· 1 in thread

Is there a regularly updated repository containing all the releases of LLMs as they happen? TBH I am tired of having to doommark (doom-bookmark) so many repositories and links... I would appreciate some kind of collected database.

adt3y ago

This is close, table of LLMs as released, and I try and add repos for the 'open' models: https://lifearchitect.ai/models-table/

wg03y ago· 1 in thread

Noob to ML in practice. These models containing weights, all of them, do they have a standard file/binary format?

examplary_cable3y ago

[I'm not an expert] but I believe .ckpt and .safetensors.

The problem with .ckpt is that it is pickle-based, so loading one can execute arbitrary code on your machine (very unsafe). .safetensors, on the other hand, was made by Hugging Face in order to have a safe format to store the weights. I've also seen people load up the llama 7B via a .bin file.
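Why a pickle-based checkpoint is risky can be shown with the standard library alone. This is a deliberately harmless sketch: the `Payload` class is hypothetical, and it only evaluates an arithmetic string, but the same hook could run anything.

```python
import pickle


class Payload:
    """Hypothetical stand-in for an object hidden inside a checkpoint file."""

    def __reduce__(self):
        # Instead of the object's state, pickle records "call eval('6 * 7')".
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Unpickling runs eval() on the string chosen by whoever made the file.
result = pickle.loads(blob)
print(type(result).__name__, result)
```

The loaded value comes back as the integer 42 rather than a `Payload` instance, which shows that code ran during loading. A format that stores only tensor bytes and metadata, which is the stated goal of .safetensors, has no equivalent hook.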

dukeofdoom3y ago· 1 in thread

I wonder how decorative our world will become as a consequence of how cheap it will become to make art using AI.

I kind of want 3D marble statues and baroque art of a future renaissance everywhere. But I wonder if we will turn minimalistic as a response.

ericd3y ago

I've been wondering about the best way to print large format versions of custom Renaissance-style paintings with goofy subjects for our walls at home. I guess I have to figure out how to best upscale the output first.

visarga3y ago

Of course this is great news, and I hope these models can be fine-tuned to be like lighter versions of ChatGPT. But I remember reading in the LLaMA paper that a small model can still improve when trained beyond the Chinchilla budget.

> For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.

Cerebras says:

> For instance, training a small model with too much data results in diminishing returns and less accuracy gains per FLOP

But this is only of concern when you care about the training cost, such as when you are a budget-limited researcher or a company that doesn't deploy models at scale. When you care about the total cost of deployment, making a small model even better with lots of data is a smart move. In the end it matters more to have the most efficient model in prediction, not the most efficient model in training.
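The training-versus-deployment trade-off can be made concrete with a break-even estimate. This is a sketch under loud assumptions: training costs about 6·N·D FLOPs, inference costs about 2·N FLOPs per served token, and the two models are treated as equally good, which the thread does not establish. The model sizes and token counts are the ones discussed here (7B on 1T tokens, 13B on 260B tokens), and the function names are made up.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Assumed approximation: 6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens


def total_flops(n_params: float, n_tokens: float, served_tokens: float) -> float:
    """Training cost plus an assumed 2 FLOPs per parameter per served token."""
    return training_flops(n_params, n_tokens) + 2.0 * n_params * served_tokens


def break_even_tokens(small, large) -> float:
    """Served tokens at which the smaller, longer-trained model becomes cheaper.

    `small` and `large` are (n_params, n_training_tokens) pairs.
    """
    extra_training = training_flops(*small) - training_flops(*large)
    saving_per_token = 2.0 * (large[0] - small[0])
    return extra_training / saving_per_token


small = (7e9, 1e12)     # smaller model, trained well past 20 tokens/param
large = (13e9, 260e9)   # larger model, trained at 20 tokens/param

tokens = break_even_tokens(small, large)
print(f"break-even after ~{tokens / 1e12:.2f}T served tokens")
```

With these inputs the smaller model's extra training cost is recovered after roughly 1.8 trillion served tokens. Past that point, under the same assumptions, every additional token is cheaper on the smaller model.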

simonw3y ago

You can try out some of these models on Hugging face here: https://huggingface.co/cerebras/Cerebras-GPT-1.3B

That was the largest that had inference enabled - I'd really like to try this one: https://huggingface.co/cerebras/Cerebras-GPT-13B

eternalban3y ago

> It takes substantial technical expertise to train very large models on GPUs. In the recently released GPT-4 Technical Report, OpenAI credits over thirty contributors just for compute infrastructure and scaling.

This is called a silver lining for some (in case you were worried about GPT taking your job). Privacy requirements alone will in the near term force major companies to run their own inference (if not training). The expertise required is nearly identical to that of running large-scale distributed computational graphs.

This is an interesting divergence from what happened with the web. The backends started out simple, before map-reduce and before deconstructing databases and processing distributed logs. With ML, we'll jump right into the complex backends in tandem with easy-picking early-stage edge applications (which we see daily on HN).

mark_l_watson3y ago

Even though I usually use OpenAI's APIs, just because that is the easiest path, I do also use Hugging Face open models (via their APIs, and running locally) and I will check out Cerebras also.

Alternatives are good!

ioulaum3y ago

I wonder if they've done some Alpaca style training on it... Granted, what made Alpaca useful was that it was finetuned with GPT-3's instruction following completions as examples.

And, at least officially, OpenAI's outputs can't be used to train other AI models.

Otherwise, if GPT-4 outputs were used to finetune these models, they may become much more interesting.

rbanffy3y ago

A tangential question: as chiplets become increasingly common, I wonder what Cerebras will do to keep the technological advantage of wafer-scale integration. What are the bandwidth and latency of the connections between the tiles? Is there such a thing as bandwidth per frontier length?

fuzzieozzie3y ago

Cerebras has an efficiency advantage at generating LLMs (assuming IP is open). This is going to be fun to be a part of.

ivanvas3y ago

Is it currently possible to fine-tune any of the available foundation models on a few GB of unsupervised text?

ftxbro3y ago

> Our paper, which will be available soon, will detail our training methods and performance results.

Yay there will be a paper let's gooooooo!

wsgeorge3y ago

I guess it's something. It still goes to show how far open models are behind the proprietary SOTA.

topynate3y ago

From their discord:

> It would be interesting to know why you chose those FLOPS targets, unfortunately it looks like the models are quite under pre-trained (260B tokens for 13B model)

binarymax3y ago

Indeed but this is zero-shot performance. Fine-tuning for a task should get you pretty good results. I'm interested in seeing the results of an Alpaca method against this Cerebras 13B model.

3 more replies

riku_iki3y ago

Have these models been trained on the same dataset? Otherwise it is apples to oranges comparison.

whalesalad3y ago· 5 in thread

This “AI spring” is really snowballing with the crazy nouns and terminology. Alpaca, llama and now chinchilla??

murkt3y ago

Chinchilla actually came before alpaca and llama.

ramesh19943y ago

The term "chinchilla" predates llama/alpaca. It doesn't directly map to a specific model, rather a family of compute-optimal models.

jhbadger3y ago

mometsi3y ago

I think the relevant category is "Adorable Fuzzy Critters of the Andes". See also https://en.wikipedia.org/wiki/Spectacled_bear

whoisnnamdi3y ago

Chinchilla actually came first!

chessgecko3y ago· 4 in thread

I wonder what led to such a gap between llama 7b and Cerebras 13b. I hope they discuss it in the paper.

ftxbro3y ago

From the Cerebras blog post: "Trained using the Chinchilla formula, these models provide the highest accuracy for a given compute budget."

gpm3y ago

Is there a benchmark comparing the two that I missed?

Edit: The huggingface page has 0-shot benchmarks which you can compare against the llama paper

https://huggingface.co/cerebras/Cerebras-GPT-13B

https://arxiv.org/pdf/2302.13971.pdf

freeqaz3y ago

I'm on mobile and struggled to compare these two tables properly. Would you mind posting a summary of your findings?

Here are some values but I don't know what they mean. LLama 60B on the left, Cerebras 13B on the right.

PiQA: 82.8 / 76.6 WinoGrade: 77.0 / 64.6 ARC-e: 78.9 / 71.4

1 more reply

jumpCastle3y ago

Looks like llama 7b was trained on 4 times more tokens.

mometsi3y ago· 3 in thread

Summary: This is a company that makes AI accelerator ICs. They reimplemented Chinchilla and released the model weights under a permissive license.

bogwog3y ago

In other words, they’re actually incentivized to help make LLMs as accessible as possible, rather than try to keep them locked up to hide them from competitors.

Which makes me wonder if Nvidia is doing anything with LLMs too?

vintermann3y ago

NVidia has certainly been pushing the envelope on image generation. StyleGAN3 was really cool when it came out.

But it is an issue that their chips are hardly optimized for LLMs.

option3y ago

Obviously yes, there are larger (20B) models trained on more tokens that NVIDIA has released under a commercial-friendly license on HF

https://huggingface.co/nvidia/nemo-megatron-gpt-20B

Garcia983y ago· 3 in thread

Nonetheless, it's exciting to see all these open models popping up, and I hope that an LLM equivalent to Stable Diffusion comes sooner rather than later.

[0]: https://github.com/declare-lab/flan-alpaca

alchemist1e93y ago

Sounds like you might be the right person to ask the “big” question.

This is for a small organization or individual who is technically competent and wants to try self-hosted inference.

What open model is showing the most promise, and how do its results compare to the various OpenAI GPTs?

Garcia983y ago

Sadly, there's no open model yet that acts like a Swiss Army knife and gets good-enough results for multiple use cases.

ftxbro3y ago

antimatter153y ago· 3 in thread

Looking at their charts, it seems like their 6.7B model is considerably worse than GPT-J, which is an existing open 6B model from a couple of years ago.

I wish rather than stopping training early they would have run more data through a small model so we could have something more competitive with LLaMA 7B.

cs-fan-1013y ago

Someone posted this repost from the Cerebras Discord earlier, but sharing for visibility -

HanClinto3y ago

Sounds like we should crowd-fund the cost to train and open source one of these models with LLaMa-like quality.

I'd chip in!

brucethemoose23y ago

> Cerebras Discord

This is really weird to hear out loud.

I still think of Discord as a niche gaming chatroom, even though I know that (for instance) a wafer scale IC design company is hosting a Discord now.

JamesCoyne3y ago· 3 in thread

Slightly off-topic:

I remember seeing news about the enormous chip Cerebras was/is selling (pdf https://f.hubspotusercontent30.net/hubfs/8968533/WSE-2%20Dat...).

Has there been any indication that the LLMs released in the last few months use exotic hardware like this, or is it all "standard" hardware?

wmf3y ago

OpenAI uses Nvidia GPUs and Google uses their TPUs.

ipsum23y ago

Everyone except Google uses Nvidia for training. Cerebras, Gaudi, and other custom AI accelerators have been unable to surpass Nvidia in performance/$ and performance/watt so far.

brucethemoose23y ago

rnosov3y ago· 2 in thread

[1] https://www.hpcwire.com/2021/09/16/cerebras-wafer-scale-engi....

EDIT: Ok, looks like I've missed the hugging face repo. The language they use is a bit confusing.

simonw3y ago

The PyTorch model files are already available to download from Hugging Face - the largest one looks to be 52GB. They should run on any hardware that can run regular PyTorch models.
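
As a sanity check on that 52GB figure, here is a back-of-the-envelope sketch. It assumes the checkpoint stores plain 32-bit floats and rounds the model to 13 billion parameters; both are my assumptions, not something stated on the model page, and `checkpoint_size_gb` is just an illustrative helper name.

```python
# Rough size of a dense checkpoint: parameter count times bytes per parameter.
# Assumption: weights only, no optimizer state, no compression.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def checkpoint_size_gb(n_params: float, dtype: str = "float32") -> float:
    """Approximate weight-file size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # 13e9 is a rounded parameter count for the largest model (assumption).
    for dtype in BYTES_PER_PARAM:
        size = checkpoint_size_gb(13e9, dtype)
        print(f"13B params as {dtype}: ~{size:.0f} GB")
```

So 13B parameters at 4 bytes each comes to about 52 GB, which lines up with the download size, and half precision would roughly halve the memory needed to hold the weights.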

bubblethink3y ago

You can run inference on GPUs. These are just models and weights.

simonw3y ago· 2 in thread

evanmays3y ago

There’s discussion elsewhere in this thread about what Chinchilla actually means. I’ll only compare it to LLaMA.

TL;DR: Chinchilla isn’t wrong, it’s just useful for a different goal than the LLaMA paper.

You can think of this as an optimization function.

Chinchilla says, if you have a fixed amount of compute, here’s what size and number of tokens to train for maximum performance.

This could explain gaps in performance between Cerebras models of size X and llama models of size X. Llama models of size X have way more compute behind them
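
To put rough numbers on that, here is a minimal sketch. It assumes the common approximation that training compute is about 6 FLOPs per parameter per token, and the usual Chinchilla rule of thumb of about 20 tokens per parameter. The 13B x 20 token count for Cerebras is my assumption from that rule; the 1T tokens for LLaMA 7B is the figure quoted from the LLaMA paper elsewhere in this thread. Function names are illustrative only.

```python
import math

TOKENS_PER_PARAM = 20       # Chinchilla rule of thumb (assumption)
FLOPS_PER_PARAM_TOKEN = 6   # common approximation: C ~= 6 * N * D


def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute for N parameters and D tokens."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def compute_optimal(flops: float) -> tuple[float, float]:
    """Given a compute budget, return (params, tokens) with D = 20 * N.

    Solves C = 6 * N * (20 * N) for N.
    """
    n_params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params


if __name__ == "__main__":
    cerebras_13b = train_flops(13e9, TOKENS_PER_PARAM * 13e9)  # ~260B tokens
    llama_7b = train_flops(7e9, 1e12)                          # ~1T tokens
    print(f"Cerebras 13B: {cerebras_13b:.2e} FLOPs")
    print(f"LLaMA 7B:     {llama_7b:.2e} FLOPs")
    print(f"LLaMA 7B / Cerebras 13B: {llama_7b / cerebras_13b:.2f}x")

    n, d = compute_optimal(llama_7b)
    print(f"Chinchilla-style split of LLaMA 7B's budget: "
          f"{n / 1e9:.1f}B params, {d / 1e9:.0f}B tokens")
```

Under these assumptions the smaller LLaMA 7B still has roughly twice the training compute of Cerebras 13B. LLaMA deliberately spends that compute on a model smaller than the Chinchilla-style split would suggest, which makes it cheaper to run at inference time.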

espadrine3y ago

I don’t think it holds for two reasons.

skybrian3y ago· 1 in thread

What’s in the Pile training data they used? How much source code does it include?

sanxiyn3y ago

https://arxiv.org/abs/2101.00027 is the paper and it includes 95.16 GiB from GitHub.

simonw3y ago· 1 in thread

"Cerebras open sources seven GPT-3 models from 111 million to 13 billion parameters."

I don't understand why they describe them as GPT-3 models here as opposed to calling them GPT models. Or even LLMs - but I guess that acronym isn't as widely recognized.

wsgeorge3y ago

I think GPT-3 is used as a benchmark for performance, so saying a model is on par with GPT-3 should give you an idea of what you can get out of it.

IIRC most open source models to date - including the semi-open LLaMAs - have GPT-3-like performance. Nothing gets close to GPT-3.5 and beyond.

AlexanderTheGr83y ago· 1 in thread

adt3y ago

This is close: a table of LLMs as released, and I try to add repos for the 'open' models: https://lifearchitect.ai/models-table/

wg03y ago· 1 in thread

Noob to ML in practice. These models containing weights, all of them, do they have a standard file/binary format?

examplary_cable3y ago

[I'm not an expert] but I believe .ckpt and .safetensors.
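
For a feel of what's inside: a .ckpt is typically a pickled PyTorch checkpoint, while a .safetensors file is, as I understand the format, an 8-byte little-endian header length, then a JSON header describing each tensor, then the raw tensor bytes. The sketch below hand-rolls that layout with the standard library purely for illustration; `build_blob` and `read_header` are made-up helper names, and real code should use the safetensors library instead.

```python
import json
import struct


def build_blob(tensors: dict) -> bytes:
    """Pack {name: (dtype, shape, raw_bytes)} into a safetensors-style blob."""
    header, payload, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        # Offsets are relative to the start of the data section.
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],
        }
        payload += raw
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    # 8-byte unsigned little-endian length, then JSON header, then data.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_header(blob: bytes) -> dict:
    """Read only the JSON header; no tensor data is interpreted or executed."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8:8 + header_len].decode("utf-8"))


if __name__ == "__main__":
    raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)  # a 2x2 float32 tensor
    blob = build_blob({"w": ("F32", [2, 2], raw)})
    print(read_header(blob))
```

The practical difference is that reading the header is just JSON parsing, whereas loading a pickle-based checkpoint can run arbitrary code, so only load .ckpt files from sources you trust.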

dukeofdoom3y ago· 1 in thread

I wonder how decorative our world will become as a consequence of how cheap it will become to make art using AI.

I kind of want 3D marble statues and baroque art of a future renaissance everywhere. But I wonder if we will turn minimalistic as a response.

ericd3y ago

visarga3y ago

The LLaMA paper says:

> For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.

Cerebras says:

> For instance, training a small model with too much data results in diminishing returns and less accuracy gains per FLOP

simonw3y ago

You can try out some of these models on Hugging face here: https://huggingface.co/cerebras/Cerebras-GPT-1.3B

That was the largest that had inference enabled - I'd really like to try this one: https://huggingface.co/cerebras/Cerebras-GPT-13B

eternalban3y ago

mark_l_watson3y ago

Even though I usually use OpenAI's APIs, just because that is the easiest path, I do also use Hugging Face open models (via their APIs, and running locally) and I will check out Cerebras also.

Alternatives are good!

ioulaum3y ago

I wonder if they've done some Alpaca-style training on it... Granted, what made Alpaca useful was that it was finetuned with GPT-3's instruction-following completions as examples.

And, at least officially, OpenAI's outputs can't be used to train other AI models.

Otherwise, if GPT-4 outputs were used to finetune these models, they may become much more interesting.

rbanffy3y ago

fuzzieozzie3y ago

Cerebras has an efficiency advantage at generating LLMs (assuming IP is open). This is going to be fun to be a part of.

ivanvas3y ago

Is it currently possible to fine-tune any of the available foundation models on a few GB of unsupervised text?

ftxbro3y ago

> Our paper, which will be available soon, will detail our training methods and performance results.

Yay there will be a paper let's gooooooo!
