The cat is out of the bag, the genie is out of the bottle, the confetti has left the cannon[0].
It's tempting to see a world dominated by Google Bard, ChatGPT, Bing Search, etc. And no doubt, they will be huge players, with services that are far more powerful than anything that can be run on the edge.
But. BUT. The things that we can do on the edge are incredible now. Just imagine a year from now, or two. These earth-shattering models, which seem to be upending a whole industry, will soon have equivalents that run on the edge. Without services spying on your data. Without censorship on what the model can/cannot say. Because it's all local.
When was the last time this happened? There will be players who publish weights for models that are free to use. The moment that torrent magnet link is published, it's out in the wild. And smart people will package them as "one click installers" for people who aren't tech-savvy. This is already happening.
So every time you're amazed by something chat-gpt4 says, remember that soon this will be in your pocket.
[0] the "confetti" idiom brought to you by chat-gpt4.
> No results found for "confetti has left the cannon".
I'm amazed that a "stochastic parrot" can come up with such a beautiful idiom.
ME: What's another saying similar to: "The cat is out of the bag" or "The genie is out of the bottle"?
chatgpt: Another similar saying is "the toothpaste is out of the tube." Like the other two expressions, this phrase conveys the idea that something has been revealed or unleashed, and it cannot be undone or put back into its original state.
ME: Can you invent a new phrase that has the same meaning, which is completely new?
chatgpt: Certainly! Here's a new phrase with a similar meaning: "The confetti has left the cannon." This phrase suggests that once something is out in the open or has happened, it's impossible to reverse or contain it, much like confetti that has been shot out of a cannon during a celebration.
ME: What's the name for this kind of expression? Aphorism? Fable?
chatgpt: This kind of expression is known as an idiom. Idioms are phrases or expressions with a figurative meaning that is different from their literal meaning. They are commonly used in everyday language to convey a specific message or sentiment in a more colorful or memorable way.
https://www.instagram.com/p/CQdBiVyh5C2/?hl=en
Now that the cat is out of the bag, or, should I say the confetti is out of the… can?
I thought running something on the edge referred to running it in close network proximity to the user, rather than users having control and running things themselves.
What I find more interesting is that in the classic "close network proximity", some parts of the world may not have benefited as much from that trend since the closest nodes of a global delivery network could be several countries away.
[0] https://github.com/ggerganov/llama.cpp/discussions/205
[1] https://medium.com/sort-of-like-a-tech-diary/consumer-ai-is-...
Somewhat; it's consistent with, e.g., Google’s “Edge TPU” designation for its client-side neural processors.
> I thought running something on the edge referred to running it in close network proximity to the user
Typically, but on the client device is the limit-case of “close network proximity to the user”, so the use is consistent.
[1] from https://www.oblomovka.com/wp/2007/08/ at least
Wouldn't that be nice? It would also be contrary to all experience of the outcomes and pulls of corporations in modern society. The "local" LLMs will be on the fringe more than at the edge, because the ones that work the best and attract the most money will be the ones controlled by walled-garden "ecosystems."
I really hope it's different. I really hope there are local models. Actual personal assistants actually designed to assist their users and not the people that provide the access.
> I like to think (right now please!) of a cybernetic forest filled with pines and electronics where deer stroll peacefully past computers as if they were flowers with spinning blossoms.
That is, a real OpenAI with an open government body.
I imagine this "cat out of the bag" situation, the democratization and commodification of powerful technology accessible and affordable to the public, is similar to what's happening with single-board computers and microcontrollers like Raspberry Pi, Arduino, ESP32.
It might be similar to what happened with mobile phones, but there the power was quite restricted. The (mostly) duopoly of iOS and Android, with devices and apps locked down in various ways. Sure we can "jail break" and "root" our phone, but that's not for the general public.
Maybe solar energy production is going through a similar process, with panels and batteries becoming more efficient and affordable every year.
Certainly, it reminds one of the history of personal computers, the way such a powerful general-purpose tool became ubiquitous and local.
These local models are absolutely critical for that to happen though. I'm hitting daily situations where I have to reconsider my use of ChatGPT because what I'm asking would leak very private personal information or somehow trip its morality filter.
Just swapped in a 2TB nvme for a separate /home mount and reinstalled Arch just so I could have enough disk space to test a dozen models locally. I'm so ready!
Empowering everyone equally, very low barrier of entry. The spread of AI will make some skills become less important and other skills more important. There will be painful social changes. Memorising stuff, being exact, being careful, diligent - all these things can be done by LLM. But also creative recombination of ideas. For now we still hold supremacy in long form content, complex software projects and detecting inconsistencies, but they could be emergent qualities in GPT-5.
Integer-quantized LLaMA can fit on a 4090, and isn't quite human-level. If a fully aligned, reliably non-hallucinating, Turing-grade LLM can run on the next-generation 5-series GPUs, then expect their market price to quadruple or more.
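For a rough sense of what "fits on a 4090" means, here is a back-of-the-envelope sketch. It assumes the card has 24 GB of VRAM and counts only the weights (parameters × bits ÷ 8), ignoring the KV cache, activations, and framework overhead, so real headroom is smaller. The model sizes are the nominal LLaMA parameter counts; the helper name is made up for illustration.

```python
def weight_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

VRAM_GIB = 24  # assumption: RTX 4090, treating 24 GB as roughly 24 GiB

for name, n_params in [("7B", 7e9), ("13B", 13e9), ("33B", 33e9), ("65B", 65e9)]:
    for bits in (16, 8, 4):
        size = weight_gib(n_params, bits)
        verdict = "fits" if size < VRAM_GIB else "does not fit"
        print(f"LLaMA {name} @ {bits:>2}-bit: {size:6.1f} GiB -> {verdict}")
```

Under these assumptions, quantization is what moves the larger checkpoints from "does not fit" to "fits", which is why integer-quantized weights matter so much for consumer cards.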
Is the curve of what this class of algorithms can provide sigmoid? If so, then yeah, eventually researchers should be able to democratize it sufficiently that the choice to use versions that can run on private hardware is rational. But if the utility increases linearly or better over time/scale, the future will belong to whoever owns the biggest datacenters.
Any projects I can follow? Because I haven't seen any one click installers yet that didn't begin with "first install a package manager on the command line"
I have not tried those 1-click installers, though; I have been running it manually instead.
That project is based on the concept of this Stable Diffusion project: https://github.com/AUTOMATIC1111/stable-diffusion-webui
Which is a few months ahead (because the Stable Diffusion tech happened a few months earlier) and is definitely at a point where anyone can easily run it, locally or on a hosted environment.
I expect this "text-generation-webui" (or something like it) will be just as easy to use in the near, near future.
I want to believe you, but I'm ignorant of the hardware requirements for these things. How soon do you think we'd be able to run something reasonably gpt4-like on, say, a 4090?
The mirror has shattered.
The poop has hit the propeller.
Pandora's box has opened.
https://www.anandtech.com/show/16626/cerebras-unveils-wafer-...
Each individual "chip" has 40GB of SRAM vs ~76MB for the Nvidia H100, and networked pools of external RAM, SSDs and such. That's why the training architecture is so different.
There's a comparison picture there of one of their chips alongside a regular GPU chip. Effectively they use up the entire wafer.
A 16x CS2 cluster like they describe is like a huge Nvidia cluster in terms of throughput, but more like a single Nvidia node structurally.
Cerebras Second-Gen Wafer Scale Chip: 2.6 Trillion 7nm Transistors, 850,000 Cores, 15kW of Power - https://www.tomshardware.com/news/cerebras-wafer-scale-engin...
Trying to cool 15kW to 20kW of power is also rather impressive. https://www.cerebras.net/cs2virtualtour - the engine block and cooling manifold
> The challenge of extracting more than 20 kW of heat from the wafer was solved by having the wafer "float" on a cold plate. The wafer is allowed to expand and contract while remaining in contact with the polished front side of the cold plate, despite the different coefficients of thermal expansion of copper and silicon. The cold plate is much more than a slab of metal: advanced computational fluid dynamics modelling was used to design a labyrinth of coolant channels capable of maintaining a precise, stable, thermal environment even as 850,000 AI-optimized cores swing into action.
> The power density of the CS-2 is too high for direct air cooling, so liquid cooling is used instead. The internal manifold transfers heat between the CS-2 system's internal coolant and facilities water. Separating these two fluids ensures that the CS-2 system is not affected by changes in the quality of facilities water and that the very highest-quality coolant circulates through the cold plate.
> The two pump modules plug into the upper four dry-break connectors. The lower two are for the air-cooling or water-cooling heat exchanger.
I'm confused as to why 111 million parameter models are trained with the Chinchilla formula. Why not scale up the training data? If you're training smaller models, surely optimizing performance is better than optimizing total compute.
Seems like a silly misunderstanding of the Chinchilla paper, but I'm sure I'm missing something
Money quote for those who don't want to read the whole thing:
'''
When people talk about training a Chinchilla-optimal model, this is what they mean: training a model that matches their estimates for optimality. They estimated the optimal model size for a given compute budget, and the optimal number of training tokens for a given compute budget.
However, when we talk about “optimal” here, what is meant is “what is the cheapest way to obtain a given loss level, in FLOPS.” In practice though, we don’t care about the answer! This is exactly the answer you care about if you’re a researcher at DeepMind/FAIR/AWS who is training a model with the goal of reaching the new SOTA so you can publish a paper and get promoted. If you’re training a model with the goal of actually deploying it, the training cost is going to be dominated by the inference cost. This has two implications:
1) there is a strong incentive to train smaller models which fit on single GPUs
2) we’re fine trading off training time efficiency for inference time efficiency (probably to a ridiculous extent).
Chinchilla implicitly assumes that the majority of the total cost of ownership (TCO) for a LLM is the training cost. In practice, this is only the case if you’re a researcher at a research lab who doesn’t support products (e.g. FAIR/Google Brain/DeepMind/MSR). For almost everyone else, the amount of resources spent on inference will dwarf the amount of resources spent during training.
'''
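One rough way to put numbers on the quoted argument is sketched below. It relies on two common rules of thumb that are assumptions here, not claims from the quote: training costs about 6 FLOPs per parameter per token, and inference about 2 FLOPs per parameter per generated or processed token. The 13B / 260B-token example is the Cerebras configuration mentioned elsewhere in this thread; the function names are made up for illustration.

```python
def training_flops(n_params: float, train_tokens: float) -> float:
    """Rule-of-thumb training cost: ~6 FLOPs per parameter per token (assumption)."""
    return 6 * n_params * train_tokens

def inference_flops(n_params: float, served_tokens: float) -> float:
    """Rule-of-thumb inference cost: ~2 FLOPs per parameter per token (assumption)."""
    return 2 * n_params * served_tokens

def breakeven_served_tokens(n_params: float, train_tokens: float) -> float:
    """Tokens you must serve before inference compute equals training compute."""
    return training_flops(n_params, train_tokens) / (2 * n_params)

n_params, train_tokens = 13e9, 260e9
breakeven = breakeven_served_tokens(n_params, train_tokens)
print(f"inference overtakes training after ~{breakeven:.3g} served tokens")
```

Under these approximations the break-even point is simply three times the number of training tokens, regardless of model size. Whether a given deployment ever serves that many tokens (and how many training runs were needed to get there) is exactly what the replies below argue about.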
I'm not so convinced, especially if people are doing multiple training runs for hyperparameter tuning, cleaning data, fixing bugs, etc.
I would be very interested in knowing what portion of OpenAI's compute budget is training. I would not be surprised if it was a significant minority.
That's only true for general-mass-consumer models.
Companies that fine-tune or train their own models for narrow use cases with few users (possibly only internal staff) will find that training cost is a substantial chunk of the TCO.
They want you to think it's reasonable that because the line is so straight (on a flops log scale) for so long, it could be tempting to extrapolate the pile-loss consequences of continuing compute-optimal training for larger models beyond their largest 13B one, with the obvious caveat that the extrapolation can't continue linearly much further if for no other reason than the test loss isn't going to go below zero (it will flatten out sooner than that).
If you trained beyond compute-optimality on smaller models, it would mess up their straight line and make it look like we are sooner hitting diminishing returns on test loss.
Isn’t the test loss logarithmic? If so it sure can go below zero.
As an example, the BERT/RoBERTa family were trained for much longer than Chinchilla would suggest; you do get diminishing returns, though.
There is a point of overtraining where downstream performance is impacted but that’s pretty high.
I think part of the answer to this is also that xxx million parameter decoder-only models don’t seem to be that useful so it may not be worthwhile to optimize them for performance?
(it's all blurry)
https://www.cerebras.net/wp-content/uploads/2023/03/Downstre...
https://www.cerebras.net/wp-content/uploads/2023/03/Scaling-...
https://www.cerebras.net/wp-content/uploads/2023/03/Scaling-...
EDIT: Looks like it scores better with less training, up until it matches GPT-J/Pythia/OPT, and doesn't appear to have much benefit beyond that. It maybe scores slightly better than GPT-J, which is pretty "eh"; I'm not sure GPT-J-level performance is really useful for anything. NeoX 20B outperforms it in everything if you don't care about the amount of training needed.
Does the better performance for less training matter if that benefit only applies while it's performing a lot worse than GPT-J? It appears to lose its scaling benefits before the performance is interesting enough to matter.
edit: scratch that, it seems the AJAX endpoint returns 504 more often than not.
But I'm not sure anymore that it wasn't initially blurry... Perhaps I'm hallucinating, like large language models.
Current image displayed is https://www.cerebras.net/wp-content/uploads/2023/03/Scaling-... , will see if it changes.
Really, really bad mark on whoever is in charge of their web marketing. Images should never look that bad, not even in support, but definitely not in marketing.
edit: so this post is more useful, 4k res using Edge browser
The first one is whether they would actually sue. The optics would be terrible. A similar situation occurred in the 90s when the RC4 cipher’s code was leaked. Everyone used the leaked code under the name ARC4 (or ARCFOUR), treating it as if it were a new cipher, even though they had confirmation from people who licensed the cipher that its output was identical. Nobody was sued, and RSA never acknowledged it.
The second one is related to the terms. The LLaMA weights themselves are licensed under terms that exclude commercial use:[0]
> You will not […] use […] the Software Products (or any derivative works thereof, works incorporating the Software Products, or any data produced by the Software), […] for […] any commercial or production purposes.
But the definition of derivative works is gray. AFAIK, if LLaMA is distilled, there is an unsettled argument to be had that the end result is not a LLaMA derivative, and cannot be considered copyright or license infringement, similar to how models trained on blog articles and tweets are not infringing on those authors’ copyright or licensing. The people that make the new model may be in breach of the license if they agreed to it, but maybe not the people that use that new model. Otherwise, ad absurdum, a model trained on the Internet will have content that was generated by LLaMA in its training set, so all models trained on the Internet after Feb 2023 will break the license.
IANAL, but ultimately, Meta wins more by benefiting from what the community contributes on top of their work (similar to what happened with React), than by suing developers that use derivatives of their open models.
[0]: https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z...
With 96GB of memory you should also be able to fine-tune it (possibly some tricks like gradient accumulation and/or checkpointing might be needed), but you have to be ready for many days of computation...
I was thinking: we have API prices in tokens, and now it looks like self-hosted inference on high-end GPUs is feasible for similar models. So, based on electricity prices, there will be a self-hosted price in tokens too. How close are these already? What is OpenAI's markup today over roughly the raw electricity cost?
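The arithmetic for the self-hosted side is simple enough to sketch. Every number below is a placeholder rather than a measurement: the power draw, throughput, and electricity price are hypothetical, and the calculation ignores hardware cost, idle time, cooling, and batching.

```python
def electricity_cost_per_million_tokens(
    power_watts: float, tokens_per_second: float, usd_per_kwh: float
) -> float:
    """Electricity-only cost (USD) of generating one million tokens."""
    kwh_per_second = power_watts / 1000 / 3600
    usd_per_token = kwh_per_second * usd_per_kwh / tokens_per_second
    return usd_per_token * 1_000_000

# Hypothetical setup: one GPU drawing 450 W, 20 tokens/s, $0.15 per kWh.
cost = electricity_cost_per_million_tokens(450, 20, 0.15)
print(f"~${cost:.2f} per million tokens (electricity only)")
```

Plugging in your own measured throughput and local electricity rate gives a floor for self-hosted cost that can be compared against a provider's published per-token price.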
Model             BoolQ  PIQA  SIQA  HellaSwag  WinoGrande  ARC-e  ARC-c  OBQA
LLaMA 13B         78.1   80.1  50.4  79.2       73          74.8   52.7   56.4
Cerebras-GPT 13B  -      76.6  -     51.3       64.6        71.4   36.7   28.6
> It would be interesting to know why you chose those FLOPS targets, unfortunately it looks like the models are quite under pre-trained (260B tokens for 13B model)
> We chose to train these models to 20 tokens per param to fit a scaling law to the Pile data set. These models are optimal for a fixed compute budget, not necessarily "best for use". If you had a fixed parameter budget (e.g., because you wanted to fit models on certain hardware) you would train on more tokens. We do that for our customers that seek that performance and want to get LLaMA-like quality with a commercial license
Which is the point made elsewhere in these comments, e.g. https://news.ycombinator.com/item?id=35344192, and also usefully shows how open Cerebras are. They're pretty open, but not as much as they would be if they were optimising for filling in other companies' moats.
Every new variation of model gets some new name, just like every library gets a new name. There were all kinds of BERTs before - DistilBert, Roberta, SciBERT, Schmobert, Schmuber, etc. Many hundreds of them, I think.
From the Cerebras blog post: "Trained using the Chinchilla formula, these models provide the highest accuracy for a given compute budget."
From the LLaMA paper: "The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than what is typically used."
Edit: The huggingface page has 0-shot benchmarks which you can compare against the llama paper
Here are some values, but I don't know what they mean. LLaMA 65B on the left, Cerebras 13B on the right.
PIQA: 82.8 / 76.6, WinoGrande: 77.0 / 64.6, ARC-e: 78.9 / 71.4
Which makes me wonder if Nvidia is doing anything with LLMs too?
But it is an issue that their chips are hardly optimized for LLMs.
Nonetheless, it's exciting to see all these open models popping up, and I hope that an LLM equivalent to Stable Diffusion comes sooner rather than later.
A question for a small organization or individual who is technically competent and wants to try self-hosted inference:
What open model is showing the most promise, and how do its results compare to the various OpenAI GPTs?
A simple example problem would be asking for a summary of code. I’ve found openAI’s GPT 3.5 and 4 to give pretty impressive english descriptions of code. Running that locally in batch would retain privacy and even if slow could just be kept running.
Sadly, there's no open model yet that acts like a Swiss knife and gets good-enough results for multiple use cases.
I wish rather than stopping training early they would have run more data through a small model so we could have something more competitive with LLaMA 7B.
"We chose to train these models to 20 tokens per param to fit a scaling law to the Pile data set. These models are optimal for a fixed compute budget, not necessarily "best for use". If you had a fixed parameter budget (e.g., because you wanted to fit models on certain hardware) you would train on more tokens. We do that for our customers that seek that performance and want to get LLaMA-like quality with a commercial license"
I'd chip in!
This is really weird to hear out loud.
I still think of Discord as a niche gaming chatroom, even though I know that (for instance) a wafer scale IC design company is hosting a Discord now.
I remember seeing news about the enormous chip Cerebras was/is selling (pdf https://f.hubspotusercontent30.net/hubfs/8968533/WSE-2%20Dat...).
Has there been any indication that the LLMs released in the last few months use exotic hardware like this, or is it all "standard" hardware?
[1] https://www.hpcwire.com/2021/09/16/cerebras-wafer-scale-engi....
EDIT: Ok, looks like I've missed the hugging face repo. The language they use is a bit confusing.
Tldr; Chinchilla isn’t wrong, it’s just useful for a different goal than the llama paper.
There are three hyperparameters to tweak here: model size (parameter count), number of tokens pre-trained on, and amount of compute available. End performance is, in theory, a function of these three hyperparameters.
You can think of this as an optimization function.
Chinchilla says, if you have a fixed amount of compute, here’s what size and number of tokens to train for maximum performance.
A lot of the time, though, we have a fixed model size, because size impacts inference cost and latency. LLaMA operates in this territory: they chose to fix the model size instead of the amount of compute.
This could explain gaps in performance between Cerebras models of size X and llama models of size X. Llama models of size X have way more compute behind them
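The two regimes can be sketched in a few lines. This is a simplification under stated assumptions: training compute is approximated as C ≈ 6·N·D (a common rule of thumb, not something this thread states), and "Chinchilla-optimal" is reduced to the 20-tokens-per-parameter ratio quoted from Cerebras elsewhere in the thread. The actual Chinchilla fits are more nuanced, and the function names are made up for illustration.

```python
import math

TOKENS_PER_PARAM = 20  # ratio quoted by Cerebras in this thread

def compute_optimal(compute_flops: float) -> tuple[float, float]:
    """Fixed compute: solve C = 6 * N * (20 * N) for N, then D = 20 * N."""
    n_params = math.sqrt(compute_flops / (6 * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params

def tokens_for_fixed_size(compute_flops: float, n_params: float) -> float:
    """Fixed model size: spend the whole budget on tokens, D = C / (6 * N)."""
    return compute_flops / (6 * n_params)

# Budget implied by a 13B model trained on 260B tokens.
budget = 6 * 13e9 * 260e9

n_opt, d_opt = compute_optimal(budget)
d_small = tokens_for_fixed_size(budget, 7e9)

print(f"fixed compute: {n_opt / 1e9:.1f}B params on {d_opt / 1e9:.0f}B tokens")
print(f"fixed 7B size: {d_small / 1e9:.0f}B tokens ({d_small / 7e9:.0f} tokens/param)")
```

With the same budget, the smaller model sees far more tokens per parameter. It is not compute-optimal, but it is cheaper to serve, which is the trade-off the LLaMA side of the comparison is making.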
First, it only holds for a given architecture and implementation. Obviously, a different architecture will have a different training slope. This is clear when comparing LSTM with Transformers, but is also true between transformers that use prenorm/SwiGLU/rotary-positional, and those that follow Vaswani 2017.
In terms of implementation, some algorithms yield the same result with fewer operations (IO, like FlashAttention and other custom CUDA kernels, and parallelism, like PaLM, which both came after Chinchilla), which unambiguously affect the Tflops side of the Chinchilla equation. Also, faster algorithms and better parallelization will yield a given loss sooner, while less power-hungry setups will do that cheaper.
Second, even in the original Chinchilla paper in figure 2, some lines are stopped early before reaching Pareto (likely because it ran out of tokens, but LLaMA makes it seem that >1 epoch training is fine).
I don't understand why they describe them as GPT-3 models here as opposed to calling them GPT models. Or even LLMs - but I guess that acronym isn't as widely recognized.
IIRC most open source models to date - including the semi-open LLaMAs - have GPT-3-like performance. Nothing gets close to GPT-3.5 and beyond.
The problem with .ckpt is that loading it can execute arbitrary code on your machine (very unsafe), while .safetensors was made by Hugging Face as a safe format for storing the weights. I've also seen people load LLaMA 7B from a .bin file.
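The reason is that such checkpoints are typically Python pickles, and unpickling is allowed to call arbitrary callables. A minimal sketch using only the standard library is below; the `Payload` class is a hypothetical stand-in for something hidden in a checkpoint, and the "attack" is a deliberately harmless `eval` of an arithmetic expression.

```python
import pickle

class Payload:
    """Hypothetical stand-in for a malicious object inside a checkpoint."""

    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('40 + 2')".
        # A real attack would call something far less harmless.
        return (eval, ("40 + 2",))

blob = pickle.dumps(Payload())  # what an attacker would ship as a file
result = pickle.loads(blob)     # the victim merely "loads the weights"

# The loader never gets a Payload back; it gets whatever the call returned.
print(type(result).__name__, result)
```

Nothing in the loading step asks for permission: the call happens as a side effect of `pickle.loads`. A format that stores only tensor data and metadata has no equivalent hook, which is the point of .safetensors.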
I kind of want 3D marble statues and baroque art of a future renaissance everywhere. But I wonder if we will turn minimalistic as a response.
> For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.
Cerebras says:
> For instance, training a small model with too much data results in diminishing returns and less accuracy gains per FLOP
But this is only a concern when you care about the training cost, such as when you are a budget-limited researcher or a company that doesn't deploy models at scale. When you care about the total cost of deployment, making a small model even better with lots of data is a smart move. In the end it matters more to have the most efficient model in prediction, not the most efficient model in training.
That was the largest that had inference enabled - I'd really like to try this one: https://huggingface.co/cerebras/Cerebras-GPT-13B
This is a silver lining for some (in case you were worried about GPT taking your job). Privacy requirements alone will, in the near term, force major companies to run their own inference (if not training). The expertise required is nearly identical to that of running large-scale distributed computational graphs.
This is an interesting divergence from what happened with the web. There, the backends started out simple, before map-reduce, deconstructed databases, and distributed log processing. With ML, we'll jump right into complex backends in tandem with easy-pickings early-stage edge applications (which we see daily on HN).
Alternatives are good!
And, at least officially, OpenAI's outputs can't be used to train other AI models.
Otherwise, if GPT-4 outputs were used to finetune these models, they may become much more interesting.
Yay there will be a paper let's gooooooo!