The one-l lama,
He's a priest.
The two-l llama,
He's a beast.
And I will bet
A silk pajama
There isn't any
Three-l lllama.

As a non-native English speaker (though also a parent of a toddler) I wasn't familiar with the book series.
There's really only one thing I care about: How does this compare to GPT-4?
I have no use for models that aren't at that level. Even though this almost definitely isn't at that level, it's hard to know how close or far it is from the data presented.
The big story here for me is that the difference in training set is what makes the difference in quality. There is no secret sauce; the open-source architectures do well, provided you give them a large and diverse enough training set. That would mean it is just a matter of pooling resources to train really capable open-source models. That makes what RedPajama is doing, compiling the best open dataset, very important for the future of high-quality open-source LLMs.
If you want to play around with this yourself you can install oobabooga and figure out what model fits your hardware from the LocalLLaMA subreddit wiki. The llama.cpp 7B and 13B models can be run on CPU if you have enough RAM. I’ve had lots of fun talking to 7B and 13B Alpaca and Vicuna models running locally.
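For the CPU route, running one of the quantized GGML models with llama.cpp looks roughly like this. The repo URL is real, but the model path, quantization suffix, and flag values below are illustrative; substitute whatever model file you actually downloaded:

```shell
# Build llama.cpp and run a quantized 7B model on CPU.
# Model filename and quantization variant (q4_0) are examples.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# -t: CPU threads to use, -n: number of tokens to generate
./main -m ./models/7B/ggml-model-q4_0.bin \
  -t 8 -n 256 \
  -p "Explain how a llama differs from an alpaca."
```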
It's really fun to enable both the whisper extension and the TTS extension and have two-way voice chats with your computer while being able to send it pictures as well. Truly mind bending.
Quantized 30B models run at acceptable speeds on decent hardware and are pretty capable. It's my understanding that the open source community is iterating extremely fast on small model sizes getting the most out of them by pushing the data quality higher and higher, and then they plan to scale up to at least 30B parameter models.
I really can't wait to see the results of that process. In the end you're going to have a 30B model that's totally uncensored and is a mix of Wizard + Vicuna. It's going to be a veryyyy capable model.
Bigger ones as well, you just have to wait longer. Nothing for real time usage, but if you can wait 10-20 minutes, you can use them on CPU.
For example, a therapist, a search bot for your diary, or a company intranet help bot. Anything where the prompt contains something you don’t want to send to a third party.
Thanks!
I'd assume a truly competitive model in the open-source world is still a ways off. These teams and their infrastructure are still in their early days while OpenAI is more at the fine-tuning and polishing stage. The fact that these open teams are able to have something in the same universe in terms of functionality this fast is pretty amazing... but it will take time before there's an artifact that will be a strong competitor.
I'll give you the answer for every open source model over the next 2 years: it's far worse.
I suspect Open Source LLMs will outpace the release version of GPT-4 before the end of this year.
It's less likely they will outpace whatever version of GPT-4 is shipped later this year, but still very much possible.
Open source models can already approximate GPT-3.5 for most tasks on common home hardware, right now.
On one hand, the resources required to run these models continues falling dramatically, thanks to the techniques discovered by researchers: GPTQ quantizing down to 4, 3, 2, even 1 bits! model pruning! hybrid vram offloading! better, more efficient architectures! 1-click finetuning on consumer hardware! Of course, the free lunches won't last forever, and this will level off, but it's still incredible.
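The core idea behind that quantization win can be sketched in a few lines. This is not GPTQ itself (GPTQ is cleverer: it picks codes to minimize layer output error), just naive round-to-nearest, but the storage arithmetic is the same: 4 bits per weight instead of 32 is an 8x smaller footprint for a modest reconstruction error. A toy sketch:

```python
import numpy as np

def quantize_4bit(w):
    """Naive symmetric round-to-nearest 4-bit quantization.
    Returns int codes in [-8, 7] plus one float scale per tensor."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map the 4-bit codes back to approximate float weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 1024)).astype(np.float32)

q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale)

# 4 bits per weight instead of 32, at the cost of some error.
rel_err = np.abs(w - w_hat).mean() / np.abs(w).mean()
print(f"mean relative error: {rel_err:.3f}")
```

Real schemes use one scale per small group of weights rather than per tensor, which shrinks the error considerably.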
And on the other side of the coin, the power of all computing devices continues its ever-upward exponential growth.
So you have a continuous lowering of requirements, combined with a continuous increase in available power... surely these two trends will collide, and I can only imagine what this stuff will be like at that intersection.
Furthermore, model size is still the most significant contributor to output quality. E.g. vanilla llama-30b at 4-bit has better perplexity than any llama-13b finetune at 8-bit. Thus, if 4-bit lets you fit a larger model into available (V)RAM, you're still better off.
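The back-of-the-envelope arithmetic behind that trade-off: weight memory is just parameter count times bits per weight (ignoring activations, KV cache, and runtime overhead), so dropping to 4-bit is what squeezes a 30B model into hobbyist (V)RAM. A rough sketch:

```python
def weight_gb(params_billion, bits):
    """Approximate weight memory in GB: parameters x bits / 8.
    Ignores activations, KV cache, and runtime overhead."""
    return params_billion * 1e9 * bits / 8 / 1e9

# 30B @ 4-bit (~15 GB) is in the same ballpark as 13B @ 8-bit
# (~13 GB), but the bigger model has the better perplexity.
for params, bits in [(13, 16), (13, 8), (30, 16), (30, 4)]:
    print(f"{params}B @ {bits}-bit: ~{weight_gb(params, bits):.0f} GB")
```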
This is also why analog computing is seriously considered as a hardware architecture for LLMs: if you don't actually need bit-perfect matmul for things to work well, it can be done much simpler as an analog circuit, and then you can cram a lot more of them on the same chip. Any resulting quality loss would presumably be minor, and in any case would be more than compensated by the much larger model sizes allowed by such architecture.
The weights scale the output values from the previous layer, and the weighted values are summed. So it seems to me, instead of having a high-precision weight scale a single output, if you cloned the node in the previous layer M times, you could still have sqrt(M) bits of precision with 1-bit weights (or M bits, my brain is in weekend mode).
Thus a larger network with lower-precision weights should have the ability to have approximately the same precision as a smaller network with high-precision weights.
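That intuition can be checked with a toy: approximate one real-valued weight by the mean of M one-bit (+/-1) weights. Chosen deterministically, M bits give M+1 representable levels, so the error is at most 1/M, roughly log2(M) bits of resolution; the sqrt(M) figure in the parent corresponds to choosing the signs stochastically. A minimal sketch:

```python
import numpy as np

def binary_approx(w, M):
    """Approximate a scalar weight w in [-1, 1] as the mean of M
    one-bit weights: k of them +1, the remaining M - k of them -1."""
    k = round((w + 1) * M / 2)        # number of +1 bits
    bits = np.array([1.0] * k + [-1.0] * (M - k))
    return bits.mean()

w = 0.437
for M in (8, 64, 512):
    approx = binary_approx(w, M)
    print(f"M={M:4d}: approx={approx:+.4f}, error={abs(w - approx):.4f}")
```

The error shrinks linearly in M, which is the sense in which width can substitute for weight precision.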
The larger network has more interconnects though, so seems like it could allow for more interesting space to explore during training, leading to better results.
Then again, I could be entirely wrong.
We’re finding out that many models are undertrained for their sizes, and a good option is to post-process them into smaller models by teaching a smaller model to mimic their output. Quantization effectively cuts down the model size as well; if it causes no loss in quality, the model has not been trained enough to take advantage of the depth of precision available.
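The mimicry described above is usually trained with a distillation loss: the KL divergence between the teacher's and student's temperature-softened output distributions. A framework-free sketch with toy logits (not a real training loop):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    the standard knowledge-distillation objective."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return (p * (np.log(p) - np.log(q))).sum(axis=-1).mean()

teacher      = np.array([[4.0, 1.0, 0.5]])   # toy logits
good_student = np.array([[3.9, 1.1, 0.4]])   # nearly mimics the teacher
bad_student  = np.array([[0.5, 4.0, 1.0]])   # disagrees with the teacher

print(distill_loss(good_student, teacher))   # small
print(distill_loss(bad_student, teacher))    # much larger
```

Minimizing this loss over a large prompt set is what transfers the big model's behavior into the small one.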
We can use GPS to locate anything down to a sliding scale of decimal precision. There are only so many digits you need to locate a city or even a house.
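The analogy in numbers: one degree of latitude is roughly 111 km, so each extra decimal place narrows the location by a factor of ten, and past a few digits the added precision stops mattering, much like extra weight bits. A quick sketch:

```python
# One degree of latitude is roughly 111 km, so each extra decimal
# place in a coordinate narrows the location by a factor of 10.
KM_PER_DEGREE = 111  # approximate; varies slightly with latitude

for decimals in range(6):
    resolution_m = KM_PER_DEGREE * 1000 / 10 ** decimals
    print(f"{decimals} decimal places: ~{resolution_m:,.0f} m")
```

Three decimals already pins down a city block; four, a house.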
As the resources required to train and fine-tune these models become consumer-hardware friendly, I think we'll see a shift towards a bunch of smaller models. Open models like these also mean the results of security and capability research are publicly available. Models like this one and the Replit code model will become the new base all open-source models are built on. I am really looking forward to the GPT-J 4-bit, CUDA-optimized 7B models; the others I have tested run fast on a 2070 Max-Q and 16GB RAM, where I was getting ~7 tokens/second. LoRA can work directly with 4-bit quantized models. While GGML CPU models are very strong, I don't believe we'll move away from GPU-accelerated training and fine-tuning anytime soon.
LLaMA’s main issue is that its license prevents commercial use.
If you want to use a LLM inside of a product, you may need to internationalize it at some point, so multilingual support matters.
Let's wait for someone to port it to a cheaper and more powerful C-based engine like llama.cpp.
Build a model that can change the number of parameters in the vicinity of some meaning, effectively increasing the local resolution around that meaning.
So parameter space becomes linked-parameter space, between models.
Links could be pruned based on activation frequency.
Another way of seeing the concept is a tree of models/LLMs,
and one additional model/LLM whose only job is to manage the tree (i.e. build it as it goes, use it to infer, prune it, etc.).
Or is what I'm saying too dumb?
The 3B model, being super fast and accessible, is a game changer for a lot of us who may not have the latest hardware. I mean, running on an RTX 2070 that was released 5 years ago? That's pretty cool.
As for the 7B model, it's great to see that it's already outperforming the Pythia 7B. The bigger dataset definitely seems to be making a difference here. I'm eager to see how far this project goes, and what kind of improvements we can expect in the coming weeks with the new RedPajama dataset they're working on.
One thing I found interesting is the mention of differences between the LLaMA 7B and their replication. I'd love to learn more about those differences, as it could shed light on what's working well and what could be improved further.
I played with a pirated 7B model a while back. My computer runs a 1080 Ti - so it used to be good but now it's pretty old. The model ran with a reasonable number of tokens/sec, but the quality was just trash compared to what I'd grown used to with ChatGPT. It was a novelty I interacted with for just a single evening.
I truly don't understand the use case for a 3B model with our current technologies.
What are you going to use it for?
Also, ChatGPT just can't do a lot of things because of their "rules". I was doing question answering about products on Amazon with ChatGPT and it refused to answer any questions about underwear, certain books/videos, etc.
Would the way the M2 MacBooks share memory be an advantage, or would the lack of CUDA support be a killer? Can you do anything with 16GB, or do you need 128GB or something like that? How large are the datasets?
I've only used scikit-learn and pandas so far, I'm not very familiar with neural networks yet
Sure, you may have played with a 7B model in the past, but that doesn't mean there's no use case for a smaller model like the 3B. In fact, having a performant, smaller model is a game changer for a lot of applications that don't require the massive scale of the larger models. Plus, smaller models are generally faster and more accessible, which is always a plus.
I find it very uncanny to see comments like this that sound like ChatGPT but are surprisingly relevant to the discussion.