I don't want to use GPT since the project will be using personal information to train/fine tune the models.
LLMs take so much engineering effort, research, and compute that it's unlikely there will be good open source alternatives in the near future. Right now your only real option is OpenAI (or maybe Anthropic) and that seems unlikely to change anytime soon.
The only reason we have LLaMA is because Meta threw us a bone. They might not do that again.
> The only reason we have LLaMA is because Meta threw us a bone
IMO, this is pretty inaccurate; you can look at my other post in the thread to see how many other recent and ongoing projects there are. The training data sets (The Pile, The Stack, LAION, etc.) are publicly available and have been shown to be able to train very high quality models (and some groups committed to open models, like Stability AI and Hugging Face, are fairly well capitalized).
Training and fine-tuning are both getting more efficient, and costs are dropping ridiculously fast (fine-tunes went from costing thousands, to hundreds, and now to about $10 in the span of weeks). There are new optimizations and techniques being published every day (almost all of it reproducible, most w/ code repos).
For new foundational models, Cerebras and others now will happily do built-to-order ones for a flat fee, but I suspect all kinds of well-funded EDUs, research labs, corporations, maybe even nation states will continue to train/release new cutting edge models w/ permissive licenses.
Does it though? It looks more like it requires a lot of money for compute and a lot of money and data for parameter tuning, but the engineering effort seems so-so.
Except for the compute cost, this is a perfect application for a distributed open source labeling effort.
Just for my understanding though, are the data sets full of copyrighted material?
The irony is that OpenAI and Meta themselves might be on shaky ground for having trained models on other people's data with dubious rights to do so in many instances, and then using it to produce output commercially.
But this is a new frontier, and enforcement might be effectively impossible unless new legislation requires reproducibility and audits on the data sets or something like that.
But without that, how do you know exactly how they arrived at a given set of weights with Monte Carlo algorithms and arbitrary fine-tuning? You basically don't know what was there, and you cannot prove they didn't achieve those results with perfectly clean data.
PS: https://medium.com/geekculture/list-of-open-sourced-fine-tun...
One could use chatgpt / gpt4 to create better training material for those models, even if not allowed. In that sense there is an advantage to being second here.
I can see a not-too-distant future where initial "base" models (like LLaMA) are released by such entities that do have the resources, as they are seen as foundational enablers of the ecosystem (roughly equivalent to the Linux kernel or possibly Torch/TensorFlow/Transformers), where the "real" (differentiating) value from a commercial standpoint is something like 5-10 layers up the stack. The tremendous amount of value afforded by something like a Linux distribution isn't in the kernel, some random library, nginx, docker, etc. When you look from the hardware up, almost everything you see on HN is 90-99% the same code, frameworks, toolkits, etc.
Then, a wide diaspora of commercial, academic, etc interests and other collaborators scratch their own itches and push the needle forward. Some release to the public, some don't but at a certain scale the combined effort easily exceeds the resources available to even a large, well funded entity like OpenAI. I've talked about it before but the last study I could find from 2008 analyzed Fedora 9 and estimated it represented something like $10b in combined dev cost.
There are also such rapid advancements in fine-tuning models in limited VRAM environments, quantization, applying them to specific use-cases, tooling, etc. that the barrier to entry to iterate, build on, and actually use something like LLaMA is no longer 100 A100s (or whatever) and a dedicated large team. If you run apt-get install $SOMETHINGBIG and it grabs dozens of dependencies you've never heard of, it starts to drive this point home.
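To put rough numbers on the limited-VRAM point, here is a back-of-envelope sketch of how much memory it takes just to hold a model's weights at different precisions. The parameter counts are illustrative round numbers, not measurements of any particular release, and the function name is made up for this example. It ignores activations, the KV cache, optimizer state, and framework overhead, so real usage will be higher.

```python
# Back-of-envelope: memory needed just to store model weights.
# Assumption: weights only; activations, KV cache, optimizer state,
# and framework overhead are NOT included.

# Storage width of one parameter at each precision, in bytes.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(n_params: float, precision: str) -> float:
    """Return the GiB needed to store n_params weights at the given precision."""
    return n_params * BYTES_PER_PARAM[precision] / 2**30


if __name__ == "__main__":
    # Illustrative (hypothetical) parameter counts.
    for n_params in (7e9, 13e9, 65e9):
        row = ", ".join(
            f"{p}: {weight_memory_gib(n_params, p):.1f} GiB"
            for p in BYTES_PER_PARAM
        )
        print(f"{n_params / 1e9:.0f}B params -> {row}")
```

Under these assumptions a 7B-parameter model drops from roughly 13 GiB of weights at fp16 to a bit over 3 GiB at 4-bit, which is the difference between needing a datacenter card and fitting on a consumer GPU.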
I'm working on a project, to be announced/released soon, that pulls in something like > 100 Python dependencies and other misc enabling packages, frameworks, tools, etc., and ends up being a 12GB Docker image. Our "magic", meanwhile, is something like 1k LoC.
The biggest hole in this position is that releasing a model and weights could be viewed as the equivalent of releasing your application and data itself. But back to your original point: I don't see the entire world bifurcating into multi-billion dollar startups and "everyone else".
Or maybe I'm just being optimistic :).
Was a bit weird they mentioned eliza/gpt-j, I think, on it, but it didn't make much sense to me?
Did that happen, or was it just hallucinated?
These models perform slightly better than GPT-3 on some tasks[2], but they're still far from achieving the results of GPT-3.5 and GPT-4. This becomes evident when you try to use them in the real world; they're not "good enough" for general use cases, unlike ChatGPT models. However, if you can restrict your use case to one particular domain, you can achieve pretty good results by further fine-tuning these models.
[0]: https://huggingface.co/google/flan-t5-xxl
[1]: https://huggingface.co/google/flan-ul2
[2]: https://paperswithcode.com/sota/multi-task-language-understa...
I'd love any alternative view points of this.
Can't prove the alternative's financial model without showing the thing to real users. Can't know in advance if the new financial model will be pennies on AdWords' dollars.
Here's the ironic thing with Google and stuff like ChatGPT: asking ChatGPT something is like getting paragraphs of text similar to if someone read and summarized top X results from a search engine, without ads. If someone built a browser plugin to replace your URL bar/search engine with ChatGPT (and had it output links to references) how disruptive would that be to their cash cow?
Not yet mentioned:
* Pythia https://github.com/EleutherAI/pythia
* GLM-130B https://github.com/THUDM/GLM-130B - see also ChatGLM-6B https://github.com/THUDM/ChatGLM-6B
* GPT-NeoX-20B https://huggingface.co/EleutherAI/gpt-neox-20b
* GeoV-9B https://github.com/geov-ai/geov
* BLOOM https://huggingface.co/bigscience/bloom and BLOOMZ https://huggingface.co/bigscience/bloomz
Q: hello, who are you? A: I was twitted. In fact, twitted, or twittered, is one of those tweets that one either never sees or sees right away. Twitter is a website that allows users to post short messages that can be read and retweeted by other users. These messages are called tweets. So, who are you? Oh, hello, who are you? I was twitted. In fact, twitted, or twittered, is one of those tweets that one either never sees or sees right away. Twitter is a website that allows users to post short messages that can be read and retweeted by other
> You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any commercial, military, or illegal purposes.
A GPT has no training until you give it materials. I do believe Google released the code for theirs ages ago. Even without source, you can run a GPT against your own data locally, or on a cloud service set up for that purpose.
This is how Bloomberg, for example, created a financial LLM. They used a GPT to train on their own financial data.
So there are two cost factors... the cost of paying someone else to train and host the regular LLM part + yours, or the cost of setting up the (virtual) hardware and compute time to train and host those things on your own.
One "middle road" that might work for some applications is to use the OpenAI API (for example) to combine access to your own data in real time (via your private APIs) with the natural language understanding that's already present in the LLM. These are the plug-ins that are quickly taking over HN, many without any great utility on their own. But you can see that a pre-trained LLM plus access to your own data privately might very well be worth paying for.
AFAIK full model training should be a couple of orders of magnitude higher, probably?
I'd be wary of just hacking away without understanding at least the fundamentals of ML + NLP, or you'll find yourself lost pretty quickly.
I'm a former SWE turned NLP researcher, so I was recently in your position :)
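A minimal sketch of that "middle road", assuming the private data is a small list of text snippets and the hosted model is reachable through some function that takes a prompt and returns text. Everything here (`retrieve`, `build_prompt`, `answer`, the `complete` callable) is a hypothetical name for illustration, not any vendor's API, and the word-overlap ranking is a stand-in for a real search index. Note that whatever snippets get retrieved are sent to the hosted model, so this does not keep the data off the provider's servers.

```python
import re
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: List[str], k: int = 2) -> List[str]:
    """Return up to k documents sharing the most words with the question.

    Naive word-overlap ranking; a real system would use a search index
    or embeddings. Documents with no overlap are dropped.
    """
    q = tokenize(question)
    scored = [(len(q & tokenize(doc)), doc) for doc in documents]
    scored = [pair for pair in scored if pair[0] > 0]
    # sort() is stable, so ties keep their original document order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def build_prompt(question: str, context: List[str]) -> str:
    """Assemble a prompt that asks the model to answer from the context only."""
    context_block = "\n".join(f"- {snippet}" for snippet in context)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {question}\n"
        "Answer:"
    )


def answer(question: str, documents: List[str],
           complete: Callable[[str], str]) -> str:
    """Retrieve private context, build the prompt, and call the model.

    `complete` is a placeholder for whatever client calls the hosted LLM.
    """
    return complete(build_prompt(question, retrieve(question, documents)))
```

For a dry run, pass `complete=lambda prompt: prompt` to see exactly what text would leave your network before wiring in a real client.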
- karpathy - https://www.youtube.com/watch?v=kCc8FmEb1nY
- https://towardsdatascience.com/beautifully-illustrated-nlp-m...
- https://dzone.com/articles/a-deep-dive-into-the-transformer-...
- https://peterbloem.nl/blog/transformers
- http://nlp.seas.harvard.edu/2018/04/03/attention.html
- https://lilianweng.github.io/posts/2023-01-27-the-transforme...
- https://blog.quickchat.ai/post/tokens-entropy-question/
- https://dugas.ch/artificial_curiosity/GPT_architecture.html
- https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...
- https://d4mucfpksywv.cloudfront.net/better-language-models/l...
- https://arxiv.org/pdf/2005.14165.pdf
Source: https://help.openai.com/en/articles/5722486-how-your-data-is...
Probably can give directions where a software engineer can start to understand the concept.
Machine Learning models are made the same way machine learning output is generated.
In other words, the old model is training data to the new model. Just like the pirated torrent site dataset "Books3" Facebook used to train LLaMA is training data.
If Facebook can protect their model under copyright, then every publisher in existence can sue Facebook into the ground. They can't have it both ways.
This is a logical conclusion. But whether it actually holds is for the courts to decide.
I'm working on a package to help evaluate LLM results across different LLMs (e.g., GPT3.5 vs. GPT4 vs. Dolly 2 vs...); if you are looking to run experiments to compare results, I'd love to help you out. You can email me at w (at) phaseai (dot) com.
https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-...
A commercial product sure can be built on top of LLaMA; it's GPL-3. Your models are your own; just patches, modifications, and code you link to LLaMA itself will be governed by the GPL as well.
This is almost certainly what you want, since this way you can use patches, fixes, and improvements others make to LLaMA. You won't have to do all that work yourself, or necessarily wait for Facebook.
https://www.heise.de/news/Open-source-AI-LAION-proposes-to-o...
I think many of us have the same need and are waiting for OpenAI plug-in access.
Is this the question we are asking ourselves here, or are we talking about licensing?
I thought the OPT series can be used in production.