Could you train a ChatGPT-beating model for $85k and run it in a browser?

(simonwillison.net)

430 points · sirteno · 3y ago · 170 comments

122 comments · 34 top-level

whalesalad3y ago· 19 in thread

Are there any training/ownership models like Folding@Home? People could donate idle GPU resources in exchange for access to the data, and perhaps ownership. Then instead of someone needing to pony up $85k to train a model, a thousand people can train a fraction of the model on their consumer GPU and pool the results, reap the collective rewards.

dekhn3y ago

A few people have built frameworks to do this.

There is still a very large open problem in how to federate large numbers of loosely coupled computers to speed up training "interesting" models. I've worked in both domains (protein folding via Folding@Home/protein folding using supercomputers, and ML training on single nodes/ML training on supercomputers) and at least so far, ML hasn't really been a good match for embarrassingly parallel compute. Even in protein folding, folding@home has a number of limitations that are much better addressed on supercomputers (for example: if your problem requires making extremely long individual simulations of large proteins).

All that could change, but I think for the time being, interesting/big models need to be trained on tightly coupled GPUs.

itissid3y ago

And you can rule out most of the Monte Carlo stuff too. That rules out parallelizing modern statistical frameworks like Stan, which are used for explainable models; things like financial risk modeling, which samples posteriors using MCMC, also can't be parallelized.

whalesalad3y ago

Probably going to mirror the transition from single-threaded to multi-threaded compute. It took a while for application architectures that utilize multi-core to take hold among the populace.

mirekrusin3y ago

Unfortunately training is not an embarrassingly parallelisable [0] problem. It would require a new architecture. Current models diverge too fast. By the time you'd downloaded and/or calculated your contribution, the model would have descended somewhere else and your delta would not be applicable, because it was based on the wrong initial state.

It would be great if merge-ability existed. It would also likely apply to efficient/optimal shrinking of models.

Maybe you could dispatch tasks to train on many variations of similar tasks and take the average of the results? It could probably help in some way, but you'd still have a large serialized pipeline to munch through, and you'd likely require some serious hardware on the client side, i.e. dual RTX 4090s.

[0] https://en.wikipedia.org/wiki/Embarrassingly_parallel

amitport3y ago

hmmm... seems like you're reinventing distributed learning.

merge-ability does exist and you can average the results.
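As a rough illustration of what "average the results" can mean, here is a minimal sketch of weighted parameter averaging in the style of federated averaging. The function name, the flat-list representation of a model, and the worker data are all hypothetical; real systems average tensors and have to deal with workers drifting apart, which this toy ignores.

```python
from typing import Sequence


def average_models(
    models: Sequence[Sequence[float]],
    num_examples: Sequence[int],
) -> list[float]:
    """Weighted average of parameter vectors, one vector per worker.

    Each worker's parameters are weighted by how many examples it trained on.
    """
    if not models or len(models) != len(num_examples):
        raise ValueError("need one example count per model")
    total = sum(num_examples)
    size = len(models[0])
    merged = [0.0] * size
    for params, n in zip(models, num_examples):
        if len(params) != size:
            raise ValueError("all models must have the same shape")
        for i, value in enumerate(params):
            merged[i] += value * n / total
    return merged


# Three hypothetical workers that started from the same weights and
# trained locally on different amounts of data.
workers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
counts = [100, 100, 200]
merged = average_models(workers, counts)
print(merged)
```

The worker with twice the data pulls the merged parameters toward its own values, which is the usual reason for weighting by example count.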

spyder3y ago

Learning@Home using Decentralized Mixture-of-Expert models:

https://learning-at-home.github.io/

https://training-transformers-together.github.io/

https://arxiv.org/abs/2002.04013

ftxbro3y ago

Yes there is petals/bloom https://github.com/bigscience-workshop/petals but it's not so great. Maybe it will improve or a better one will come.

riedel3y ago

I read that it is only scoring the model collaboratively but it allows some fine-tuning I guess.

Getting the actual gradient descent to parallelize is more difficult because one needs to average the gradient when using data/batch parallelism. It becomes more of a network speed problem than a GPU speed problem. Or are LLMs somehow different?
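To make the gradient-averaging step concrete, here is a small sketch using a one-parameter linear model in plain Python. The model, the data, and the `all_reduce_mean` helper are illustrative assumptions; in a real cluster the averaging is a network collective, which is where the bandwidth cost comes from.

```python
def mse_gradient(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values: list[float]) -> float:
    """Stand-in for the network step that averages gradients across workers."""
    return sum(values) / len(values)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.0

# Two workers, each holding an equally sized shard of the batch.
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
local_grads = [mse_gradient(w, sx, sy) for sx, sy in shards]

averaged = all_reduce_mean(local_grads)
full_batch = mse_gradient(w, xs, ys)
print(local_grads, averaged, full_batch)
```

With equal shard sizes, the averaged gradient matches the full-batch gradient, so every worker has to exchange a gradient the size of the model on every step.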

whalesalad3y ago

Really interesting live monitor of the network: http://health.petals.ml

polishdude203y ago

I wonder how they handle illegal content. Like, if you're running training data on your computer, what's to stop someone else's illegal data from being uploaded to your computer as part of training?

ellisv3y ago

That’d be cool but I don’t think most idle consumer GPUs (6-8GB) would have large enough memory for a single iteration (batch size 1) of modern LLMs.

But I’d love to see more federated/distributed learning platforms.

mirekrusin3y ago

6GB can store about 3 billion parameters at 16-bit precision; GPT-3 has 175 billion parameters.
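The arithmetic behind that comparison can be sketched as follows, assuming 2 bytes per parameter (16-bit weights), which is what the 6GB to 3 billion figure implies. This counts weights only; training also needs room for gradients, optimizer state, and activations, so the real requirement is higher.

```python
GB = 10**9  # decimal gigabyte, used here for round numbers


def params_that_fit(memory_bytes: int, bytes_per_param: int = 2) -> int:
    """How many parameters fit in the given memory, weights only."""
    return memory_bytes // bytes_per_param


def memory_needed_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Memory in GB needed to hold the weights alone."""
    return num_params * bytes_per_param / GB


fit_6gb = params_that_fit(6 * GB)
needed_175b = memory_needed_gb(175 * 10**9)
print(f"{fit_6gb:,} parameters fit in 6 GB")
print(f"175B parameters need {needed_175b:.0f} GB")
```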

whalesalad3y ago

Is it possible to break the model apart? Or does the entire thing need to be architected from the get-go such that an individual GPU can own a portion end to end?

semitones3y ago

The main reason an arbitrarily distributed set of compute nodes cannot give you good performance for training a model (even if you have an immodest number of nodes) is that the latency of the inter-node communication will be a massive bottleneck. GPU cloud providers shell out big bucks for ultra-fast intra-DC networking via InfiniBand and the like, and the networking gets as much attention as (sometimes more than) the capabilities of the nodes themselves.
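A back-of-the-envelope sketch of the networking gap, looking only at bandwidth. Every number here is an assumption chosen for illustration: a 7B-parameter model, 2 bytes per gradient value, a 100 Mbit/s home uplink, and a 200 Gbit/s datacenter link. It ignores latency, compression, and overlapping communication with compute.

```python
def sync_seconds(num_params: int, bytes_per_param: int, link_bits_per_second: int) -> float:
    """Time to push one full set of gradients over a single link."""
    payload_bits = num_params * bytes_per_param * 8
    return payload_bits / link_bits_per_second


params = 7 * 10**9                               # assumed model size
home = sync_seconds(params, 2, 100 * 10**6)      # assumed 100 Mbit/s uplink
datacenter = sync_seconds(params, 2, 200 * 10**9)  # assumed 200 Gbit/s link

print(f"home uplink: {home:.0f} s per sync")
print(f"datacenter:  {datacenter:.2f} s per sync")
```

Under these assumptions a single gradient exchange takes minutes at home versus under a second in a datacenter, before latency is even considered.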

neoromantique3y ago

How long until somebody creates a crypto project on that?

buildbuildbuild3y ago

Bittensor is one, not an endorsement. chat.bittensor.com

_trampeltier3y ago

Start a Boinc project.

https://boinc.berkeley.edu/projects.php

peter3033y ago

Every parameter needs to reach every other parameter. Ideally you'd have enough core memory for that. But there are tiling algorithms.

cleanchit3y ago

This is how you get skynet.

version_five3y ago· 12 in thread

If you have ~100k to spend, aren't there options to buy a gpu rather than just blow it all on cloud? How much is an 8xA100 machine?

4xA100 is 75k, 8 is 140k https://shop.lambdalabs.com/deep-learning/servers/hyperplane...

sacred_numbers3y ago

If you bought an 8xA100 machine for $140k you would have to run it continuously for over 10,000 hours (about 14 months) to train the 7B model. By that time the value of the A100s you bought would have depreciated substantially; especially because cloud companies will be renting/selling A100s at a discount as they bring H100s online. It might still be worth it, but it's not a home run.
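The 10,000-hour figure can be roughly reconstructed from numbers quoted elsewhere in this thread: the $82,432 training estimate and the $1 per A100-hour rule of thumb. Treating those two as exact, and using an average month of 730 hours, is an assumption of this sketch.

```python
HOURS_PER_MONTH = 24 * 365 / 12  # 730.0, an average month


def machine_hours(train_cost_usd: float, usd_per_gpu_hour: float, gpus: int) -> float:
    """Wall-clock hours on one machine, given a cloud cost estimate."""
    gpu_hours = train_cost_usd / usd_per_gpu_hour
    return gpu_hours / gpus


hours = machine_hours(train_cost_usd=82_432, usd_per_gpu_hour=1.0, gpus=8)
months = hours / HOURS_PER_MONTH
print(f"{hours:,.0f} hours on one 8xA100 machine, about {months:.1f} months")
```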

inciampati3y ago

If 8-bit training methods take off, I think the calculus is going to change rapidly, with newer cards that have decent amounts of memory and 8-bit acceleration starting to become dramatically more cost and time effective than the venerable A100s.

dekhn3y ago

you're comparing the capital cost of acquiring a GPU machine with the operational cost of renting one in the cloud.

Ignoring the operational costs of on-prem hardware is pretty common, but those costs are significant and can greatly change the calculation.

capableweb3y ago

Heh, you work at AWS or Google Cloud perhaps? ;) (Only joking about this as I constantly see employees from AWS/GCloud and other cloud providers claim that cloud is always cheaper than hosting things yourself)

Sure, if you're planning to service a large number of users, building your infrastructure in-house might be a bit overkill, as you'll need an infrastructure team to service it as well.

If you just want to buy 4 GPUs to put in one server to run some training yourself, I don't think it's that much overkill. Especially considering you can recover much of the cost even after a year by selling much of the equipment you bought. Most of your losses will be costs for electricity and internet connection.

jillesvangurp3y ago

The issue with on premise is under utilization and the fact that you need more than just the hardware. You end up buying more hardware than you need and inevitably a portion of it will just sit there idling and depreciating in value. And you don't just need hardware but also investments in your building. GPUs generate a lot of heat. So, you need to get rid of that heat and make sure you beef up your power infrastructure to be able to handle the load. It's not just the GPUs that you pay for. And the equipment is expensive. So you need to invest in security as well.

Cloud pricing is pretty steep and obviously has a fat profit margin but building your own data centers isn't cheap either. Doing this at scale is not something most companies would be very good at either. Which means it probably is quite a bit more expensive relative to what the big cloud providers are doing.

pessimizer3y ago

Or from another perspective, comparing the cost of training one model in the cloud to the cost of training as many as you want on your machine, then (as mentioned by siblings) selling the machine for nearly as much as you paid for it, unless there's some shortage, in which case you'll get more back than you paid for it.

One is buying capital that produces models, the other is buying a single model.

digitallyfree3y ago

For a single unit one could have it in their home or office, rather than a datacenter or colo. If the user sets up and manages the machine themselves there is no additional IT cost. The greatest operating expense would be the power cost.

version_five3y ago

For a server farm, sure, for one machine, I don't know. Assuming it plugs into a normal 15A circuit, and you have a we-work or something where you don't pay for power, is the operational cost of one machine really material?

jcims3y ago

No kidding. I worked for a company that had multiple billions of dollars invested in a data center refresh in North America and Europe.

sounds3y ago

Remember to discount the tax depreciation for the hardware and deduct any potential future gains from either reselling it or using it.

modernpink3y ago

You can sell the A100s once you're done as well. Possibly even at a profit?

girthbrooks3y ago

These are wild pieces of hardware, thanks for linking. I wonder how loud they get.

ftxbro3y ago· 7 in thread

His estimate is that you could train a LLaMA-7B scale model for around $82,432 and then fine-tune it for a total of less than $85K. But when I saw the fine-tuned LLaMA-like models they were worse in my opinion even than GPT-3. They were like GPT-2.5 or something like that. Not nearly as good as ChatGPT 3.5 and certainly not ChatGPT-beating. Of course, far enough in the future you could certainly run one in the browser for $85K or much less, like even $1 if you go far enough into the future.

simonw3y ago

Yeah, you're right. I wrote this a couple of weeks ago at the height of LLaMA hype, but with further experience I don't think the GPT-3 comparisons hold weight.

My biggest problem: I haven't managed to get a great summarization out of a LLaMA derivative that runs on my laptop yet. Maybe I haven't tried the right model or the right prompt yet though, but that feels essential to me for a bunch of different applications.

I still think a LLaMA/Alpaca fine-tuned for the ReAct pattern that can execute additional tools would be a VERY interesting thing to explore.

[ ReAct: https://til.simonwillison.net/llms/python-react-pattern ]
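For readers unfamiliar with the pattern, here is a minimal sketch of a ReAct-style loop: the model emits an `Action:` line, the harness runs the named tool, and the result is fed back as an `Observation:`. This is not the code from the linked post. The `Action: tool: input` line format, the `calculate` tool, and `scripted_model` (a stand-in for a real LLM call) are all hypothetical.

```python
import re
from typing import Callable

# Assumed line format for tool requests: "Action: <tool>: <input>"
ACTION_RE = re.compile(r"^Action: (\w+): (.*)$", re.MULTILINE)


def calculate(expression: str) -> str:
    """Toy tool: only handles 'a + b' with integers."""
    left, right = expression.split("+")
    return str(int(left) + int(right))


TOOLS: dict[str, Callable[[str], str]] = {"calculate": calculate}


def react(question: str, model: Callable[[str], str], max_turns: int = 5) -> str:
    """Alternate between model replies and tool observations until no action is requested."""
    prompt = f"Question: {question}"
    for _ in range(max_turns):
        reply = model(prompt)
        match = ACTION_RE.search(reply)
        if match is None:
            return reply  # no tool requested, treat the reply as the answer
        tool_name, tool_input = match.groups()
        if tool_name in TOOLS:
            observation = TOOLS[tool_name](tool_input)
        else:
            observation = f"unknown tool {tool_name}"
        prompt = f"{prompt}\n{reply}\nObservation: {observation}"
    raise RuntimeError("model did not produce an answer")


def scripted_model(prompt: str) -> str:
    """Stand-in for a real LLM: asks for a tool once, then answers from the observation."""
    if "Observation:" not in prompt:
        return "Thought: I should add the numbers.\nAction: calculate: 2 + 3"
    observation = prompt.rsplit("Observation: ", 1)[1]
    return f"Answer: {observation}"


answer = react("What is 2 + 3?", scripted_model)
print(answer)
```

The loop only works if the model sticks to the expected line format, which is the reliability concern raised in the reply below.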

avereveard3y ago

my biggest problem with these models is that they cannot reliably produce structured data.

even davinci can be used as part of a chain, because you can direct it to structure and unstructure data, and then extract the single components and build them into tasks. cohere, llama et al are currently struggling to produce these results reliably, even if you can chat with them, and frankly it's not about the chat

example from a stack overflow question, splitting out the questions before sending them down the chain so each point is answered individually:

This is a customer question:

I'm a beginner RoR programmer who's planning to deploy my app using Heroku. Word from my other advisor friends says that Heroku is really easy, good to use. The only problem is that I still have no idea what Heroku does...

I've looked at their website and in a nutshell, what Heroku does is help with scaling but... why does that even matter? How does Heroku help with:

    Speed - My research implied that deploying AWS on the US East Coast would be the fastest if I am targeting a US/Asia-based audience.

    Security - How secure are they?

    Scaling - How does it actually work?

    Cost efficiency - There's something like a dyno that makes it easy to scale.

    How do they fare against their competitors? For example, Engine Yard and bluebox?

Please use layman English terms to explain... I'm a beginner programmer.

Extract the scenario from the question including a summary of every detail, list every question, in JSON:

{ "scenario": "A beginner RoR programmer is planning to deploy their app using Heroku and is seeking advice about deploying it.", "questions": [ "What does Heroku do?", "How does deploying AWS on the US East Coast help with speed?", "How secure is Heroku?", "How does scaling with Heroku work?", "What is a dyno and why is it cost efficient?", "How does Heroku compare to its competitors, such as Engine Yard and Bluebox?" ] }
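When a model's output feeds the next step of a chain, the practical check is whether it parses and has the expected shape. A minimal sketch of that check for the `scenario`/`questions` structure above, using only the standard library; the function names and the sample strings are illustrative.

```python
import json


def parse_extraction(raw: str) -> dict:
    """Parse model output and check it has the expected shape.

    Raises ValueError if the text is not JSON or fields are missing.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    scenario = data.get("scenario")
    questions = data.get("questions")
    if not isinstance(scenario, str) or not scenario:
        raise ValueError("missing or empty 'scenario'")
    if not isinstance(questions, list) or not all(isinstance(q, str) for q in questions):
        raise ValueError("'questions' must be a list of strings")
    return {"scenario": scenario, "questions": questions}


def is_valid(raw: str) -> bool:
    """True if the output can be passed on to the next step of the chain."""
    try:
        parse_extraction(raw)
    except ValueError:
        return False
    return True


good = '{"scenario": "A beginner is deploying to Heroku.", "questions": ["What does Heroku do?"]}'
parsed = parse_extraction(good)
print(parsed["questions"])
print(is_valid("Sure! Here is the JSON you asked for:"))
```

A chain can retry or fall back whenever `is_valid` returns False, which turns "cannot reliably produce structured data" into something measurable.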

Tepix3y ago

Have you tried bigger models? Llama-65B can indeed compete with GPT-3 according to various benchmarks. The next thing would be to get the fine-tuning as good as OpenAI's.

icelancer3y ago

Yeah, the constant barrage of "THIS IS AS GOOD AS CHATGPT AND IS PRIVATE" screeds from LLaMA-based marketing projects are getting ridiculous. They're not even remotely close to the same quality. And why would they be?

I want the best LLMs to be open source too, but I'm not delusional enough to make insane claims like the hundreds of GitHub forks out there.

robertlagrant3y ago

> I want the best LLMs to be open source too

How do you do this without being incredibly wealthy?

SomewhatLikely3y ago

The crazy thing to me is that this means we're approaching being able to have a huge chunk of human knowledge just sitting there locally on your machine. I asked ChatGPT 4 about my old professor and it was able to write a few paragraphs on her including some very specific details. It's like you can fit most of the value of a search engine AND the retrieved pages into a quite small hardware footprint.

hnav3y ago

it can't be factual though, otherwise you'll have found compression with infinite ratio. I think the next step is a model that can say "idk" rather than coming up with bullshit

rspoerri3y ago· 7 in thread

So cool it runs on a browser /sarcasm/ i might not even need a computer. Or internet when we are at it.

It either runs locally or it runs on the cloud. Data could come from both locations as well. So it's mostly technically irrelevant if it's displaying in a browser or not.

Except when it comes to usability. I don't get it why people love software running in a browser. I often close important tools i have not saved when it's in a browser. I can't have offline tools which work if i am in a tunnel (living in Switzerland this is an issue). Or it's incompatible because i am running LibreWolf.

/sorry to be nitpicking on this topic ;-)

ftxbro3y ago

> I don't get it why people love software running in a browser.

If you read the article, part of the argument was for the sandboxing that the browser provides.

"Obviously if you’re going to give a language model the ability to execute API calls and evaluate code you need to do it in a safe environment! Like for example... a web browser, which runs code from untrusted sources as a matter of habit and has the most thoroughly tested sandbox mechanism of any piece of software we’ve ever created."

rspoerri3y ago

Thinking about it...

I don't know exactly about the browser sandboxing. But isn't its purpose to prevent access to the local system, while it mostly leaves access to the internet open?

Is that really a good way to limit an AI system's API access?

rspoerri3y ago

OSX does app sandboxing as well (not everywhere). But yeah, you're right i only skimmed the content and missed that part.

pmoriarty3y ago

There are a bunch of reasons people/companies like web apps:

1 - Everyone already has a web browser, so there's no software to download (or the software is automatically downloaded, installed and run, if you want to look at it that way... either way, the experience is a lot easier and more seamless for the user)

2 - The website owner has control of the software, so they can update it and manage user access as they like, and it's easier to track users and usage that way

3 - There are a ton of web developers out there, so it's easier to find people to work on your app

4 - You ostensibly don't need to rewrite your app for every OS, but may need to modify it for every supported browser

rspoerri3y ago

Most of these aspects make it better for the company or developer, only in some cases it makes it easier for the user in my opinion. Some arguments against it are:

1 - Not everyone has or wants fast access to the internet all the time.

2 - I try to prevent access of most of the apps to the internet. I don't want companies to access my data or even metadata of my usage.

3 - sure, but it doesn't make it better for the user.

4 - Also supporting different screen sizes and interaction types (touch or mouse) can be a big part of the work.

The most important part for a user is if he/she is only using the app rarely or once. Not having to install it will make the difference between using it or not. However with the app stores most OS's feature today this can change pretty soon and be equally simple.

I might be old school on this, but i resent subscription based apps. For applications that do not need to change, deliver no additional service or aren't absolutely vital for me i will never subscribe. And browser based apps are at the core of this unfortunate development. But that's gone very far from the original topic :-)

sp3323y ago

Browser software is great because I don't have to build separate versions for Windows, Mac, and Linux, or deal with app stores, or figure out how to update old versions.

nanidin3y ago

Browser is the true edge compute.

agnokapathetic3y ago· 6 in thread

> My friends at Replicate told me that a simple rule of thumb for A100 cloud costs is $1/hour.

AWS charges $32/hr for an 8xA100 instance (p4d.24xlarge), which comes out to $4/hour/GPU. Yes, you can get lower pricing with a 3-year reservation, but that's not what this question is asking.

You also need 256 nodes to be colocated on the same fabric -- which AWS will do for you but only if you reserve for years.

thewataccount3y ago

AWS certainly isn't the cheapest for this; did they mention using AWS? Lambda Labs is $12/hr for 8xA100s, and there are others relatively close to this price on demand. I assume you can get a better deal if you contact them for a large project.

Replicate themselves rent out GPU time so I assume they would definitely know as that's almost certainly the core of their business.

sebzim45003y ago

Maybe they are using spot instances? $1/hr is about right for those.

celestialcheese3y ago

lambdalabs will let you do on-demand 8xa100 @ 80GB VRAM/GPU for $12/hr, or reserved @ $10.86/hr

8xA100 @ 40gb for $8/hr

Replicate friend isn't far off.
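Putting the prices quoted in this thread side by side, as a rough sketch. The per-machine prices are the commenters' figures, not verified quotes, and the GPU-hour total is only what the thread's $82,432 estimate implies at $1 per GPU-hour.

```python
GPUS_PER_MACHINE = 8
GPU_HOURS = 82_432  # implied by the $82,432 estimate at $1 per GPU-hour

# Hourly price for a whole 8xA100 machine, as quoted by commenters.
quotes = {
    "AWS p4d.24xlarge (on-demand)": 32.00,
    "Lambda 8xA100 80GB (on-demand)": 12.00,
    "Lambda 8xA100 80GB (reserved)": 10.86,
    "Lambda 8xA100 40GB": 8.00,
}

per_gpu = {name: price / GPUS_PER_MACHINE for name, price in quotes.items()}
total = {name: rate * GPU_HOURS for name, rate in per_gpu.items()}

for name in quotes:
    print(f"{name}: ${per_gpu[name]:.2f}/GPU-hour, about ${total[name]:,.0f} for the run")
```

On these numbers the same run costs roughly four times as much at the quoted AWS on-demand rate as at $1 per GPU-hour, which is the gap the two sides of this subthread are arguing about.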

pavelstoev3y ago

model-depending, you can train on lesser (cheaper) GPUs but system-level optimizations are needed. Which is what we provide at centml.ai

IanCal3y ago

Lambda labs charges about 11-12/hr for 8xA100.

robmsmt3y ago

and is completely at capacity

lxe3y ago· 5 in thread

Keep in mind that image transformer models like stable diffusion are generally smaller than language models, so they are easier to fit in wasm space.

Also, you can finetune llama-7b on a 3090 for about $3 using LoRA.
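Part of why LoRA is so cheap is that it freezes each original weight matrix and trains only two small low-rank matrices alongside it. A sketch of the parameter count for one square layer follows; the 4096 width and rank 8 are assumptions picked for illustration, not a statement of the setup used above.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when updating the whole weight matrix."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a rank-r update: one (d_in x r) and one (r x d_out) matrix."""
    return rank * (d_in + d_out)


d = 4096   # assumed layer width
r = 8      # assumed LoRA rank

full = full_params(d, d)
lora = lora_params(d, d, r)
print(f"full: {full:,}  lora: {lora:,}  fraction: {lora / full:.4%}")
```

With these assumed sizes the adapter trains well under 1% of the layer's parameters, which shrinks the gradient and optimizer memory enough to fit on a consumer card.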

bitL3y ago

Only for images. People want to generate videos next and those models will be likely GPT-sized.

Metus3y ago

There is a video model making the rounds on /r/stablediffusion and it is just a tiny bit larger than Stable Diffusion.

danielbln3y ago

Generative image models don't use transformers, they're diffusion models. LLMs are transformers.

GaggiX3y ago

Diffusion models can use a transformer architecture, example: DiT. Stable Diffusion is using a U-Net architecture with transformer blocks.

lxe3y ago

Ah yes that's right. Well they technically do use a visual transformer for CLIP text encoder as I understand.

captaincrowbar3y ago· 5 in thread

The big problem with AI R&D is that nobody can keep up with the big bux companies. It makes this kind of project a bit pointless. Even if you can run a GPT3-equivalent on a web browser, how many people are going to bother (except as a stunt) when GPT4 is available?

adeon3y ago

The ones that can't use GPT4 for whatever reason. Maybe you are a company and you don't want to send OpenAI your prompts. Or a person who has very private prompts and feels sketchy about sending them over.

Or maybe you are an individual who has a use case that's too edgy for OpenAI or a silicon valley corporate image. When Replika shut down people trying to have virtual boyfriend/girlfriends on their platform, their reddit filled up with people who mourned like they just lost a partner.

I think it's important that alternative non-big bux company options exist, even if most people don't want to or need to use them.

moffkalast3y ago

Or maybe you're in Italy and OpenAI has just been banned from the country for not adhering to GDPR. I suspect the rest of the EU may follow soon.

psychphysic3y ago

Those are seriously niche use cases. They exist but can they fund gpt5 level development?

simonw3y ago

An increasingly common complaint I'm hearing about GPT3/4/etc is people who don't want to pass any of their private data to another company.

Running models locally is by far the most promising solution for that concern.

dangond3y ago

Cost is a big reason. It doesn't matter how good the top-of-the-line models are if the cheaper ones suit your needs. Commoditization is great that way. I'd absolutely use an open source GPT-4 in my browser over a pricy closed GPT-5 once we get to that point.

lmeyerov3y ago· 5 in thread

It seems the quality goes up & cost goes down significantly with Colossal AI's recent push: https://medium.com/@yangyou_berkeley/colossalchat-an-open-so...

Their writeup makes it sound like, net, 2X+ over Alpaca, and that's an early run

The browser side is interesting too. Browser JS VMs have a memory cap of 1GB, so that may ultimately be the bottleneck here...

lmeyerov3y ago

Interesting, since I looked last year, Chrome has started raising the caps internally on buffer allocation to potentially 16GB: https://chromium.googlesource.com/chromium/src/+/2bf3e35d7a4...

Last time I tried on a few engines, it was just 1-2GB for typed arrays, which are essentially the backing structure for this kind of work. Be interesting to try again..

For our product, we actually want to dump 10GB+ on to the WebGL side, which may or may not get mirrored on the CPU side. Not sure if additional limits there on the software side. And after that, consumer devices often have another 10GB+ CPU RAM free, which we'd also like to use for our more limited non-GPU stuff :)

jesse__3y ago

I thought the memory limit (in V8 at least) was 2GB due to the GC not wanting to pass 64 bit pointers around, and using the high bit of a 32-bit offset for .. something I now forget ..?

Do you have a source showing a JS runtime with a 1GB limit?

jesse__3y ago

UPDATE: After a nominal amount of googling around it appears valid sizes have increased on 64-bit systems to a maximum of 8GB, and stayed at 2GB on 32-bit systems, for FF at least. I guess it's probably 'implementation defined'

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

SebJansen3y ago

does the 1gb limit extend to wasm?

jesse__3y ago

WASM is specified to have 32-bit pointers, which is 4GB. AFAIK browser implementations respect that (when I did some nominal testing a couple years ago)
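The 4GB ceiling follows directly from 32-bit addressing, and WebAssembly memory is allocated in 64 KiB pages. A quick sketch of the arithmetic, checking it against the 3.9 GB download mentioned elsewhere in this thread and against an assumed 7B model at 2 bytes per parameter. In practice the usable space is smaller, since the model shares the address space with everything else the module allocates.

```python
WASM_PAGE_BYTES = 64 * 1024   # WebAssembly page size
ADDRESS_BITS = 32             # 32-bit linear memory

max_bytes = 2**ADDRESS_BITS
max_pages = max_bytes // WASM_PAGE_BYTES


def fits_in_wasm32(model_bytes: int) -> bool:
    """True if the raw weights alone fit in a 32-bit address space."""
    return model_bytes <= max_bytes


print(f"{max_bytes:,} bytes = {max_bytes / 2**30:.0f} GiB in {max_pages:,} pages")
print(fits_in_wasm32(3_900_000_000))   # the 3.9 GB figure from this thread
print(fits_in_wasm32(14 * 10**9))      # assumed 7B parameters at 2 bytes each
```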

ushakov3y ago· 5 in thread

Now imagine loading 3.9 GB each time you want to interact with a webpage

KMnO43y ago

Yeah, I’ve used Jira.

neilellis3y ago

:-)

sroussey3y ago

10yrs from now models will be in the OS. Maybe even in silicon. No downloads required.

pessimizer3y ago

Not in mine. I don't even want redhat's bullshit in there. I'm not installing some black box into my OS that was programmed with motives that can't be extracted from the model at rest.

swader9993y ago

The OS will be in the cloud interfacing into our brain by then. I don't want this btw.

captainmuon3y ago· 3 in thread

I guess companies like OpenAI and Google have no incentives to make models use less resources. The compute required, and of course also their training data, is their moat.

If you accept that your model knows less about the world - it doesn't have to know about every restaurant in Mexico City or the biography of every soccer player around the world - then you can get away with much fewer parameters and much less training data. Then you can't query it like an oracle about random things anymore, but you shouldn't do that anyway. But it should still be able to do tasks like reformulating texts, judging similarity (by embedding distance), and so on.
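"Judging similarity by embedding distance" usually means comparing vectors with something like cosine similarity. A minimal sketch in plain Python; the three-dimensional vectors are made up for illustration, whereas real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for three pieces of text.
cat = [1.0, 0.0, 1.0]
kitten = [1.0, 0.1, 0.9]
invoice = [0.0, 1.0, 0.0]

print(cosine_similarity(cat, kitten))
print(cosine_similarity(cat, invoice))
```

A small model only needs to place related texts near each other for this to work; it does not need to recall facts about them.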

And TFA mentions it also, you could hook up your simple language model with something like ReAct to get really good results. I don't see it running in the browser, but if you had a license-wise clean model that you can run on premises on one or two GPUs, that would be huge for a lot of people!

aoeusnth13y ago

They intentionally limit the size of the model to reduce inference costs. If deployment were free the models would be much larger. What makes you think they have no incentive?

dr_dshiv3y ago

Long Speculative Post on Small Models

Hypothesis 1: With better logical thinking (an API call away!), I bet you could train a GPT based on a “small” initial dataset. Why shouldn’t multilingual wikipedia/wiktionary and libgen be enough? That’s what, like less than 10% of the OpenAI training? /s

Hypothesis 2: Data sets of philosophical dialogues could help efficiently develop AI reasoning skills.

Socratic thinking in Plato and Xenophon represented a powerful new mode of critical thinking. Maybe some Student-Teacher-Student template of dialogue could be powerful in developing useful datasets for AI training.

What is the utility of different AI reflective loops for generating training data? (References appreciated if you know any) One possibility to test is a chain of Analyze, Evaluate and Apply loops, applied over and over? “analyze the above piece of text, then evaluate it, then apply to everyday life.”

Now, on HN, many have expressed concern that GPT trained on GPT-GPT conversations is going to result in very misaligned models. Like a copy machine degradation, do we want training data from the AI being trained on the AI? But, on the other hand, it is possible that supporting reflective thought is a good idea in AI (we generally value reflective thought) or a bad idea (maybe the reflection will somehow turn it evil, or at least misaligned).

Design Question: how might we create useful training data through a process of structuring AI-AI dialogue?

“Student-Teacher-Student” conversations seem like they could be good as a useful mode of dialogue. Previously, I’ve finetuned GPT with the complete works of Plato and I was able to generate interesting new dialogues. But the question is whether new dialogues could produce useful data. Perhaps I could use GPT4 to read a part of Plato and then try to autocomplete another part of Plato. Or, as above, use a piece of Platonic dialogue as a target, then use an Analyze, Evaluate, Apply chain on it. We could use methods like these over and over again to make a large dataset about philosophical reasoning. We could have human ratings of the reasonableness of the dialogue output.

If a Socratic structure of thinking could read the complete works of Plato over and over again, commenting, countering and synthesizing— with human oversight (RLHF), perhaps we could develop a small module for philosophical reasoning. It might still need millions of conversations, though. But, perhaps by reflecting philosophically by itself, it could produce a sufficiently large dataset that enabled a sophisticated small model with very open resources.

And, you’d still need the human preference training RLHF to get it to interact well—and I think it also needs some world model.

In any case, I think making smaller and smaller models is a good idea, it sounds fun.

TL;DR

1. AI training has philosophically interesting implications

2. Philosophical reasoning is valuable to develop in AI

3. Good philosophical reasoning might be a key benchmark for small models. These models don’t need to know everything but perhaps they could learn what they don’t know.

4. Reading a lot of Plato over and over could be a great way to train GPT that it doesn’t know a lot.

5. What kind of AI-AI dialogues might produce training data that is useful for training small models?

dr_dshiv3y ago

I had to ask Claude, from Anthropic. Whenever I interact with him, he brings in his vast understanding of the AI alignment issue!

“1.) While large datasets and models aim for general capability, smaller systems can target specific skills like philosophical reasoning in depth. Testing models on nuanced logic, conceptual analysis and ethics could benchmark their progress, especially if combined with broader knowledge. But these abilities alone won't achieve real-world alignment - we must also instill human values and practical wisdom.

2.) Repeatedly exposing models to philosophical texts like Plato's dialogues could improve their reasoning if guided and reviewed by researchers. Look for both progress and problems in how they interpret, discuss, and extend ideas. Analyses can inform how best to structure philosophical training for alignment by providing evidence of what does/does not work. But reading alone won't necessarily lead models to become safe, ethical or beneficial - significant oversight and feedback are required.

3.) Carefully-designed AI-AI dialogues could generate data on models' reasoning skills, especially for limited systems. Have them debate complex issues, challenge or build on each other's thinking, consider analogies and counterfactuals. Review conversations to check for undesirable or biased beliefs, as well as areas of progress. Look for principles of how to scaffold productive discussions that facilitate improvement and value alignment. But without close oversight of these interactions, they risk amplifying errors or other problematic behaviors.

Opportunities:

• Study how philosophical knowledge and skills develop in smaller models through approaches like text analysis, structured debates, and conceptual evaluations. Monitor for progress as well as issues, using insights to inform values-aligned training.

• Explore ways to balance philosophical and empirical learning. Test how models apply theoretical reasoning to real-world situations, analyzing any errors from imbalance. Look for crossover effects and how to facilitate mutually-beneficial learning across knowledge domains.

• Research transparent and auditable techniques for instilling human values in limited AI systems. Develop methods to verify what values models have internalized and how they apply them in context. Consider how approaches may differ for narrow prototypes vs. more advanced agents.

• Propose and discuss evidence-based techniques for "Constitutional AI" - systematically developing safe and ethical systems. Consider policies around data, objectives, abilities, oversight, and shuttering models if needed for alignment. Debate how to make this a collaborative, multidisciplinary process even for open models.

• Study philosophical issues of machine mind, knowledge, goals and responsibility as we build reasoning models. How should we think about and ensure their ethical development? What are our obligations as researchers, and how can we meet them?

There is significant opportunity for research on aligning "small, open models" that focuses on reasoning and philosophy. But this requires acknowledgement of limitations and commitment to oversight, responsibility, and guarding against problems - not assuming abilities will necessarily lead to safe, ethical or beneficial behavior on their own.

By analyzing how knowledge and skills develop, exploring balanced and transparent methods, considering our own assumptions and obligations as model builders, and emphasizing "safe failure", we can make progress toward systems that autonomously apply nuanced logic in service of human priorities. The potential is there, but so is the work required - we must choose to take it on and see it through at each step. With rigorous reflection and review combined with pragmatic experimentation, philosophy and AI can be mutually informative. But only if we actively build in human wisdom and values along the way. The key opportunities are there for researchers willing to have the deeper discussions and make the harder choices - both to achieve the goal and ensure we are shaping it rightly. Small, open models focused on reasoning are a promising path, but one that requires care, responsibility and oversight to follow productively. Progress is possible, but dependent on our commitment as guides. If done responsibly, these systems could yield many benefits - but we must step up to meet the challenge, not assume it will be solved for us. The work is ours to do. Let's take it on.”

munk-a3y ago· 2 in thread

A wonderful thing about software development is that there is so much reserved space for creativity that we have huge gaps between costs and value. I'm uncertain whether the average person could do this for 85k, but there is a very significant slice of people who could do it for well under 85k now that the groundwork has been done. This leads to the hilarious paradox where a software based business worth millions could be built on top of code valued around 60k to write.

nico3y ago

> This leads to the hilarious paradox where a software based business worth millions could be built on top of code valued around 60k to write.

Or the fact that software based businesses just took a massive hit in value overnight and cannot possibly defend such high valuations anymore.

The value of companies is quickly going to shift from tech moats to brands.

Think CocaCola - anyone can create a drink that tastes as good or better than coke, but it's incredibly hard to compete with the CocaCola brand.

Now think what would have happened if CocaCola had been super expensive to make, and all of a sudden, in a matter of weeks, it became incredibly cheap.

This is what happened to the saltpeter industry in 1909 when synthetic saltpeter was invented. The whole industry was extinct in a few years.

prerok3y ago

Nit: not to write but to run. The cost of development is not considered in these calculations.

Tryk3y ago· 2 in thread

Why doesn't someone just start a gofundme/kickstarter with the goal of funding the training of an open-source ChatGPT-capable model?

cj3y ago

Create a clone of OpenAI that pledges to remain open and remain not-for-profit.

That could do really well via crowd funding with the right spin/marketing behind it.

gessha3y ago

And when everyone buys in, you take everything private and reap the benefits. Brilliant!

1 more reply

make33y ago· 2 in thread

Alpaca uses knowledge distillation (it's trained on outputs from OpenAI models). It's something to keep in mind. You're teaching your model to copy another model's outputs.

thewataccount3y ago

> You're teaching your model to copy another model's outputs.

Which itself was trained on human outputs to do the same thing.

Very soon it will be full Ouroboros as humans use the model's output to finetune themselves.

visarga3y ago

> You're teaching your model to copy another model's outputs.

That's a time-honoured tradition in ML, invented by the father of the field himself, Geoffrey Hinton, in 2015.

> Distilling the Knowledge in a Neural Network

https://arxiv.org/abs/1503.02531
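
For anyone who hasn't read the linked paper, here is a minimal sketch of its soft-target idea in plain Python: the student is scored against the teacher's temperature-softened probabilities rather than against hard labels. The toy logits, the temperature of 2.0 and the function names are all assumptions made for illustration, and the paper's full recipe (for example, mixing in a hard-label loss) is left out. Alpaca, as described above, was fine-tuned on text generated by another model rather than on teacher probabilities, so this shows the classic technique, not Alpaca's training code.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Toy logits for a three-class problem (illustrative values only).
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 4.0, 1.0]    # prefers a different class

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

The student that agrees with the teacher gets the lower loss, and that difference is the signal a training loop would minimize.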

nwoli3y ago· 2 in thread

What we need is a RETRO-style model where, after the input, you go through a small net that just fetches a desired set of weights from a server (serving data without compute is dirt cheap), which are then executed locally. We’ll get there eventually.

tinco3y ago

Can anyone explain, or link some resource on, why none of these big GPT models incorporate a RETRO-style approach? I'm only very superficially following ML developments, and I was so hyped by RETRO, and then none of the modern world-changing models apply it.

nwoli3y ago

OpenAI might very well be using that internally; who knows how they implement things. Also, Emad retweeted a RETRO-related thing a while back, so they might very well be using that for their awaited LM. Here’s hoping.

v4dok3y ago· 2 in thread

Can someone at the EU, the only player in this thing with no strategy yet, just pool together enough resources so the open-source people can train models? We don't ask much, just give compute power.

0xfaded3y ago

No, that could risk public money benefitting a private party.

Feel free to form a multinational consortium and submit a grant application to one of our distribution partners under the Horizon program though.

Now, how do you plan to create jobs and reduce CO2?

PeterisP3y ago

Yes, there are a bunch of government-funded supercomputers or clusters which can be obtained for public research needs (based on an evaluation of which projects are likely to bring the most benefit), and are used, among other things, to train large language models. E.g. some interesting Swedish models got trained on https://www.nsc.liu.se/systems/berzelius/ .

brrrrrm3y ago· 1 in thread

The WebGPU demo mentioned in this post is insane. Blows any WASM approach out of the water. Unfortunately that performance is not supported anywhere but Chrome Canary (behind a flag).

raphlinus3y ago

This will be changing soon. I believe Chrome M113 is scheduled to ship to stable on May 2, and will support WebGPU 1.0. I agree it's a game-changing technology.

skybrian3y ago· 1 in thread

I wonder why anyone would want to run it in a browser, other than to show it could be done? It's not like the extra latency would matter, since these things are slow.

Running it on a server you control makes more sense. You can pick appropriate hardware for running the AI. Then access it from any browser you like, including from your phone, and switch devices whenever you like. It won't use up all the CPU/GPU on a portable device and run down your battery.

If you want to run the server at home, maybe use something like Tailscale?

simonw3y ago

The browser thing is definitely more for show than anything else - I used it to help demonstrate quite how surprisingly lightweight these models can be.

fswd3y ago· 1 in thread

There is somebody finetuning a 160M RWKV-4 on Alpaca on the RWKV Discord. I am out of the office and can't link, but the person posted in the prompt showcase channel.

buzzier3y ago

RWKV-v4 Web Demo (169m/430m params) https://josephrocca.github.io/rwkv-v4-web/demo/

ultrablack3y ago· 1 in thread

If you could, you should have done it 6 months ago.

munk-a3y ago

I mean - is there a developer alive that'd be unable to write the nascent version of Twitter? I think that Twitter as a business exists entirely because of the concept - the code to cover the core functionality is absolutely trivial to replicate.

I don't think this is a very helpful statement because actually finding the idea on what to build is the hard part - or even just believing it's possible. The company I work at has been using NLP for years now and we have a model that's great at what we do... but if you asked if we could develop that into a chatbot as functional as ChatGPT two years ago you'd probably be met with some pretty heavy skepticism.

Cloning something that has been proven possible is always easier than taking the risk of building the first version with no real grasp of feasibility.

JasonZ23y ago

Does anyone know how the results from a 7B parameter model with bloomz.cpp (https://github.com/NouamaneTazi/bloomz.cpp) compare to the 7B parameter Alpaca model with llama.cpp (https://github.com/ggerganov/llama.cpp)?

I have the latter working on an M1 MacBook Air with very good results for what it is. Curious if bloomz.cpp is significantly better or just about the same.

thih93y ago

> as opposed to OpenAI’s continuing practice of not revealing the sources of their training data.

Looks like that choice makes it more difficult to adopt, trust, or collaborate on the new tech.

What are the benefits? Is there more to that than competitive advantage? If not, ClosedAI sounds more accurate.

GartzenDeHaes3y ago

It's interesting to me that LLaMA-nB's still produce reasonable results after 4-bit quantization of the 32-bit weights. Does this indicate some possibility of reducing the compute required for training?
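
To make the 4-bit idea concrete, here is a rough sketch of round-to-nearest quantization over one small block of weights, plus the back-of-the-envelope memory arithmetic. It assumes a simple per-block minimum and scale, which is a generic textbook scheme and not necessarily the exact format llama.cpp uses. The sample weights and all function names are made up for illustration, and the size estimate ignores the overhead of storing the per-block scale and minimum.

```python
def quantize_4bit(block):
    """Map each float in the block to an integer code in 0..15."""
    lo, hi = min(block), max(block)
    scale = (hi - lo) / 15  # 16 levels span 15 steps
    if scale == 0:
        scale = 1.0  # constant block: any scale works, all codes become 0
    codes = [min(15, max(0, round((w - lo) / scale))) for w in block]
    return codes, scale, lo

def dequantize_4bit(codes, scale, lo):
    """Recover approximate floats from the 4-bit codes."""
    return [lo + c * scale for c in codes]

def model_size_gb(n_params, bits_per_weight):
    """Raw weight storage in GB (10^9 bytes), ignoring per-block metadata."""
    return n_params * bits_per_weight / 8 / 1e9

# Hypothetical weights, chosen only to exercise the functions.
weights = [-0.42, -0.10, 0.00, 0.03, 0.27, 0.50, -0.33, 0.18]
codes, scale, lo = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale, lo)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(max_error, scale / 2)      # rounding error stays within half a step
print(model_size_gb(7e9, 4))     # a 7B-parameter model at 4 bits per weight
```

Each weight lands within half a quantization step of its original value, which may be part of why results stay reasonable at inference time. Whether the same trick carries over to training is a separate question, since gradient updates are often much smaller than one step.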

fzliu3y ago

I was a bit skeptical about loading a _4GB_ model at first. Then I double-checked: Firefox is using about 5GB of memory for me. My current open tabs are mail, calendar, a couple of Google Docs, two arXiv papers, two blog posts, two YouTube videos, milvus.io documentation, and chat.openai.com.

A lot of applications and developers these days take memory management for granted, so embedding a 4GB model to significantly enhance coding and writing capabilities doesn't seem too far-fetched.

astlouis443y ago

WebGPU is going to be a major component in this. Modern GPUs, prevalent in mobile devices, desktops and laptops, are more than enough to do all of this client-side.

d4rkp4ttern3y ago

Everyone seems to assume that all the “tricks” behind training ChatGPT are known. The only clues are in papers from ClosedAI like the InstructGPT paper. So we assume there is Supervised Fine Tuning, then Reward Modeling and finally RLHF.

But there are most likely other tricks that ClosedAI has not published. These probably took years of R&D to come up with; others trying to replicate ChatGPT would need to come up with these tricks on their own.

Also curiously the app was released in late 2022 while the knowledge cutoff is 2021 — I was curious why that might be, and one hypothesis I had was that it may have been because they wanted to keep the training data fixed while they iterated on numerous methods, hyperparameter tuning etc. All of these are unfortunately a defensive moat that ClosedAI has.

pavelstoev3y ago

Training a ChatGPT-beating model for much less than $85,000 is entirely feasible. At CentML, we're actively working on model training and inference optimization without affecting accuracy, which can help reduce costs and make such ambitious projects realistic. By maximizing (>90%) GPU and platform hardware utilization, we aim to bring down the expenses associated with large-scale models, making them more accessible for various applications. Additionally, our solutions also have a positive environmental impact, addressing the excess CO2 concerns. If you're interested in learning more about how we are doing it, please reach out via our website: https://centml.ai

breck3y ago

Just want to say SimonW has become one of my favorite writers covering the AI revolution. Always fun thought experiments with linked code and very constructive for people thinking about how to make this stuff more accessible to the masses.

jedberg3y ago

With the explosion of LLMs and people figuring out ways to train/use them relatively cheaply, unique data sets will become that much more valuable, and will be the key differentiator between LLMs.

Interestingly, it seems like companies that run chat programs where they can read the chats are best suited to building "human conversation" LLMs, but someone who manages large text datasets for others is in the perfect place to "win" the LLM battle.

nope963y ago

I remember watching one of the final episodes of Connections 3: With James Burke, and he casually said we'd have personal assistants that we could talk to (in our PDAs). That was 1997 and I knew enough about computers to think he was being overly optimistic about the speed of progress. Not in our lifetimes. Guess I was wrong!

alecco3y ago

Interesting blog but the extrapolations are way overblown. I tried one of the 30bn models and it's not even remotely close to GPT-3.

Don't get me wrong, this is very interesting and I hope more is done in the open models. But let's not over-hype by 10x.

gessha3y ago

We need a DAWNBench* benchmark for training ChatGPT the fastest and cheapest.

* https://dawn.cs.stanford.edu/benchmark/

cavisne3y ago

There is a minimum cluster size to get good utilization of the GPUs. $1 an hour per chip might get you one A100 but it won’t get you hundreds clustered together.
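
The arithmetic behind this is short enough to sketch. The GPU-hour figure below is an assumption: it is the A100 count commonly cited from the LLaMA paper for the 7B model, so check it against the paper before relying on it. The $1/hour rate is the rule of thumb discussed in this thread, the $4/hour rate is a hypothetical on-demand price for comparison, and the 256-GPU cluster size is likewise just an example. The wall-clock estimate assumes perfect scaling across GPUs, which real clusters rarely achieve.

```python
GPU_HOURS_7B = 82_432  # assumption: A100 GPU-hours for a 7B-parameter training run

def training_cost(gpu_hours, price_per_gpu_hour):
    """Total rental cost in dollars for a given number of GPU-hours."""
    return gpu_hours * price_per_gpu_hour

def wall_clock_days(gpu_hours, n_gpus):
    """Elapsed days if the work divides perfectly across n_gpus (optimistic)."""
    return gpu_hours / n_gpus / 24

print(training_cost(GPU_HOURS_7B, 1.0))    # at the $1/hour rule of thumb
print(training_cost(GPU_HOURS_7B, 4.0))    # at a hypothetical $4/hour rate
print(wall_clock_days(GPU_HOURS_7B, 1))    # one GPU: several years
print(wall_clock_days(GPU_HOURS_7B, 256))  # 256 GPUs: about two weeks
```

The second pair of numbers is the point being made above: the budget only makes sense if the cheap rate is available for hundreds of GPUs at once, since a single cheap GPU would take years.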

ChumpGPT3y ago

I'm not so smart and I don't understand a lot about ChatGPT, etc, but could there be a client side app like Folding@home that would allow millions of people to give processing power to train an LLM?

TMWNN3y ago

Hey, that means it can be turned into an Electron app!
