Z80-μLM is a character-level language model with 2-bit quantized weights ({-2,-1,0,+1}) that runs on a Z80 with 64KB RAM. The entire thing (inference, weights, chat UI) fits in a 40KB .COM file that you can run in a CP/M emulator, and hopefully even on real hardware!
It won't write your emails, but it can be trained to play a stripped-down version of 20 Questions, and it sometimes manages to maintain the illusion of simple but terse conversations with a distinct personality.
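For a sense of how four-valued weights keep the footprint this small: with levels {-2,-1,0,+1}, each weight needs only 2 bits, so four pack into a byte. A hypothetical packing sketch (the bit layout here is an assumption, not necessarily the project's actual storage format):

```python
# Hypothetical sketch of 2-bit weight packing, four weights per byte.
# LEVELS matches the post; the little-endian-within-byte layout is
# an illustrative assumption.

LEVELS = [-2, -1, 0, 1]  # 2-bit codes 0..3

def pack(weights):
    """Pack weights (each a value in LEVELS) into bytes, 4 per byte."""
    codes = [LEVELS.index(w) for w in weights]
    out = bytearray()
    for i in range(0, len(codes), 4):
        b = 0
        for j, c in enumerate(codes[i:i + 4]):
            b |= c << (2 * j)  # lowest 2 bits hold the first weight
        out.append(b)
    return bytes(out)

def unpack(data, n):
    """Recover n weights from the packed bytes."""
    return [LEVELS[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)]
```

At four weights per byte, e.g. 30KB of packed data would hold about 120,000 weights, which shows why most of a 40KB .COM can plausibly be weight data.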
--
The extreme constraints nerd-sniped me and forced interesting trade-offs: trigram hashing (typo-tolerant, but loses word order), 16-bit integer math, and some careful massaging of the training data so I could keep the examples 'interesting'.
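The trigram-hashing trade-off can be sketched like this (bucket count and hash function are illustrative, not the project's actual values): a single typo only perturbs the three trigrams that touch it, but since the features are an unordered bag of counts, word order is gone.

```python
# Sketch of character-trigram hashing into a fixed-size count vector.
# N_BUCKETS and the hash are illustrative assumptions.

N_BUCKETS = 1024  # hypothetical feature-table size

def trigram_features(text):
    text = " " + text.lower() + " "  # pad so word edges form trigrams
    feats = [0] * N_BUCKETS
    for i in range(len(text) - 2):
        h = 0
        for ch in text[i:i + 3]:          # small multiplicative hash,
            h = (h * 31 + ord(ch)) & 0xFFFF  # kept 16-bit friendly
        feats[h % N_BUCKETS] += 1
    return feats

a = trigram_features("are you an animal")
b = trigram_features("are you an animel")  # one-character typo
overlap = sum(min(x, y) for x, y in zip(a, b))
```

Only the three trigrams containing the typo change, so most of the feature counts survive; by the same mechanism, sentences that reorder the same words hash to nearly identical feature bags.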
The key was quantization-aware training that accurately models the inference code limitations. The training loop runs both float and integer-quantized forward passes in parallel, scoring the model on how well its knowledge survives quantization. The weights are progressively pushed toward the 2-bit grid using straight-through estimators, with overflow penalties matching the Z80's 16-bit accumulator limits. By the end of training, the model has already adapted to its constraints, so no post-hoc quantization collapse.
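A toy sketch of that scheme: grid snapping, the straight-through trick, and a 16-bit overflow penalty (NumPy stand-ins; the actual hyperparameters, scales, and autograd plumbing are assumptions):

```python
import numpy as np

GRID = np.array([-2.0, -1.0, 0.0, 1.0])  # the 2-bit weight levels
ACC_MAX = 32767                          # signed 16-bit accumulator limit

def quantize(w):
    """Snap each weight to the nearest grid point."""
    return GRID[np.abs(w[..., None] - GRID).argmin(axis=-1)]

def ste_forward(x, w):
    """Forward pass through quantized weights. In an autograd framework
    you would write x @ (w + (quantize(w) - w).detach()) so the gradient
    flows straight through to the float weights."""
    return x @ quantize(w)

def overflow_penalty(x, w):
    """Penalize partial sums that would overflow a Z80-style 16-bit
    accumulator during the integer dot product."""
    partial = np.cumsum(x[:, :, None] * quantize(w)[None, :, :], axis=1)
    return np.maximum(np.abs(partial) - ACC_MAX, 0.0).sum()
```

The float weights keep moving during training while the loss (and the penalty) always sees the quantized values, which is what lets the model adapt to the grid before export.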
Eventually I ended up spending a few dollars on the Claude API to generate 20 Questions data (see examples/guess/GUESS.COM). I hope Anthropic won't send me a C&D for distilling their model against the ToS ;P
But anyway, happy code-golf season everybody :)
https://i.imgur.com/6TRe1NE.png
Thank you for posting! It's unbelievable how someone sometimes just drops something that fits right into what you're doing. However bizarre it seems.
I developed a browser-based CP/M emulator & IDE: https://lockboot.github.io/desktop/
I was going to post that, but wanted a 'cool demo' first, and fell down the rabbit hole.
I wrote a console-based emulator and a simple CP/M text-adventure game somewhat recently:
https://github.com/skx/cpmulator/
At some point I should rework my examples/samples to become a decent test-suite for CP/M emulators. There are so many subtle differences out there.
It seems I could even upload a zipfile of my game, but the escape-codes for clearing the screen don't work, sadly:
The interaction is surprisingly good despite the lack of attention mechanism and the limitation of the "context" to trigrams from the last sentence.
This could have worked on 60s-era hardware and would have completely changed the world (and science fiction) back then. Great job.
Tin-foil hat on: I think a huge part of the major buyout of RAM by AI companies is to keep people from realising that we are essentially at the home-computer-revolution stage of LLMs. I have a 1TB-RAM machine which, with custom agents, outperforms all the proprietary models. It's private, secure, and won't let me be monetized.
Ultimately, if you can build an ultra tiny model that can talk and learn on the fly, you've just fully localized a personal assistant like Siri.
Not exactly "minimum viable", but a "what if RNNs were good for LLMs" case study.
-> insanely fast on CPUs
Edit: The fact this runs on a smartphone means it is highly relevant. My only question is: how do we give such a model an "unlimited" context window, so it can digest as much as it needs? I know some models know multiple languages; I wouldn't be surprised if sticking to only English would reduce the model size and hardware needs, making it even smaller and tighter.
I doubt it would be able to make good use of a large context window, though.
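For what it's worth, the "insanely fast on CPUs" point above comes down to recurrence doing a fixed amount of work per token, with hidden state that never grows (unlike attention's KV cache, which scales with context length). A minimal sketch with made-up dimensions:

```python
import numpy as np

H = 64  # hypothetical hidden size

rng = np.random.default_rng(0)
Wx = 0.1 * rng.normal(size=(H, H))  # input weights
Wh = 0.1 * rng.normal(size=(H, H))  # recurrent weights

def step(state, x):
    """One token of recurrent inference: O(H^2) work and O(H) memory,
    no matter how long the conversation has been."""
    return np.tanh(Wx @ x + Wh @ state)

state = np.zeros(H)
for _ in range(1000):  # 1000 tokens later, memory is still H floats
    state = step(state, rng.normal(size=H))
```

The flip side is exactly the doubt above: everything the model "remembers" must be squeezed through that fixed-size state, so a huge context window buys little.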
You can buy a kid's Tiger Electronics-style toy that plays 20 Questions.
It's not like this LLM is a bastion of glorious efficiency; it's just stripped down to fit on the hardware.
Slack/Teams handle company-wide video calls, can render anything a web browser can, and run an entire App Store of apps, all from a cross-platform application.
Including Jira in the conversation doesn’t even make logical sense. It’s not a desktop application that consumes memory. Jira has such a wide scope that the word “Jira” doesn’t even describe a single product.
The 4th Gen iPod touch had 256 meg of RAM and also did those things, with video calling via FaceTime (and probably others, but I don't care). Well, except "cross platform", what with it being the platform.
That's a bug, not a feature, and it's strongly coupled to the root cause of Slack's bloat.
“Planting Undetectable Backdoors in Machine Learning Models”
“ … On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation. Importantly, without the appropriate "backdoor key", the mechanism is hidden and cannot be detected by any computationally-bounded observer. We demonstrate two frameworks for planting undetectable backdoors, with incomparable guarantees. …”
It could with a network this small. More generally this falls under "interpretability."
(edit: change url)
This means that a directly translated 40 KB Z80 executable might be a tight squeeze on that mainframe, because 40K > 32K, counting words, not bytes. Of course if most of that size is just 2-bit weight data then it might not be so bad.
ELIZA running on later hardware would have been a different story, with the Z80 - released in 1976 - being an example.
It's just one-shot AI slop - literally, the prompt was 'make a web based version of [github url of this project]' and it spat this out. It appears to work fine.
I'll keep it up for a couple of months and then it'll be auto-deleted, no sense in keeping it around longer than that.
EDIT: Actually thinking about it some more…
- Imagine what you could do with 16-bit games of the era with one or more of these models embedded. Swap the model depending on the use case within the game. Great for adventures, RPGs, strategy, puzzle, and trading games (think Elite). With 512K or 1MB of RAM, plus 2-4 floppies (which became increasingly common as the era wore on), you could probably do a lot, especially if the outcomes of conversations can result in different game outcomes.
- Back in the day nobody was really trying to do anything serious with AI on 8 or even most 16-bit machines, because nobody thought they were powerful enough to do anything useful with. Now the thinking has changed to how much somewhat useful intelligence can I cram into the least powerful device, even if that’s only for fun?
- Imagine showing this running on a CP/M machine, like the C128, to a serious AI researcher working back in the 1980s. Minds blown, right?
- Now spool forward 10 years into the 1990s and think what PC hardware of that era would have been capable of with these limited language models. I wonder what that era might have looked like with something that seems like somewhat useful conversational AI? A sort of electro-steampunk-ish vibe maybe? People having really odd conversations with semi-capable home automation running via their PCs.
I tried on a cycle-accurate emulator of a TRS-80 Model I with Omikron CP/M mapper. Most Z-80 machines of the time were 4MHz, but the TRS-80 was only 1.77 MHz.
1. Type "GUESS", get question prompt.
2. User types: "Are you an animal?", ENTER key
3. Wait 25 seconds
4. Program prints "N"
5. Wait 20 seconds
6. Program prints "O"
7. Wait 23 seconds
8. Program prints linefeed, returns to question prompt
Total time to return 2-char answer to user's question: 1 min 9 sec or so. I bet a longer answer would take proportionally longer.
"The wonder isn't that it does it well, it's a wonder it does it at all."
I think I can do a little bit better; maybe 10% faster.
Quake 3 is probably the last game where you would expect a chatbot, as there are few games where storytelling matters less, and it is a very little-known feature, but Quake 3 bots can react to what you say in the chat, in addition to the usual taunts.
But that's the thing: Quake 3 can do it because it is inconsequential. In a story-driven game like an RPG, NPCs have a well-defined spot in the story and gameplay; they tell you exactly what you need to know, so as not to disrupt the flow of the story. Tell you too much, and they spoil the big reveal; tell you too little, and you don't know what to do; tell you irrelevant details, and you get lost chasing them. Dialogue has to be concise and to the point, so that those who don't really care know what to do to advance the story, but with enough flavor to make the world feel alive. It is really hard to find the right balance, and if on top of that you have to incorporate a chatbot, it borders on impossible.
It looks like a good idea on the surface, but it most likely isn't, unless it is clearly not part of the main gameplay loop, as in Quake 3.
Some people have had some success using a (big) LLM as a DM in D&D, which I think is easier since it can make up the story as it advances; it is much harder to invent game elements in a computer RPG that are not programmed in.
Biggest pain point is likely the text input.
Have you experimented with having it less quantized, and evaluated the quality drop?
Regardless, very cool project.
It depends on the model, but from my experiments (quantizing one layer of a model to 2-bit, then training the model with that layer in 2-bit to repair the damage), the first layer is the most sensitive, and yes, the last layer is sensitive too. The middle layers tolerate quantization best.
Different components of a layer also have different sensitivities; e.g. the MLP down-projection block damages the model the most when quantized, while the Q projection in self-attention damages it the least.
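The probe described above (quantize one layer, measure the damage, restore, repeat) can be sketched like this; the model and loss function here are stand-ins, not the actual experiment:

```python
import numpy as np

GRID = np.array([-2.0, -1.0, 0.0, 1.0])

def quantize(w):
    """Snap weights to the 2-bit grid."""
    return GRID[np.abs(w[..., None] - GRID).argmin(axis=-1)]

def sensitivity(layers, loss_fn):
    """Per-layer loss increase when that layer, alone, is quantized."""
    base = loss_fn(layers)
    deltas = []
    for i, w in enumerate(layers):
        probed = list(layers)    # shallow copy so only layer i is touched
        probed[i] = quantize(w)
        deltas.append(loss_fn(probed) - base)
    return deltas
```

Ranking the deltas tells you which layers (or, with finer granularity, which weight matrices within a layer) can absorb 2-bit quantization cheaply and which need the repair training described above.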
Even with modern supercomputing the computation would be outpaced by the heat death of the universe, so token output must be limited to a single integer.
A web version would also be cool.
Speaking of which - I remember my first digital camera (a 1-megapixel Fujitsu using SmartMedia)… it used so much power that you could take 20-30 photos and then needed to replace all 4 batteries lol
*burns you at the stake*