CS336: Language Modeling from Scratch

(cs336.stanford.edu)

558 points · kristianpaul · 26d ago · 51 comments

50 comments · 19 top-level

skerit · 26d ago · 10 in thread

> GPU compute for self-study

Those suggestions they make for a B200 start at $4.99 an hour.

Is that really required for starting out? I've been tinkering with my own from-scratch LLM, but in the early phases I don't need anything more than a 4090 on Vast.ai.

marcelroed · 26d ago

TA here. Definitely not! In fact we explicitly added sections in the first assignment to allow for scaling down to even local compute (M-series GPUs). For assignment 2 there are a few regions that require Triton support for your GPU, but everything can be adapted for much cheaper GPUs.

We were lucky enough to get Blackwell GPUs for Stanford students this year, which is why the writeups are written mostly around them.

derefr · 26d ago

I imagine it's a lot like FPGAs:

- the hardware you need for a production use-case is relatively small, because production {models, bitstreams} have been heavily size-optimized, stripping out everything not needed to get a good result for the target use-cases

- but the hardware you need when tinkering/learning how to design {compute kernels, IP blocks} in the first place, must be quite a bit more powerful / higher-capacity, because your experiments will intentionally be the opposite of optimized: they'll be built for legibility / introspectability / debuggability at every level, which massively inflates and de-optimizes the resulting {model, bitstream}.

(And, to be clear here, "running someone else's finished model, which was designed and optimized to be used on something like a 4090, against your own prompt" is a kind of experimenting, which is cheap, in the same way that "deploying someone else's pre-baked FPGA bitstream, that was designed and synthesized for a $20 target FPGA, onto your own instance of that $20 FPGA, and then feeding your own input signals to it" is cheap. But that's not the kind of experimenting you'd be doing in this course while learning to design your own models!)

grahameb · 26d ago

It seems strange that the required resources aren't provided by the educational institution?

marcelroed · 26d ago

We do provide resources for enrolled students. The online suggestions are for external students or Stanford students who we weren't able to admit.

ReptileMan · 26d ago

Two schools of thought. The first: people are paying 100K per year, so we should provide everything. The second: they are paying 100K per year, so do you think they will care about a couple of hundred more?

Maxatar · 26d ago

It says it's for self-study, i.e., those who are not enrolled in the course.

root-parent · 26d ago

You don't even need a GPU to train your own LLM.

_0ffh · 26d ago

You're right to be sceptical. I have trained reasonably good SLMs for the TinyStories dataset on my 4060Ti (16GB) with no problems. You'll only encounter problems if you want to test whether your ideas scale up to models any bigger than "arguably tiny".

flakiness · 26d ago

I believe these are affordable enough for the intended audience (which is Stanford undergrad/master's students).

mrcrm9494 · 26d ago

For them, Modal is sponsoring the compute, as stated on the website; the prices are for remote followers.

fg137 · 26d ago · 3 in thread

I recently completed the 2025 version of this course (videos + most assignments, skipping some of the most costly parts of the tasks). That's quite something. There is a lot going on in the first two assignments, which required a ton of thinking and debugging. Despite having a decent foundation in deep learning, it took me several months to finish it using bits of my after-work hours and weekends. (I am not a model part-time student by any means, and sometimes I didn't get to work on this for days, but it could have been much worse.) It's hard to imagine how enrolled Stanford students manage to submit assignments on a two-week cadence.

Coming back to the course, kudos to the course staff, including professors and TAs. They obviously put a ton of thought into designing the course, putting together slides that contain the latest updates in the field, and preparing the wonderful assignments. You get to create a real LM and explore other important parts of the LLM pipeline from small building blocks, validate each step, and see for yourself how everything comes together. You can really feel a sense of achievement after completing the assignments.

That said, while the staff obviously put a lot of effort into making this accessible to everyone, I wish they had made a bit more effort to clarify the environment requirements. Their harness works best in a Linux environment with an NVIDIA GPU, which may be taken for granted by researchers but is rare for a home computer setup. Their setup also expects specific CUDA versions and/or architectures. For following at home, the next best setup is Windows with WSL2 + NVIDIA GPU, plus leased GPUs on various platforms, none of which is exactly trivial (or cheap, for that matter). It would be nice if the staff could put together a bit more guidance in that area, especially on how someone without any compatible GPU can make the most of the course. (One thing I learned is that if you use macOS and are not careful about memory analysis, your Python code could freeze your machine and force a reboot.)
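As a rough illustration of the environment question raised above, here is a minimal sketch of a pre-flight check using only the Python standard library. It is not part of the course harness; the function name `describe_environment` is hypothetical, and finding `nvidia-smi` on the PATH or an installed `triton` package only hints that a setup might work, it does not prove the GPU or CUDA version is compatible.

```python
import importlib.util
import platform
import shutil


def describe_environment() -> dict:
    """Collect a few facts relevant to the setups discussed in this thread.

    Hypothetical helper: it only checks what is installed or on PATH,
    not whether the GPU, driver, or CUDA version is actually compatible.
    """
    return {
        # "Linux", "Darwin" (macOS), or "Windows"
        "os": platform.system(),
        "machine": platform.machine(),
        # The NVIDIA driver tools usually ship an `nvidia-smi` executable.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        # find_spec returns None when a top-level package is not installed.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "triton_installed": importlib.util.find_spec("triton") is not None,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this before renting a machine or starting an assignment gives a quick picture of which of the setups mentioned in the thread (Linux + NVIDIA, WSL2, macOS, no GPU) you are actually on.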

marcelroed · 26d ago

TA here. Noted! I now have more resources to test more environments, and will do so whenever possible. I think freezing due to memory overuse is going to be a problem with anything you code yourself, but I do think we could be more rigorous with guiding people to achieve limited memory use for the tokenizer task.

IMO the cost of renting GPUs is a bit overstated in these comments. Generally almost all of the development can be done locally and then run for a short period of time on on-demand GPUs. For assignment 1, you can run everything on your local machine, even if you don't have a GPU. For A1 and A2, you can do (most of) the tasks with only a few hours of renting. If you aren't too careful, using rental GPUs throughout will cost you around $200 in compute, but you can easily get this under $50 if you're willing to scale down many of the problems. I think we could work on making this clear and charting what these changes are.

If you have further feedback or encounter problems, feel free to open issues in the repos so we can resolve them! It's hard for us to fix issues we're not aware of.
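To put the budget figures above in terms of rental time, here is a small back-of-the-envelope sketch. It assumes the whole budget is spent at the $4.99/hour B200 starting price quoted at the top of the thread, and it ignores storage, idle time, and cheaper GPUs, so treat the numbers as a rough bound rather than an actual course estimate. The function name `rental_hours` is hypothetical.

```python
# Starting B200 price quoted earlier in the thread (USD per hour).
B200_RATE_USD_PER_HOUR = 4.99


def rental_hours(budget_usd: float, rate_usd_per_hour: float = B200_RATE_USD_PER_HOUR) -> float:
    """Hours of GPU time a budget buys at a flat hourly rate (assumption: no other costs)."""
    return budget_usd / rate_usd_per_hour


# The two budgets mentioned above: careless use (~$200) and scaled-down use (~$50).
for budget in (200, 50):
    print(f"${budget} buys about {rental_hours(budget):.1f} hours at ${B200_RATE_USD_PER_HOUR}/h")
```

Under these assumptions, $200 corresponds to roughly 40 B200-hours and $50 to roughly 10, which is consistent with "only a few hours of renting" per assignment.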

fg137 · 26d ago

Memory overuse: for context, it's about parallelism on the gloo backend with CPU. My observation is that on Linux, the same (bad) Python code will result in the process getting killed quickly, saving the user the trouble of rebooting. Not sure if the macOS behavior is expected in the first place.

GPU cost: most of us will spend at least a few hours troubleshooting to get started on a leased GPU, including but not limited to figuring out how much storage is needed, whether the CUDA version works well, etc. Doing it with no GPU is definitely possible but difficult. Plus, one issue might be that most of us just don't have enough experience working with them, resulting in more time spent figuring things out.

GitHub issues -- noted, will create any issue that I can think of.

zaptrem · 25d ago

OOM on CUDA GPUs is relatively graceful (the process crashes). However, on macOS if torch MPS tries to allocate too much memory, the whole kernel will simply lock up and the only option is to reboot the computer. I have no idea why Apple doesn’t reserve memory for stuff like the OOM/kernel watchdog, but it seems they either don’t or there is a bug.
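For the CPU-side part of the memory discussion above (for example, the tokenizer task), one cheap way to do some memory analysis before a run gets out of hand is to measure peak Python allocations on a small input first. The sketch below uses the standard-library `tracemalloc` module; the helper `peak_python_memory` and the stand-in workload `count_words` are hypothetical. Note the limitation: `tracemalloc` only tracks memory allocated through Python's allocators, so it says nothing about MPS or CUDA memory, and it will not prevent the macOS lock-up described above.

```python
import tracemalloc
from collections import Counter


def peak_python_memory(func, *args, **kwargs):
    """Run func and return (result, peak bytes allocated by Python while it ran)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory returns (current, peak) in bytes since start().
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def count_words(text: str) -> Counter:
    # Hypothetical stand-in for a tokenizer pre-counting step.
    return Counter(text.split())


if __name__ == "__main__":
    sample = "the quick brown fox " * 10_000
    counts, peak = peak_python_memory(count_words, sample)
    print(f"{len(counts)} distinct words, peak ~{peak / 1e6:.2f} MB")
```

Measuring on a small sample and extrapolating to the full dataset size gives an early warning if an approach would not fit in RAM.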

AJRF · 26d ago · 3 in thread

Can anyone answer a question: what's the minimum viable GPU to follow along with this course at home?

I have a 5080 16GB. Does this course really need more than that?

pell · 26d ago

The first section can be done on an M1 chip. I think the second one needs Triton support, so your 5080 should be fine.

alok-g · 26d ago

How about NVIDIA GeForce RTX 2060 (6 GB)? Would that be sufficient too? I am using Windows 11. Thanks.

AJRF · 26d ago

Thank you

airstrike · 26d ago · 3 in thread

I wonder if people prefer to learn this on their own or if building a community around open learning is something that others are interested in.

danbrooks · 26d ago

I'd be interested in joining a Discord server.

Would be great to have a community to discuss the material - even if folks can't commit to the full course.

silvretta · 24d ago

+1, did you end up creating a Discord server?

type4 · 25d ago

I would be interested in this. I opened this thread to see if anyone posted about doing the course as a group!

tmule · 26d ago · 3 in thread

Are video lectures available online?

Bilal_io · 26d ago

YouTube playlist link from the page: https://www.youtube.com/watch?v=JuoVZkPBiKk&list=PLoROMvodv4...

aerohit · 26d ago

https://www.youtube.com/watch?v=JuoVZkPBiKk&list=PLoROMvodv4...

mindcrime · 26d ago

https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaT...

dominotw · 26d ago · 2 in thread

I recently started reading "build reasoning model from scratch", then I realized that I am not really interested in the building part and just want to understand the theory and practice behind it.

I want something like a casual, LessWrong-style, from-the-ground-up explanation.

ianand · 26d ago

In that case I humbly suggest my talk from AI Engineer World's Fair https://www.youtube.com/watch?v=ZuiJjkbX0Og

Gives you the basics on LLM internals in about 90 minutes and includes an already built model in JavaScript that you can step through in browser devtools to get as detailed as you want.

ayewo · 25d ago

Thanks for linking to it! Just in time for my own learner's journey.

storus · 26d ago · 2 in thread

Thanks for releasing this again! What are this year's changes to prior offerings?

marcelroed · 26d ago

TA here. Biggest changes are in the second assignment (distributed) where we added a bunch of memory, profiling and distributed tasks, as well as in the fifth assignment (alignment), where most of the RL tasks are fresh this year. Assignment 3 (scaling laws) was also completely updated, but in a way that might be difficult to run without substantial resources. I'm working on a way for external students to be able to run simulated experiments for free!

Assignment 1 (basics) has the most hours of preparation invested in it, and only minor modernization/bug fixes were necessary this year.

5555watch · 26d ago

How are you grading the student submissions? Also, do you catch students who fully use AI and don't follow the Honor code? If so, how?

meken · 26d ago · 1 in thread

I have fond memories of cs224d [1] taught by richardsocher. It’s a bit dated at this point as it was created in the pre-transformer era, but it was a very cool introduction to applying deep learning to NLP at the time.

[1] https://cs224d.stanford.edu

egl2020 · 26d ago

Similar thoughts here. That was when I realized the potential of the Internet: I didn't have to be a grad student at a tier 1 research university to learn about the frontier.

chainsaw10 · 26d ago · 1 in thread

I’m intrigued by this course. However I’m also curious about its prerequisite:

> Machine Learning (e.g. CS221, CS229, CS230, CS124, CS224N) You should be comfortable with the basics of machine learning and deep learning.

Anyone have a good implementation-heavy self-study resource for those topics, or experience with the recorded lectures for those Stanford courses?

alec_heif · 26d ago

I found the 2024 Spring CS224N course sufficient for this prerequisite, coupled with the textbook (chapters 1-13). Like CS336, this one also has videos and assignments available, and it being from 2024 is not a problem since the basics are mostly the same as in recent years. Notably this is not true for 336, which spends much more time discussing cutting-edge techniques, so the 2026 version there is essential.

Course: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1246...

Lecture videos: https://www.youtube.com/playlist?list=PLoROMvodv4rOaMFbaqxPD...

Textbook: https://web.stanford.edu/~jurafsky/slp3/

wandering-nomad · 25d ago · 1 in thread

Can the assignments be done on an MBP with an M5 Max? I am hoping to get my hands on one in the next couple of weeks and really want to pursue this course.

jvican · 25d ago

Most of them can, but assignment 2 requires Nvidia GPUs as it heavily depends on you getting acquainted with Triton, Nsight Compute and other low-level GPU programming tools.

armas · 26d ago · 1 in thread

I independently worked on the first two assignments over the course of a year. I learned so much! I was wondering what other courses people took on afterwards :)

sonabinu · 26d ago

I’m watching the Frontier Systems videos - https://cs153.stanford.edu/

The one issue I had with the CS336 course was the delivery of the RL components. I liked Lectures 5 & 6 from CME 295 better

https://cme295.stanford.edu/syllabus/

I’ve heard good things about the diffusion models class as well - CME 296. Seems like a good next step.

https://cme296.stanford.edu/syllabus/

delis-thumbs-7e · 26d ago · 1 in thread

I’d love to do this, but I’m afraid I still lack some of the required skills. But perhaps one day!

jimbokun · 26d ago

Then start with one of the pre-req courses.

tevlon · 26d ago

A couple of days ago, I used Claude to implement an improved version of gpt-1. I am not an ML engineer by any means; I am just a normal backend engineer. I ended up creating a hybrid between gpt-1 and modded-nanogpt (from KellerJordan).

I was able to reproduce the results of the original gpt-1 paper with my gaming PC. I don't even have a lot of VRAM. My NVIDIA GeForce RTX 2060 SUPER was able to reproduce most of the results with just 1 hour of training. I would totally recommend doing the same if you are interested in pre-training LLMs.

The code is here: https://github.com/epoyraz/modded-gpt-1 But you can also just ask Claude 4.8 or Codex 5.5.

sonabinu · 26d ago

I brought a group together to do this class using the YouTube videos and course materials available online. It is challenging but rewarding. We tackled it one lecture video per week. We started with over 30 learners, and by the last session we were down to 8.

lblock · 25d ago

I'm currently going over the lectures and assignments of this course alongside work, and I can only recommend it. It is great quality, and the 2026 version is really up to date with architecture decisions. I'm really enjoying it. I also appreciate the low-compute tips for running things on a Mac.

Oarch · 26d ago

Oh this is brilliant, I've spent the last month doing something just like this. As a challenge, no libraries allowed besides Python standard libs (so no numpy).

Started with Word2Vec, built an RNN, then an LSTM, and am halfway through building a transformer architecture.
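To give a sense of what the standard-library-only constraint looks like in practice, here is a minimal sketch of single-head scaled dot-product attention written with plain Python lists and `math`. It is an illustration under simplifying assumptions (no batching, no causal mask, no learned projections), not the commenter's code or the course's reference implementation; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of equal-length float vectors.

    Assumptions: single head, no mask, one value vector per key.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by 1/sqrt(d_k).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of value vectors, one output dimension at a time.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs
```

When all keys are identical the attention weights are uniform, so the output is simply the mean of the value vectors, which makes a convenient sanity check for a hand-rolled implementation.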

ChrisArchitect · 26d ago

AI Agent Guidelines for CS336 at Stanford https://github.com/stanford-cs336/assignment1-basics/blob/ma... (https://news.ycombinator.com/item?id=48359232)

artemonster · 26d ago

I wish there was an option to render "executable" lectures as PDF too. I'd love to scroll through these while commuting to and from work.

netheril96 · 25d ago

Is it possible to complete assignments with AMD GPUs? Preferably on Windows.
