jiggawatts · 2y ago · 0 points

H100 + quantisation + algorithmic improvements would be sufficient to explain the speed boost.

If you have enough compute available, which OpenAI definitely does, the best current technique is mixed precision with post-quantisation fine-tuning to restore quality. That's most probably how all of the "turbo" models work. Take a model that was trained at 16 or 32 bits per parameter, quantise it down to a mixture of 4, 8, and 16 bits, and then fix it up with an additional training pass that uses the original full-fat model's predictions as the training target. With access to the raw model, this training can match the full output distribution, so the probability of every candidate token is considered and adjusted instead of just the top word. Third parties fine-tuning against GPT-4 chats can't do this, even with the collected samples, because they only see the individual selected tokens/words rather than the full probability distribution.
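To make the difference between "full distribution" and "top word only" concrete, here is a minimal sketch in plain Python. It is not a description of how OpenAI does it. The function names (`quantize`, `kl_divergence`, `hard_label_loss`) and the toy probabilities are all made up for illustration, and the quantiser is a simple symmetric uniform one chosen only because it is short.

```python
import math

def quantize(weights, bits):
    """Toy symmetric uniform quantiser (an illustrative assumption):
    snap each weight to the nearest of 2**bits - 1 evenly spaced levels."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / levels or 1.0
    return [round(w / scale) * scale for w in weights]

def kl_divergence(teacher_probs, student_probs):
    """Soft-target loss: compares the student with the teacher's FULL distribution."""
    return sum(p * math.log(p / q)
               for p, q in zip(teacher_probs, student_probs) if p > 0)

def hard_label_loss(token_index, student_probs):
    """Hard-label loss: only looks at the one token that was actually emitted."""
    return -math.log(student_probs[token_index])

# Hypothetical next-token probabilities over a 3-token vocabulary.
teacher   = [0.6, 0.3, 0.1]   # original full-precision model
student_a = [0.6, 0.3, 0.1]   # quantised model that matches the teacher everywhere
student_b = [0.6, 0.1, 0.3]   # matches on the top token, wrong on the rest

# Someone who only has the sampled token (index 0) cannot tell the students apart.
print(hard_label_loss(0, student_a), hard_label_loss(0, student_b))

# With the full distribution, student_b's mistake on the tail shows up as a nonzero loss.
print(kl_divergence(teacher, student_a), kl_divergence(teacher, student_b))

# Quantising to 4 bits moves each weight by at most half a quantisation step.
print(quantize([0.5, -1.0, 0.25, 0.03], bits=4))
```

Both students get the same hard-label loss, so training on collected chat samples gives no signal about the tail of the distribution, while the soft-target loss separates them immediately.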

kridsdale3 · 2y ago

To use a graphical term, would it be fair to call that process "dithering"?

jiggawatts (OP) · 2y ago

It's closer to the quantisation step in JPEG compression: values are rounded to a coarser grid, rather than noise being spread around as in dithering.
