
pbhj · 2y ago · 0 points

In text, we don't often count in series, and it seems likely that we often choose a non-counting sequence: like 'I chose options 1, 2, 7' or 'my code was 0 1 2 5', whatever.

Unless training included line-level skips, rather than just next-word skips (like word2vec) or concept-level associations? At the line level, or paragraph level, ordered numerical sequences are obviously very common in formal texts or in code.

I've seen sentence-based training; I suppose for code (which GPT-4 seems to excel at) line-level training would be essential.

Can anyone recommend a mid-level read on this, covering the different modes of training and such? I'm happy with a bit of code and undergrad-level maths. Thanks.


2 comments · 2 top-level

jw1224 · 2y ago

> Unless training included line-level skips

Yes, of course — GPT-4 was trained on all common character sequences, including linebreaks and other invisible characters.

You can see how it works here: https://platform.openai.com/tokenizer
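To make that concrete, here is a rough sketch of why linebreaks matter. It uses a toy regex tokenizer I made up for illustration (`toy_tokenize` and `line_start_numbers` are hypothetical names, and this is not GPT-4's actual BPE tokenizer). The point is only that a newline is an ordinary token in one flat stream, so a numbered list shows up as a counting pattern that next-token prediction can pick up, without any separate line-level objective.

```python
import re

# Toy tokenizer (an assumption for illustration, NOT GPT-4's real tokenizer):
# a newline, a run of digits, a word, or a single punctuation mark is one token.
TOKEN_RE = re.compile(r"\n|\d+|[A-Za-z_]+|[^\w\s]")


def toy_tokenize(text):
    """Split text into a flat list of tokens, keeping newlines as tokens."""
    return TOKEN_RE.findall(text)


def line_start_numbers(tokens):
    """Collect the numbers that appear as the first token of a line."""
    numbers = []
    at_line_start = True
    for tok in tokens:
        if at_line_start and tok.isdigit():
            numbers.append(int(tok))
        # The next token starts a line only if this one is a newline.
        at_line_start = (tok == "\n")
    return numbers


text = "1. load data\n2. train model\n3. evaluate"
tokens = toy_tokenize(text)
print(tokens)
print(line_start_numbers(tokens))
```

In the token list, each list number sits right after a newline token, so "what comes after a linebreak" is just another next-token prediction.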

Nonetheless, it doesn't need to have seen examples of line-level counting before. The "concept-level associations" you mentioned are an emergent property of the model: it forms its own concept-level associations as a result of being trained on such a massive dataset. That's what enables it to output original content that has never been seen before.

Zambyte · 2y ago

Maybe we count in series a lot more than you think.

https://www.youtube.com/watch?v=WO2X3oZEJOA
