
0 points · rthnbgrredf · 1y ago · 0 comments

I'm still not convinced that this isn't a tokenizer issue.

Were you able to find a substantial number of questions that do not fall into the letter-counting or word-shuffling domain, i.e., problems that are clearly unrelated to the fundamental tokenizer issue of modern LLMs? Otherwise, I would argue that your paper simply proves that the issue still exists.

1 comment · 1 top-level

enum · 1y ago

It’s not that the benchmark is hard, but that the reasoning models do so much better than the non-reasoning models. That suggests it is testing a capability that reasoning models have that non-reasoning models do not.

Getting to 100% may require tokenization innovation, sure.
