ReasoningGym: Reasoning Environments for RL with Verifiable Rewards

(arxiv.org)

105 points · t55 · 1y ago · 28 comments

phh · 1y ago · 7 in thread

Cool cool. I'm a bit put off by calling it "reasoning"/"thought". These RL targets can be achieved without a "thinking" model, but still cool. Gotta love the brainfuck task.

I personally think that Gemini 2.5 Pro's superiority comes from having hundreds or thousands of RL tasks (without any proof whatsoever, so rather a feeling). So I've been wanting an "RL Zoo" for quite a while. I hope this project won't be a one-off and will be maintained long term with many external contributions to add new targets!

CuriouslyC · 1y ago

Gemini 2.5 Pro's superiority is IMO largely driven by their long context support and training methodology. Compare Gemini as a beta reader for a 100k token book with GPT4.1 or Claude 4, and it becomes quite clear how much more effectively it can reason across its context than other comparable models. This also makes it much better for architecting new features into a system, since you can load a lot of the current system into the context and it'll conform to existing styles and architecture patterns more closely.

jacob019 · 1y ago

Agreed, 2.5 Flash too. I analyze a large JSON document of metrics for pricing decisions. It is typically around 200k, occasionally up to 1M, and Gemini 2.5 significantly outperforms on my task. It isn't 100%, but role playing gets close. I suppose that's a form of inference-time compute.

t55 (OP) · 1y ago

For a 100k-token context window, all those models are comparable, though.

Gemini 2.5 Pro shines at 200k+ tokens.

t55 (OP) · 1y ago

> I personally think that Gemini 2.5 Pro's superiority comes from having hundreds or thousands RL tasks (without any proof whatsoever, so rather a feeling).

Given that GDM pioneered RL, that's a reasonable assumption

flowerthoughts · 1y ago

Assuming that by GDM you mean Google DeepMind: they pioneered RL with deep nets as the policy function estimator, the deep nets being a result of CNNs and massive improvements in hardware parallelization at the time.

RL was established, at the latest, with Q-learning in 1989: https://en.wikipedia.org/wiki/Q-learning
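For readers who have not seen it, the Q-learning update mentioned above can be sketched in a few lines. This is a minimal tabular sketch on a made-up 1-D corridor; the environment, the function name `q_learning`, and all hyperparameter values are assumptions chosen for illustration, not anything taken from the thread or the linked article.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9,
               epsilon=0.1, max_steps=1000, seed=0):
    """Tabular Q-learning on a hypothetical corridor of n_states cells.

    Actions: 0 = left, 1 = right. Entering the last cell yields reward 1
    and ends the episode; every other transition yields reward 0.
    """
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            # Epsilon-greedy action choice, with random tie-breaking.
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            # The terminal state has no future value to bootstrap from.
            future = 0.0 if s_next == goal else gamma * max(q[s_next])
            # Core update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            q[s][a] += alpha * (r + future - q[s][a])
            s = s_next
            if s == goal:
                break
    return q
```

After training, the "right" action should carry the higher value in every non-terminal cell, which is the whole policy this toy problem needs.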

olliestanley · 1y ago

We definitely plan to maintain the project for as long as there is interest in it. If you have ideas for new tasks, we'd always welcome contributions!

phh · 1y ago

Thanks for the answer! As a toy project I implemented wikiracing with TRL. I'll probably try to PR that to your gym. (Can't say that I managed to improve the score with it, though.)

sadboots · 1y ago · 3 in thread

for the love of god, please stop overfitting on GSM8K

olliestanley · 1y ago

Difficult one. GSM8K and MATH evals (both reported in Reasoning Gym paper) are common in smaller model RL papers for a reason, which is that smaller models can get decent scores on them, unlike fresher & harder benchmarks.

Part of the aim of RG, though, is to be used as a difficulty-adjustable & non-repeating eval, so if people think it's a good benchmark, perhaps it will allow this status quo to shift!
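To make "difficulty-adjustable & non-repeating" concrete, here is a minimal sketch of a procedurally generated task with a verifiable reward. It is not the Reasoning Gym API: `make_task`, `score_answer`, and the addition task itself are hypothetical names and choices made up for illustration.

```python
import random


def make_task(seed: int, difficulty: int) -> dict:
    """Generate one addition task (hypothetical example task).

    A higher difficulty means more operands and larger numbers. The seed
    makes each sample reproducible, and a fresh seed gives a fresh sample.
    """
    rng = random.Random(seed)
    operands = [rng.randint(1, 10 ** difficulty) for _ in range(difficulty + 1)]
    expression = " + ".join(str(x) for x in operands)
    return {"question": f"What is {expression}?", "answer": str(sum(operands))}


def score_answer(task: dict, response: str) -> float:
    """Verifiable reward: 1.0 for an exact match with the known answer, else 0.0."""
    return 1.0 if response.strip() == task["answer"] else 0.0


# Usage: evaluate on seeds that were never used for training.
eval_tasks = [make_task(seed, difficulty=3) for seed in range(1000, 1010)]
```

Because the answer is computed while the question is generated, scoring needs no human labels, and an eval set can be regenerated from new seeds instead of being reused.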

i5heu · 1y ago

It looks like your neural network is overfitted on seeing overfitting where there is none.

Prejudice is a form of overfitting, IMHO.

t55 (OP) · 1y ago

agree, the RG evals feel like a fresh breeze

ninakostoska · 1y ago · 2 in thread

Cool to see NVIDIA’s most recent reasoning model [1] already uses Reasoning Gym as a large part of their data mixture

[1] https://arxiv.org/abs/2505.24864

t55 (OP) · 1y ago

> prolonged RL training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling

does this mean that previous RL papers claiming the opposite were possibly bottlenecked by small datasets?

yorwba · 1y ago

No, they do not point to any specific examples of novel reasoning strategies that were uncovered, nor is their sampling that extensive (at most 256 samples vs the 2048 used in https://limit-of-rlvr.github.io/ ).
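For context on why the sample count matters: claims about what is reachable "under extensive sampling" are usually measured with pass@k, the probability that at least one of k samples is correct. The sketch below is the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k). It assumes the 256 and 2048 figures above refer to k in such an evaluation, and it is not taken from either paper's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c were correct.

    Equals 1 minus the probability that k samples drawn without
    replacement from the n generated ones are all incorrect.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A problem solved once in 2048 samples contributes to pass@2048 but will usually not show up at k = 256, which is why the size of the sampling budget affects how strong an "inaccessible to the base model" claim can be.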

jimmySixDOF · 1y ago · 2 in thread

RL is proving to be a weird science lately:

> Spurious Rewards: Rethinking Training Signals in RLVR
>
> TL;DR: We show that you can do RLVR on Qwen2.5-Math models with *completely random or incorrect rewards*, and still get massive math benchmark gains. [1]

All of the following spurious rewards give 15-20+ points on MATH-500 when RLVR training Qwen2.5-Math-7B:

- RLVR + format reward (reward responses with `\boxed{}`): *+16.4%*
- RLVR + incorrect reward (only incorrect answers rewarded): *+24.6%*
- RLVR + random reward: *+21.4%*
- (as a reference) RLVR + ground-truth reward: +28.8%

How can these spurious rewards possibly work? Can we get similar gains on other models with broken rewards?
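The four reward variants listed above can be sketched as plain functions over a model response. This is a rough illustration only, not the paper's implementation: the function names, the simple `\boxed{}` regex (which ignores nested braces), the requirement that an "incorrect" answer still be parseable, and the coin-flip probability of 0.5 are all assumptions.

```python
import random
import re

# Matches \boxed{...} without nested braces (a simplifying assumption).
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_answer(response: str):
    """Return the content of the last \\boxed{...} in the response, or None."""
    matches = BOXED.findall(response)
    return matches[-1].strip() if matches else None


def ground_truth_reward(response: str, gold: str) -> float:
    """Reference reward: 1.0 only when the boxed answer equals the gold answer."""
    return 1.0 if extract_answer(response) == gold else 0.0


def format_reward(response: str, gold: str) -> float:
    """Rewards any response that contains a boxed answer, right or wrong."""
    return 1.0 if extract_answer(response) is not None else 0.0


def incorrect_reward(response: str, gold: str) -> float:
    """Rewards only boxed answers that differ from the gold answer."""
    answer = extract_answer(response)
    return 1.0 if answer is not None and answer != gold else 0.0


def random_reward(response: str, gold: str, rng=random, p: float = 0.5) -> float:
    """Ignores the response entirely and flips a biased coin."""
    return 1.0 if rng.random() < p else 0.0
```

Only `ground_truth_reward` carries any information about correctness; the other three are the "spurious" signals the quoted post reports gains from.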

> Learning to Reason without External Rewards
>
> Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. [2]

[1] https://rethink-rlvr.notion.site/Spurious-Rewards-Rethinking...

[2] https://arxiv.org/abs/2505.19590
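The Intuitor abstract quoted above says self-certainty scores replace external rewards inside GRPO. A minimal sketch of that idea follows. Both pieces rest on assumptions that should be checked against the paper: self-certainty is taken here to be the mean KL divergence from the uniform distribution to the model's next-token distribution, and the group normalization uses the population standard deviation with a small epsilon. The function names are hypothetical.

```python
import math
import statistics


def self_certainty(token_dists):
    """Mean KL(U || p) over the positions of one response.

    token_dists holds one next-token probability list per generated token.
    All probabilities must be strictly positive, as softmax outputs are.
    Assumed definition: KL(U || p) = -(1 / V) * sum_j log(V * p_j).
    """
    total = 0.0
    for probs in token_dists:
        vocab = len(probs)
        total += -sum(math.log(vocab * p) for p in probs) / vocab
    return total / len(token_dists)


def group_relative_advantages(scores, eps=1e-6):
    """GRPO-style advantages: standardize each score within its sampling group."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    return [(s - mean) / (std + eps) for s in scores]


# Usage: score a group of sampled responses by confidence, no labels needed.
group = [
    [[0.97, 0.01, 0.01, 0.01]],  # confident response
    [[0.40, 0.30, 0.20, 0.10]],  # less confident response
    [[0.25, 0.25, 0.25, 0.25]],  # maximally uncertain response
]
advantages = group_relative_advantages([self_certainty(d) for d in group])
```

The second function is unchanged whether the scores come from a verifier or from the model's own confidence, which is what makes the swap described in the abstract cheap to try.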

spmurrayzzz · 1y ago

I think the fact that spurious rewards were predominantly only effective for Qwen may suggest that it was triggering some shift in its language distribution. If you use those models long enough you'll see a ton of Mandarin that makes its way into your outputs, and their logits tend to look more "confident" than the ones for English tokens.

So the reward value shifting may act as a sort of unintentional regularization technique (similar to adding noise to the discriminator input in GAN archs).

t55 (OP) · 1y ago

yeah, RLVR is still nascent and hence there's lots of noise.

> How can these spurious rewards possibly work? Can we get similar gains on other models with broken rewards?

it's because in those cases, RLVR merely elicits the reasoning strategies already contained in the model through pre-training

this paper, which uses Reasoning Gym, shows that you need to train for way longer than those papers you mentioned to actually uncover novel reasoning strategies: https://arxiv.org/abs/2505.24864

starzmustdie · 1y ago

GitHub: https://github.com/open-thought/reasoning-gym
