Better HN

0 points · lambda · 11d ago · 0 comments

Not a harness issue. The harness (pi in my case) passes back the CoT (chain of thought) for all previous turns.

The Jinja template is what renders the OpenAI-format request sent by the harness into the actual string of text that will be tokenized and fed to the model. For models without preserve-thinking support, the Jinja template drops the reasoning from all but the current turn.

Here is the default Jinja template for Gemma 4: https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_...

    {#- Render reasoning/reasoning_content as thinking channel -#}
    {%- set thinking_text = message.get('reasoning') or message.get('reasoning_content') -%}
    {%- if thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
        {{- '<|channel>thought\n' + thinking_text + '\n<channel|>' -}}
    {%- endif -%}

You can see that it only preserves the thinking for messages whose index is later than the last user message (and, per the condition above, only for messages that carry tool calls). Thinking is therefore preserved for a single turn only, though that turn can include a lot of interleaved thinking and tool calls. Once control goes back to the user and the user replies, the template will replay the tool calls but not the thinking between them.
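To make that condition concrete, here is a rough plain-Python sketch of the same check. It is not the real template: it assumes `ns_turn.last_user_idx` is simply the index of the most recent user message (as the name suggests), and the function names and the sample conversation are made up for illustration.

```python
def last_user_index(messages):
    """Index of the most recent user message, or -1 if there is none."""
    for i in range(len(messages) - 1, -1, -1):
        if messages[i]["role"] == "user":
            return i
    return -1


def gemma_style_thinking_kept(messages):
    """Return the indexes of messages whose reasoning would be rendered.

    Mirrors the quoted condition: reasoning is present, the message comes
    after the last user message, and the message carries tool calls.
    """
    last_user = last_user_index(messages)  # assumed meaning of last_user_idx
    kept = []
    for i, msg in enumerate(messages):
        thinking = msg.get("reasoning") or msg.get("reasoning_content")
        if thinking and i > last_user and msg.get("tool_calls"):
            kept.append(i)
    return kept


# Hypothetical two-turn conversation with one tool call per turn.
messages = [
    {"role": "user", "content": "List the files."},
    {"role": "assistant", "reasoning": "I should call ls.", "tool_calls": [{"name": "ls"}]},
    {"role": "tool", "content": "a.txt"},
    {"role": "assistant", "content": "There is one file: a.txt"},
    {"role": "user", "content": "Now read it."},
    {"role": "assistant", "reasoning": "I should call cat.", "tool_calls": [{"name": "cat"}]},
    {"role": "tool", "content": "hello"},
]

print(gemma_style_thinking_kept(messages))  # [5]
```

Only the tool-calling message from the current turn (index 5) keeps its thinking; the reasoning at index 1 is dropped because a later user message exists.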

Here's Qwen 3.6 by comparison: https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/chat_t...

        {%- if (preserve_thinking is defined and preserve_thinking is true) or (loop.index0 > ns.last_query_index) %}
            {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content + '\n</think>\n\n' + content }}
        {%- else %}
            {{- '<|im_start|>' + message.role + '\n' + content }}
        {%- endif %}

It additionally has a preserve_thinking flag that you can set. If that's set, it will include the thinking from all turns in the text passed to the model. But you do have to set it; it's not the default.
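The same branch, sketched in plain Python for a single assistant message. This only mirrors the lines quoted above; the real template does more around it, and `render_assistant_turn` and the sample message are hypothetical names for illustration.

```python
def render_assistant_turn(message, index, last_query_index, preserve_thinking=False):
    """Render one assistant message the way the quoted Qwen branch does.

    Thinking is included when preserve_thinking is set, or when the message
    comes after the last user query; otherwise only the content is rendered.
    """
    content = message.get("content", "")
    reasoning_content = message.get("reasoning_content", "")
    if preserve_thinking or index > last_query_index:
        return (
            "<|im_start|>" + message["role"] + "\n<think>\n"
            + reasoning_content + "\n</think>\n\n" + content
        )
    return "<|im_start|>" + message["role"] + "\n" + content


# An assistant message from an earlier turn (index 1, last query at index 2).
old_turn = {"role": "assistant", "reasoning_content": "2 + 2 is 4.", "content": "4"}

default_text = render_assistant_turn(old_turn, index=1, last_query_index=2)
preserved_text = render_assistant_turn(
    old_turn, index=1, last_query_index=2, preserve_thinking=True
)

print(repr(default_text))
print(repr(preserved_text))
```

With the default, the old turn renders without any `<think>` block; with the flag set, the reasoning is replayed even though the turn is before the last query.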

It's possible to modify the Jinja file that you're using with a model. Some people do that with models that haven't been specifically trained for it and report good results, but others report that, because the model wasn't trained for that, they get worse results if they include thinking from previous turns.

So for models like Gemma, you would have to modify the default Jinja template to enable this. For Qwen, you can just set the preserve_thinking flag to get this behavior, and apparently they have trained in this mode, so you get better results than with models that have not been trained this way.
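How you set the flag depends on your inference server. As a sketch, some OpenAI-compatible servers accept a `chat_template_kwargs` field in the request body and forward it to the template; whether yours does is an assumption you would need to check against its docs. The model name below is a placeholder, and nothing is actually sent.

```python
import json


def build_chat_request(messages, model, preserve_thinking):
    """Build a /v1/chat/completions body that passes a template variable.

    Assumes the server forwards `chat_template_kwargs` to the Jinja template;
    not every server supports this field.
    """
    return {
        "model": model,
        "messages": messages,
        "chat_template_kwargs": {"preserve_thinking": preserve_thinking},
    }


payload = build_chat_request(
    messages=[{"role": "user", "content": "Hello"}],
    model="my-local-qwen",  # placeholder model name
    preserve_thinking=True,
)
body = json.dumps(payload)
print(body)
```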


2 comments · 1 top-level

carterschonwald · 8d ago · 1 in thread

My heavily modified test-bed fork of oh my pi fixes this.

lambda (OP) · 8d ago

How could the harness fix this? It's the Jinja template used by the inference engine that renders the API requests into the raw text that gets tokenized and completed by the model. The exception would be if you're using something like the raw completions API instead of the `/v1/chat/completions` API and effectively applying the template yourself, in which case you could also just modify the Jinja template on your server.
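For reference, "applying the template yourself" would look roughly like the sketch below: the client renders the whole conversation to a string, keeping reasoning on every assistant turn, and sends it as a raw completion prompt. The ChatML-style `<|im_start|>`/`<|im_end|>` framing, the trailing assistant header, the function names, and the model name are all assumptions for illustration, not any particular harness's implementation.

```python
def render_prompt(messages):
    """Hypothetical ChatML-like renderer that keeps reasoning on every turn."""
    parts = []
    for m in messages:
        text = m.get("content", "")
        reasoning = m.get("reasoning_content")
        if m["role"] == "assistant" and reasoning:
            text = "<think>\n" + reasoning + "\n</think>\n\n" + text
        parts.append("<|im_start|>" + m["role"] + "\n" + text + "<|im_end|>\n")
    # Open a new assistant turn for the model to complete.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def build_completion_request(messages, model):
    """Body for a raw completions endpoint: the prompt is already rendered."""
    return {"model": model, "prompt": render_prompt(messages)}


history = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "reasoning_content": "Simple addition.", "content": "4"},
    {"role": "user", "content": "And times 3?"},
]
request = build_completion_request(history, model="my-local-model")  # placeholder name
print(request["prompt"])
```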

Anyhow, I've heard mixed reports about any method of supplying reasoning traces beyond the current turn to models not trained on them. For some models, I've heard that it works fine this way; for others, I've heard it degrades performance. But I don't know of anyone who has any kind of reliable benchmark for how well this works.
