I can't say anything about the OP method, but I already tested the smol-IQ2_XS quant (which has 2.46 BPW) with the pi harness. I did not do a very long session because token generation and prompt processing gets very slow, but I think I worked for up to ~70k context and it maintained a lot of coherence in the session. IIRC the GPQA diamond is supposed to exercise long chains of thought and it scored exceptionally well with 82% (the original BF16 official number is 88%:
https://huggingface.co/Qwen/Qwen3.5-397B-A17B).
Note that not all quants are the same at a certain BPW. The smol-IQ2_XS quant I linked is pretty dynamic, with some tensors having q8_0 type, some q6_k and some q4_k (while the majority is iq2_xs). In my testing, this smol-IQ2_XS quant is the best available at this BPW range.
Eventually I might try a more practical eval such as terminal bench.