What it does:
- Generates chain-of-thought reasoning traces from any LLM
- Uses counterfactual analysis to measure impact of each reasoning step
- Identifies critical sentences that make or break task completion
- Exports semantic embeddings for clustering analysis
- Provides systematic failure mode categorization
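The counterfactual step above boils down to ablating one sentence at a time and measuring how often the remaining trace still reaches the right answer. A rough sketch (the `solve` callback is a hypothetical stand-in for whatever completion call you use; pts's actual internals differ):

```python
def step_impact(sentences, solve, n_samples=8):
    """Estimate each reasoning step's impact by removing it and
    re-checking task success. `solve(trace)` is a stand-in for a
    model call that returns True when the completion from that
    trace reaches the correct final answer."""
    baseline = sum(solve(" ".join(sentences)) for _ in range(n_samples)) / n_samples
    impacts = []
    for i in range(len(sentences)):
        ablated = sentences[:i] + sentences[i + 1:]
        rate = sum(solve(" ".join(ablated)) for _ in range(n_samples)) / n_samples
        impacts.append(baseline - rate)  # large drop => critical sentence
    return impacts
```

Steps whose removal causes the biggest accuracy drop are the "thought anchors".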
Example use case:
I used PTS to compare Qwen3-0.6B vs DeepSeek-R1-Distill-1.5B on math problems and discovered they have fundamentally different reasoning architectures:
- DeepSeek: concentrated reasoning (fewer, high-impact steps)
- Qwen3: distributed reasoning (impact spread across multiple steps)
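One way to quantify "concentrated vs distributed" (not necessarily how pts does it) is the normalized entropy of the per-step impact scores:

```python
import math

def impact_entropy(impacts):
    """Normalized entropy of per-step impact scores: near 0 when a
    single step dominates (concentrated reasoning), near 1 when
    impact is spread evenly across steps (distributed reasoning)."""
    total = sum(impacts)
    if total == 0 or len(impacts) < 2:
        return 0.0
    probs = [x / total for x in impacts]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(impacts))
```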
Quick start:
# Generate thought anchors
pts run --model="your-model" --dataset="gsm8k" --generate-thought-anchors
# Export for analysis
pts export --format="thought_anchors" --output-path="analysis.jsonl"
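Since the export is JSONL, downstream analysis is a one-liner per record. The field names below ("sentence", "impact") are illustrative only; check the actual schema your pts version emits:

```python
import json

def load_anchors(path):
    """Stream (sentence, impact) pairs from an exported JSONL file.
    Field names here are assumptions, not the documented schema."""
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            yield rec.get("sentence"), rec.get("impact")
```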
The library implements the thought anchors methodology from Bogdan et al. (2025) with extensions for:
- Comprehensive metadata collection
- 384-dimensional semantic embeddings
- Causal dependency tracking
- Systematic failure analysis
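The 384-dimensional figure matches compact sentence-embedding models such as all-MiniLM-L6-v2. Once you have the exported vectors, similarity lookups need nothing beyond cosine distance; a minimal, dependency-free sketch (dimensionality is arbitrary in the code):

```python
import math

def nearest_anchor(query, anchors):
    """Return the index of the anchor embedding most similar to
    `query` by cosine similarity. Works for any dimensionality,
    e.g. the 384-d vectors described above."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    return max(range(len(anchors)), key=lambda i: cos(query, anchors[i]))
```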
Why this matters: Most interpretability tools focus on individual tokens or attention patterns. Thought anchors operate at the sentence level, revealing which complete reasoning steps actually matter for getting correct answers.
Limitations: Currently focused on mathematical reasoning tasks. Planning to extend to other domains and larger models.
Links:
- GitHub: https://github.com/codelion/pts
- Research example: https://huggingface.co/blog/codelion/understanding-model-rea...
- Generated datasets: Available on HuggingFace
Would appreciate feedback on extending this to other reasoning domains or interpretability approaches.
Google's recent Gemini 2.5 report introduced Deep Think - a technique where models generate multiple hypotheses in parallel and critique them before arriving at final answers. It achieves SOTA results on math olympiads and competitive coding benchmarks.
The plugin works by modifying the inference pipeline to explore multiple solution paths simultaneously, then synthesizing the best approach. Instead of single-pass generation, the model essentially runs an internal debate before responding.
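The generate-critique-synthesize loop above can be sketched as follows. This is my own minimal illustration of the pattern, not the plugin's code; `generate(prompt)` stands in for your actual model interface, and the 0-10 rating format is an assumption:

```python
def deep_think(prompt, generate, n=4):
    """Parallel-hypothesis pattern: sample several candidate answers,
    have the model critique and score each one, return the best.
    `generate` is a stand-in for a real completion call."""
    candidates = [generate(f"Solve: {prompt}") for _ in range(n)]

    def score(candidate):
        verdict = generate(f"Critique this answer to '{prompt}' "
                           f"and rate it 0-10:\n{candidate}")
        digits = [int(t) for t in verdict.split() if t.isdigit()]
        return digits[-1] if digits else 0  # fall back to 0 if no rating parses

    return max(candidates, key=score)
```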
Technical details:
- Works with any model that supports structured reasoning patterns
- Implements parallel thinking during response generation
- Particularly effective for complex reasoning tasks, math, and coding problems
- Increases inference time but significantly improves answer quality
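On the latency trade-off: when `generate` is an I/O-bound API or inference-server call rather than local compute, much of the extra wall-clock time can be hidden by issuing the hypothesis generations concurrently. A sketch (again my own illustration, not the plugin's implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_generate(prompt, generate, n=4):
    """Issue the n hypothesis generations concurrently so total
    latency approaches one generation plus the critique pass,
    instead of n+1 sequential passes. Results keep input order."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(generate, [prompt] * n))
```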
Link: https://github.com/codelion/optillm/tree/main/optillm/plugin...
Demo: https://www.youtube.com/watch?v=b06kD1oWBA4
The implementation won the Cerebras & OpenRouter Qwen 3 Hackathon, but more importantly, it's now available for anyone running local models.
Questions for HN:
- Has anyone tried similar parallel reasoning approaches with local models?
- What other proprietary techniques do you think would be valuable to open-source?
- Any suggestions for optimizing the performance trade-offs?
The goal is to democratize advanced reasoning capabilities that were previously locked behind APIs. Would love feedback on the approach and ideas for improvements.