I built Magpie because I was tired of AI code reviewers being too "nice."
Most AI tools just say "LGTM" or nitpick formatting. To fix this, Magpie uses an adversarial approach: it spawns two different AI agents (e.g., a Security Expert and a Performance Critic) and forces them to debate your changes.
They don't just list bugs; they attack each other's arguments until they reach a consensus. This cuts down on hallucinations and lazy approvals.
Features:
Adversarial Debate: Watch Claude and GPT-4o fight over your code.
Local & CI: Works on local files or GitHub PRs.
Model Agnostic: Supports OpenAI, Anthropic, and Gemini.
The Experiment: This is also an experiment in "coding without coding." I didn't write a single line of TypeScript for this project manually. The entire repo was built using Claude Code.
I'd love to hear your feedback—especially if you manage to make the models get into an infinite argument.
gdb showed that a critical pointer was garbage: 0x676974736e6f5373.
Usually, I’d suspect a race condition or a use-after-free. I stared at the hex for a while, checking for alignment issues or bit-flips, but it just looked like random entropy.
Out of frustration, I pasted the info locals dump into Gemini 3. I didn't ask it to fix the code; I just asked: "What do you see?"
It didn't try to analyze the C++ logic. Instead, it treated the address as data. It pointed out that on a little-endian x86-64 system, the bytes of 0x676974736e6f5373 read in memory order (lowest address first) decode cleanly to the ASCII string "sSonstig".
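For anyone who wants to reproduce the decoding step by hand, here is a minimal Python sketch. It only assumes the pointer value quoted above; the variable names are mine.

```python
# The garbage pointer value reported by gdb.
pointer_value = 0x676974736E6F5373

# x86-64 is little-endian: the least significant byte sits at the lowest
# address, so to_bytes(..., "little") reproduces the bytes in memory order.
raw = pointer_value.to_bytes(8, byteorder="little")

# Interpret those eight bytes as ASCII text.
decoded = raw.decode("ascii")
print(decoded)  # sSonstig
```

If decode("ascii") raises UnicodeDecodeError on a suspicious pointer, the value is probably not text, which is a cheap first check before digging into the allocator.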
It clicked immediately. "Sonstig" is the start of "Sonstiges", German for "Miscellaneous".
It turns out a legacy localization function was writing the category name "Sonstiges" into a stack buffer that was too small. It overflowed and perfectly overwrote the FiberManager pointer with the bytes of the word.
I think we often focus too much on LLMs for "Code Generation" (writing boilerplate). For me, the real killer feature is Pattern Recognition in raw data. I would have stared at that hex for hours seeing only noise; the model recognized the semantic meaning in milliseconds.
Has anyone else found LLMs useful specifically for decoding raw dumps or logs like this?
Six months in, the runtime performance is amazing, but our iteration speed is absolutely tanking.
It feels like we are paying a massive tax on every single feature. Just yesterday, I wasted an entire afternoon fighting CMake just to link a library that would have been a one-line go get or npm install in any other ecosystem. We also constantly deal with phantom bugs that turn out to be subtle ABI mismatches between our M1 Macs and the Linux CI runners—issues that simply don't exist in modern toolchains.
It’s frustrating because our "slower" competitors are shipping features weekly while we are stuck debugging linker errors or waiting for 20-minute clean builds.
I'm starting to wonder if the "performance moat" is a trap. For those who recently started infra projects: did you stick with C++? Did you bail for Rust/Go? Or do you just accept that velocity will be terrible in exchange for raw speed?
The Context: We maintain a distributed stateful engine (think search/analytics). The architecture is standard: a Control Plane (Coordinator) assigns data segments to Worker Nodes. The workload involves heavy use of mmap and lazy loading for large datasets.
The Incident: We had a cascading failure where the Coordinator got stuck in a loop, effectively DDoS-ing a specific node.
The Signal: Coordinator sees Node A has significantly fewer rows (logical count) than the cluster average. It flags Node A as "underutilized."
The Action: Coordinator attempts to rebalance/load new segments onto Node A.
The Reality: Node A is actually sitting at 197 GB of RAM usage (near OOM). The data on it happens to be extremely wide (fat rows, huge blobs), so its logical row count is low but its physical footprint is massive.
The Loop: Node A rejects the load (or times out). The Coordinator ignores the backpressure, sees the low row count again, and retries immediately.
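To make the loop concrete, here is a small simulation sketch. Everything in it is hypothetical (the Node class, the 200 GB limit, the 10 GB segment size, the doubling backoff); only the 197 GB figure and the "fewest rows wins" signal come from the incident. It contrasts a coordinator that ignores rejections with one that puts a rejecting node on an exponentially growing cool-down.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    row_count: int       # logical rows: the only signal the coordinator reads
    ram_used_gb: float   # physical footprint: invisible to the coordinator
    ram_limit_gb: float  # assumed limit, not from the incident

    def try_load(self, segment_gb: float) -> bool:
        """Reject the segment when it would not fit in memory."""
        return self.ram_used_gb + segment_gb <= self.ram_limit_gb

def pick_target(nodes, now, cooldown_until):
    """Pick the node with the fewest rows, skipping nodes that are cooling down."""
    eligible = [n for n in nodes if cooldown_until.get(n.name, 0) <= now]
    return min(eligible, key=lambda n: n.row_count) if eligible else None

def simulate(nodes, ticks, honor_backpressure):
    """Return a list of (tick, node name, accepted) placement attempts."""
    cooldown_until = {}  # node name -> first tick at which it is eligible again
    backoff = {}         # node name -> current backoff, in ticks
    attempts = []
    for now in range(ticks):
        visible = cooldown_until if honor_backpressure else {}
        target = pick_target(nodes, now, visible)
        if target is None:
            continue
        accepted = target.try_load(segment_gb=10.0)
        attempts.append((now, target.name, accepted))
        if accepted:
            target.ram_used_gb += 10.0
            target.row_count += 1_000_000
            backoff.pop(target.name, None)
        elif honor_backpressure:
            # Double the cool-down on every rejection, capped at 64 ticks.
            backoff[target.name] = min(backoff.get(target.name, 1) * 2, 64)
            cooldown_until[target.name] = now + backoff[target.name]
    return attempts

def make_cluster():
    return [
        Node("A", row_count=2_000_000, ram_used_gb=197.0, ram_limit_gb=200.0),
        Node("B", row_count=50_000_000, ram_used_gb=80.0, ram_limit_gb=200.0),
        Node("C", row_count=55_000_000, ram_used_gb=90.0, ram_limit_gb=200.0),
    ]

naive = simulate(make_cluster(), ticks=8, honor_backpressure=False)
polite = simulate(make_cluster(), ticks=8, honor_backpressure=True)
print(sum(1 for _, name, _ in naive if name == "A"))   # attempts against A, naive
print(sum(1 for _, name, _ in polite if name == "A"))  # attempts against A, with cool-down
```

In the naive run every tick targets Node A and every attempt is rejected. With the cool-down, Node A is still probed occasionally, but the other nodes absorb the segments in between. This does not fix the bad signal; it only stops the retry storm.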
The Core Problem: We are trying to write a "God Equation" for our load balancer. We started with row_count, which failed. We looked at disk usage, but that doesn't correlate with RAM because of lazy loading.
Now we are staring at mmap. Because the OS manages the page cache, the application-level RSS is noisy and doesn't strictly reflect "required" memory vs "reclaimable" cache.
The Question: Trying to fold every resource variable (CPU, IOPS, RSS, disk, logical count) into a single scoring function feels like an NP-hard trap.
How do you handle placement in systems where memory usage is opaque/dynamic?
Dumb Coordinator, Smart Nodes: Should we just let the Coordinator blind-fire based on disk space, and rely 100% on the Node to return hard 429 Too Many Requests based on local pressure?
Cost Estimation: Do we try to build a synthetic "cost model" per segment (e.g., predicted memory footprint) and schedule based on credits, ignoring actual OS metrics?
Control Plane Decoupling: Should we separate storage balancing (disk) from query balancing (memory)?
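To show what I mean by the first two options, here is a rough sketch of how they could combine: the Coordinator schedules against a synthetic per-segment cost (credits), and the Node still has the final say via a 429. All names, the rows-times-row-width cost formula, and the 0.9 pressure threshold are assumptions for illustration, not what we run today.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    name: str
    rows: int
    avg_row_bytes: int

    def estimated_cost_gb(self) -> float:
        # Hypothetical cost model: predicted footprint = rows * average row width.
        return self.rows * self.avg_row_bytes / 1e9

@dataclass
class Worker:
    name: str
    credit_gb: float             # budget the coordinator believes is left
    local_pressure: float = 0.0  # 0..1, measured by the node itself (assumed metric)
    segments: list = field(default_factory=list)

    def admit(self, seg: Segment) -> int:
        """Node-side admission control: refuse under local memory pressure."""
        if self.local_pressure > 0.9:
            return 429  # Too Many Requests
        self.segments.append(seg.name)
        return 200

def place(seg: Segment, workers: list):
    """Offer the segment to workers in order of remaining credits."""
    cost = seg.estimated_cost_gb()
    for w in sorted(workers, key=lambda w: w.credit_gb, reverse=True):
        if w.credit_gb < cost:
            continue  # coordinator-side check against the synthetic budget
        if w.admit(seg) == 200:
            w.credit_gb -= cost  # only charge credits once the node accepts
            return w.name
        # On 429, fall through to the next candidate instead of retrying.
    return None

# Node A looks roomy on paper (stale credits) but is under real pressure.
a = Worker("A", credit_gb=150.0, local_pressure=0.98)
b = Worker("B", credit_gb=120.0, local_pressure=0.40)
wide = Segment("wide-blobs", rows=1_000_000, avg_row_bytes=50_000)  # ~50 GB

print(place(wide, [a, b]))  # B
```

The point of the split is that the cost model only has to be roughly right: when it is wrong (as with Node A here), the Node's 429 corrects it, and the Coordinator moves on to the next candidate rather than hammering the same node.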
Feels like we are reinventing the wheel. References to papers or similar architecture post-mortems appreciated.