Interested in the engineering challenges behind agent systems: planning vs tool-calling, state management, failure recovery, and making LLM-driven workflows actually reliable in practice.
Building in public and learning by breaking things.
Demos are easy: a task counts as finished once the happy path works. Real systems are messier: partial failures, retries, idempotency, unclear terminal states, and the question of when to stop or escalate.
For people who’ve built schedulers, agents, or other long-running systems: how do you define "done" in practice? Is it a state machine, invariants, timeouts, external signals, or just operational heuristics?
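To make the question concrete, here is a minimal Python sketch of one possible answer: an explicit state machine where "done" means reaching a terminal state, success is gated by an invariant check, and a retry budget plus a deadline decide when to stop and escalate. All names here (`State`, `Task`, `PermanentError`, `run`, `step`, `invariant_holds`) are hypothetical and only for illustration. It also assumes `step` is idempotent, so that retrying it is safe.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class State(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"  # terminal: invariant verified
    FAILED = "failed"        # terminal: retrying cannot help
    ESCALATED = "escalated"  # terminal: budget exhausted, hand off


TERMINAL = {State.SUCCEEDED, State.FAILED, State.ESCALATED}


class PermanentError(Exception):
    """Hypothetical marker for failures that retries will not fix."""


@dataclass
class Task:
    max_attempts: int = 3
    deadline_s: float = 60.0
    attempts: int = 0
    state: State = State.RUNNING


def run(
    task: Task,
    step: Callable[[], None],
    invariant_holds: Callable[[], bool],
    now: Callable[[], float] = time.monotonic,
) -> State:
    """Drive the task until it reaches a terminal state."""
    start = now()
    while task.state not in TERMINAL:
        # Stop conditions come first: deadline, then retry budget.
        if now() - start > task.deadline_s:
            task.state = State.ESCALATED
            break
        if task.attempts >= task.max_attempts:
            task.state = State.ESCALATED
            break

        task.attempts += 1
        try:
            step()  # assumed idempotent, so a retry is safe
        except PermanentError:
            task.state = State.FAILED
            continue
        except Exception:
            continue  # treated as transient: retry on the next pass

        # "Done" is decided by checking the outcome, not by step() returning.
        if invariant_holds():
            task.state = State.SUCCEEDED
    return task.state
```

In this sketch a step that returns without error is not enough to finish: the task only succeeds once `invariant_holds()` confirms the expected outcome, and everything else ends in `FAILED` or `ESCALATED`. Whether exhausting the retry budget should escalate or fail outright is a policy choice, not something the sketch settles.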