visarga 1mo ago

My own approach also puts intent at the top: intent justifies the plan, the plan justifies the code, and the code justifies the tests. Read the other way, tests satisfy the code, the code satisfies the plan, and the plan satisfies intent. Both threads, top-down and bottom-up, are validated by judge agents.

I also write each task as its own md file (task.md), so it can carry intent and plan rather than just checkbox-driven "- [ ]" gates. The gates get annotated with outcomes, and the file becomes a workbook after execution. The same task.md is seen twice by judge agents that run without extra context: the plan judge and the implementation judge.
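As a rough illustration of gates that carry outcomes, here is a minimal Python sketch. The file layout is an assumption, not the plugin's actual format: each gate is a "- [ ]" or "- [x]" line, optionally followed by an indented "outcome:" line. The names `Gate`, `parse_gates`, and `unannotated` are hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical gate syntax: "- [ ] text" or "- [x] text" at the start of a line.
GATE_RE = re.compile(r"^- \[( |x|X)\] (.+)$")


@dataclass
class Gate:
    text: str
    done: bool
    outcome: str = ""


def parse_gates(task_md: str) -> list[Gate]:
    """Collect checkbox gates and attach any "outcome:" line that follows one."""
    gates: list[Gate] = []
    for line in task_md.splitlines():
        match = GATE_RE.match(line)
        if match:
            gates.append(Gate(text=match.group(2).strip(), done=match.group(1) != " "))
        elif gates and line.strip().lower().startswith("outcome:"):
            # Assumed convention: the outcome annotates the most recent gate.
            gates[-1].outcome = line.split(":", 1)[1].strip()
    return gates


def unannotated(gates: list[Gate]) -> list[Gate]:
    """Gates that were checked off but never annotated with an outcome."""
    return [g for g in gates if g.done and not g.outcome]
```

A judge or a pre-commit check could use something like `unannotated` to flag gates that were ticked without recording what actually happened.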

I ran tests to see which component of my harness contributes the most, and it came out to be the judges. Apparently Claude Code can solve a task just as well with or without a task file, but having the task file makes plans and work more auditable, not just for bugs but for whether the work follows intent.

Coming back to user intent: I have a post-user-message hook that writes user messages to a project-scoped chat_log.md file, so all user messages are preserved. This is cheap because user text << agent text. When we start a new task, the chat log is checked to see whether intent was properly captured. I also use it to recover context across sessions and remember what we did last.
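A minimal sketch of what such a hook script could look like in Python. It assumes the hook hands the script a JSON payload whose `prompt` field holds the user message; that field name, the entry format, and the helper names `format_entry` and `handle_hook_payload` are assumptions for illustration, not the plugin's actual code.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def format_entry(message: str, when: datetime) -> str:
    """Render one user message as a timestamped markdown section."""
    stamp = when.strftime("%Y-%m-%d %H:%M UTC")
    return f"## {stamp}\n\n{message.strip()}\n\n"


def handle_hook_payload(raw_json: str, project_dir: Path,
                        when: Optional[datetime] = None) -> bool:
    """Append the user message from a hook payload to chat_log.md.

    Returns True if something was written, False for empty messages.
    The "prompt" key is an assumed payload field.
    """
    payload = json.loads(raw_json)
    message = payload.get("prompt", "")
    if not message.strip():
        return False
    when = when or datetime.now(timezone.utc)
    log_path = project_dir / "chat_log.md"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Append-only, so earlier messages are never rewritten.
    with log_path.open("a", encoding="utf-8") as log:
        log.write(format_entry(message, when))
    return True
```

In a real hook, the script would call something like `handle_hook_payload(sys.stdin.read(), Path.cwd())`. Keeping the log append-only is what makes it usable later as a record of intent.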

Once every 10-20 tasks I run a retrospective task that inspects all task.md files since the last retro and judges how the harness is performing and how the project is going. This can detect things that are not apparent in task-level work, for example when multiple tasks implement one more complex feature, or when a subsystem is touched by multiple tasks. I think reflection is the one place where the harness itself, and how we use it, can be refined.
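Selecting the input for a retro could be sketched as below. It assumes one directory per task, each containing a task.md, and uses file modification time as the "since last retro" marker. The layout, the `every=10` threshold, and the function names are hypothetical.

```python
from pathlib import Path


def tasks_since_last_retro(root: Path, last_retro_mtime: float) -> list[Path]:
    """Return task.md files modified after the last retro, oldest first."""
    candidates = [
        path for path in root.rglob("task.md")
        if path.stat().st_mtime > last_retro_mtime
    ]
    return sorted(candidates, key=lambda path: path.stat().st_mtime)


def retro_due(pending: list[Path], every: int = 10) -> bool:
    """True once enough tasks have accumulated to justify a retrospective."""
    return len(pending) >= every
```

The returned files are what a retrospective judge would read together, which is what lets it see patterns that span several tasks.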

    claude plugin marketplace add horiacristescu/claude-playbook-plugin

    source at https://github.com/horiacristescu/claude-playbook-plugin/tree/main


beshrkayali 1mo ago

The hierarchy you describe (intent -> plan -> code -> tests) maps well to how Ossature works. The difference is that your approach builds scaffolding around Claude Code to recover structure that chat naturally loses, whereas Ossature takes chat out of the generation pipeline entirely. Specs are the source of truth before anything is generated, so there's no drift to compensate for; the audit and build plan handle that upfront.

The judge finding is interesting, though. Right now, verification during build for each task in Ossature is command-based: compile, run tests, that kind of thing. A judge checking spec-to-code fidelity rather than (or maybe in addition to?) runtime correctness is worth thinking about.

visarga (OP) 1mo ago

Yes, judges should not just look for bugs, they should also validate that intent was followed, but that can only happen when intent was preserved. I chose to save the user messages as a compromise; they are probably 10x or 100x smaller than the full session. I think tasks themselves are one step lower than pure user intent. Anyway, if you didn't log user messages, you can still recover them from session files if those have not been removed.

One interesting data point: I compared the word count of my chat messages with that of the final code and they came out about 1:1, but in reality a programmer would type 10x the final code during development. From a different perspective, I have created 10x more projects since I started relying on Claude and my harness than before. So it looks like user intent is now 10x more effective than manual coding.
