Joel Spolsky, stackoverflow.com founder, Talk at Yale: Part 1 of 3 https://www.joelonsoftware.com/2007/12/03/talk-at-yale-part-...
That's still the best way to turn a spec into a program and comes with all the downsides it entails.
First, collect the following information from user: …. Second, send http request to the following endpoint with the certain payload…. If server returned error - report back to user.
It cracks me up every time I see that kind of stuff. Why on Earth wouldn't you just write a script for that purpose? 10x faster, zero tokens burned, 100% deterministic.
- Because your bash-fu may not be good enough
- Because parts of the process may not be amenable to scripting, especially if they require LLMs
- Because the inputs to some steps are fuzzy enough that only an LLM can handle them
- etc...
That being said, yes, anything amenable to being turned into scripts should be.
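For the part of that workflow that is amenable to scripting, the script can be quite small. The following is a minimal sketch, assuming a JSON endpoint; the function names, the `opener` parameter and the example URL are hypothetical and exist only for illustration.

```python
import json
import urllib.error
import urllib.request


def build_request(endpoint, fields):
    """Turn the collected fields into a JSON POST request."""
    body = json.dumps(fields).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(endpoint, fields, opener=urllib.request.urlopen):
    """Send the payload and return a message suitable for showing to the user.

    `opener` is an assumption of this sketch: it lets a fake transport be
    substituted so the error handling can be exercised without a network.
    """
    request = build_request(endpoint, fields)
    try:
        with opener(request) as response:
            return f"ok: HTTP {response.status}"
    except urllib.error.HTTPError as err:  # must precede URLError (it is a subclass)
        return f"server returned error: HTTP {err.code}"
    except urllib.error.URLError as err:
        return f"could not reach server: {err.reason}"


if __name__ == "__main__":
    # Hypothetical endpoint; replace with the real one.
    print(submit("http://example.invalid/api", {"name": "Ada"}))
```

The fuzzy step (working out what the user actually meant) stays outside this function; only the request and the error report are deterministic.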
import Mathlib
def Goldbach := ∀ x : ℕ, Even x → x > 2 → ∃ (y z: ℕ), Nat.Prime y ∧ Nat.Prime z ∧ x = y + z
A short specification for the proof of the Goldbach conjecture in Lean. Much harder to implement though. Implementation details are always hidden by the interface, which makes it easier to specify than produce. The Curry-Howard correspondence means that Joel's position here is that any question is as hard to ask as answer, and any statement as hard to formulate as it is to prove, which is really just saying that all describable statements are true.
theorem goldbach : Goldbach := *message truncated*
Advice given to Henry Ford’s lawyer, Horace Rackham, by an unnamed president of Michigan Savings Bank in 1903.
The idea, IIUC, seems to be that instead of directly telling an LLM agent how to change the code, you keep markdown "spec" files describing what the code does and then the "codespeak" tool runs a diff on the spec files and tells the agent to make those changes; then you check the code and commit both updated specs and code.
It has the advantage that the prompts are all saved along with the source rather than lost, and in a format that lets you also look at the whole current specification.
The limitation seems to be that you can't modify the code yourself if you want the spec to reflect it (and also can't do LLM-driven changes that refer to the actual code), and also that in general it's not guaranteed that the spec actually reflects all important things about the program, so the code does also potentially contain "source" information (for example, maybe you want the background of a GUI to be white and it is so because the LLM happened to choose that, but it's not written in the spec).
The latter can maybe be mitigated by doing multiple generations and checking them all, but that multiplies LLM and verification costs.
Also it seems that the tool severely limits the configurability of the agentic generation process, although that's just a limitation of the specific tool.
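The "diff the spec, hand the diff to the agent" loop described above can be sketched with the standard library. This is an illustration of the idea only, not how the codespeak tool is implemented; `spec_diff`, `change_request` and the prompt wording are hypothetical.

```python
import difflib


def spec_diff(old_spec: str, new_spec: str, name: str = "spec.md") -> str:
    """Return a unified diff between two versions of a markdown spec."""
    lines = difflib.unified_diff(
        old_spec.splitlines(),
        new_spec.splitlines(),
        fromfile=f"a/{name}",
        tofile=f"b/{name}",
        lineterm="",
    )
    return "\n".join(lines)


def change_request(diff: str) -> str:
    """Wrap a spec diff in an instruction for a coding agent (wording is made up)."""
    if not diff:
        return ""  # nothing changed, so there is nothing to ask for
    return (
        "The specification changed as follows. "
        "Update the code to match and touch nothing else:\n\n" + diff
    )


old = "# Greeter\n- Print a greeting.\n"
new = "# Greeter\n- Print a greeting.\n- The window background is white.\n"
print(change_request(spec_diff(old, new)))
```

Note that the white-background requirement only reaches the agent once someone writes it into the spec, which is exactly the gap the comment points out.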
Eventually, we'll end up in a world where humans don't need to touch code, but we are not there yet. We are looking into ways to "catch up" the specs with whatever changes happen in the code not through CodeSpeak (agents or manual changes or whatever). It's an interesting exercise. In the case of agents, it's very helpful to look at the prompts users gave them (we are experimenting with inspecting the sessions from ~/.claude).
More generally, `codespeak takeover` [1] is a tool to convert code into specs, and we are teaching it to take prompts from agent sessions into account. Seems very helpful, actually.
I think it's a valid use case to start something in vibe coding mode and then switch to CodeSpeak if you want long-term maintainability. From "sprint mode" to "marathon mode", so to speak
Will we though? Wouldn't AI need to reach a stage where it is a tool, like a compiler, which is 100% deterministic?
It also seems to be closed-source, which means that unless they open the source very soon it will very likely be immediately replaced in popularity by an open source version if it turns out to gain traction.
Cool idea overall, an incremental pseudocode compiler. Interesting to see how well it scales.
I can also see a hybrid solution with non-specced code files for things where the size of code and spec would be the same, like for enums or mapping tables.
Working on that as well. We need to be a lot more flexible and configurable
* This isn't a language, it's some tooling to map specs to code and re-generate
* Models aren't deterministic - every time you would try to re-apply you'd likely get different output (without feeding the current code into the re-apply and let it just recommend changes)
* Models are evolving rapidly, this month's flavour of Codex/Sonnet/etc would very likely generate different code from last month's
* Text specifications are always under-specified, lossy and tend to gloss over a huge amount of details that the code has to make concrete - this is fine in a small example, but in a larger code base?
* Every non-trivial codebase would be made up of hundreds of specs that interact and influence each other - very hard (and context-heavy) to read all specs that impact functionality and keep it coherent
I do think there are opportunities in this space, but what I'd like to see is:
* write text specifications
* model transforms text into a *formal* specification
* then the formal spec is translated into code which can be verified against the spec
The second and third steps could be merged into one if there were practical/popular languages that also support verification, in the vein of Ada/SPARK.
But you can also get there by generating tests from the formal specification that validate the implementation.
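As a rough illustration of generating tests from a specification, here is a sketch in which the "formal spec" is just an executable predicate and candidate implementations are checked against it on random inputs. Sorting is a stand-in example, and all names are hypothetical. A real formal specification language would offer much stronger guarantees than random testing does.

```python
import random
from collections import Counter


def satisfies_sort_spec(inp, out):
    """Spec: the output is ordered and is a permutation of the input."""
    ordered = all(a <= b for a, b in zip(out, out[1:]))
    same_items = Counter(inp) == Counter(out)
    return ordered and same_items


def check_against_spec(impl, trials=200, seed=0):
    """Return a counterexample input, or None if every trial met the spec."""
    rng = random.Random(seed)
    for _ in range(trials):
        data = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        if not satisfies_sort_spec(data, impl(list(data))):
            return data
    return None


def insertion_sort(xs):
    """A correct implementation."""
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out


def buggy_sort(xs):
    """Looks plausible but silently drops duplicates."""
    return sorted(set(xs))


print(check_against_spec(insertion_sort))  # None: no violation found
print(check_against_spec(buggy_sort))      # a list containing a duplicate
```

The generated code can differ between runs as long as the checker keeps passing, which is the point the replies below argue about.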
If the result is always provably correct it doesn't matter whether or not it's different at the code level. People interested in systems like this believe that the outcome of what the code does is infinitely more important than the code itself.
Since nobody involved actually cares whether the code works or not, it doesn't matter whether it's a different wrong thing each time.
If the spec is so complete that it covers everything, you might as well write the code.
The benefit of writing a spec and having the LLM code it, is that the LLM will fill in a lot of blanks. And it is this filling in of blanks that is non-deterministic.
- I bootstrap AGENTS.md with my basic way of working and occasionally one or two project specific pieces
- I then write a DESIGN.md. How detailed or well specified it is varies from project to project: the other day I wrote a very complete DESIGN.md for a time tracking, invoice management and accounting system I wanted for my freelance biz. Because it was quite complete, the agent almost one-shot the whole thing
- I often also write a TECHNICAL-SPEC.md of some kind. Again how detailed varies.
- Finally I link to those two from the AGENTS. I also usually put in AGENTS that the agent should maintain the docs and keep them in sync with newer decisions I make along the way.
This system works well for me, but it's still very ad hoc and definitely doesn't follow any kind of formally defined spec standard. And I don't think it should, really? IMO, technically strict specs should be in your automated tests not your design docs.
I found it works very well in once-off scenarios, but the specs often drift from the implementation. Even if you let the model update the spec at the end, the next few work items will make parts of it obsolete.
Maybe that's exactly the goal that "codespeak" is trying to solve, but I'm skeptical this will work well without more formal specifications in the mix.
I have the same basic workflow as you outlined, then I feed the docs into blackbird, which generates a structured plan with tasks and subtasks. Then you can have it execute tasks in dependency order, with options to pause for review after each task or an automated review when all child tasks for a given parent are complete.
It’s definitely still got some rough edges but it has been working pretty well for me.
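Executing a plan in dependency order is a topological sort. The sketch below uses Python's standard `graphlib`; the task names and the `execute`/`review` callbacks are made up for illustration and say nothing about how blackbird itself works.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it (hypothetical plan).
plan = {
    "design schema": set(),
    "write migrations": {"design schema"},
    "implement API": {"write migrations"},
    "write API tests": {"implement API"},
    "build invoice UI": {"implement API"},
}


def execution_order(plan):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(plan).static_order())


def run(plan, execute, review=None):
    """Run each task in order, optionally pausing for review after each one."""
    for task in execution_order(plan):
        execute(task)
        if review is not None:
            review(task)


run(plan, execute=lambda task: print("running:", task))
```

If the plan contains a cycle, `static_order()` raises `graphlib.CycleError`, which is a cheap sanity check on a generated plan.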
Is that really true? I haven’t tried to do my own inference since the first Llama models came out years ago, but I am pretty sure it was deterministic: if you fixed the seed and the input was the same, the output of the inference was always exactly the same.
1.) There is typically a temperature setting (even when not exposed, most major providers have stopped exposing it [esp in the TUIs]).
2.) Then, even with the temperature set to 0, it will be almost deterministic but you'll still observe small variations due to the limited precision of float numbers.
Edit: thanks for the corrections
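Both points can be reproduced without a model. The sketch below is a toy sampler, not any provider's implementation: with a fixed seed, temperature sampling repeats exactly, and at temperature 0 it reduces to taking the largest logit. The last lines show one commonly cited source of residual variation: floating-point addition is not associative, so a sum computed in a different order can differ in its last bits.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick a token index from logits; temperature 0 means greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(seed, steps, logits=(2.0, 1.5, 0.3), temperature=0.8):
    """Sample `steps` tokens from fixed logits with a seeded generator."""
    rng = random.Random(seed)
    return [sample_token(logits, temperature, rng) for _ in range(steps)]


# Same seed and same input give the same output.
print(generate(42, 10) == generate(42, 10))

# Summation order changes the floating-point result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right, left, right)
```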
So like when you give the same spec to 2 different programmers.
I use Kiro IDE (≠ Kiro CLI) primarily as a spec generator. In my experience, it's high-quality for creating and iterating on specs. Tools like Cursor are optimized for human-driven vibing -- they have great autocomplete, etc. Kiro, by contrast, is optimized around spec, which ironically has been the most effective approach I've found for driving agents.
I'd argue that Cursor, Antigravity, and similar tools are optimized for human steering, which explains their popularity, while Kiro is optimized for agent harnesses. That's also why it’s underused: it's quite opinionated, but very effective. Vibe-coding culture isn't sold on spec driven development (they think it's waterfall and summarily dismiss it -- even Yegge has this bias), so people tend to underrate it.
Kiro writes specs using structured formats like EARS and INCOSE (which is the spec format used in places like Boeing for engineering reqs). It performs automated reasoning to check for consistency, then generates a design document and task list from the spec -- similar to what Beads does. I usually spend a significant amount of time pressure-testing the spec before implementing (often hours to days), and it pays off. Writing a good, consistent spec is essentially the computer equivalent of "writing as a tool of thought" in practice.
Once the spec is tight, implementation tends to follow it closely. Kiro also generates property-based tests (PBTs) using Hypothesis in Python, inspired by Haskell's QuickCheck. These tests sweep the input domain and, when combined with traditional scenario-based unit tests, tend to produce code that adheres closely to the spec. I also add a small instruction "do red/green TDD" (I learned this from Simon Willison) and that one line alone improved the quality of all my tests. Kiro can technically implement the task list itself, but this is where agents come in. With the spec in hand, I use multiple headless CLI agents in tmux (e.g., Kiro CLI, Claude Code) for implementation. The results have been very good. With a solid Kiro spec and task list, agents usually implement everything end-to-end without stopping -- I haven’t found a need for Ralph loops. (agents sometimes tend to stop mid way on Claude plans, but I've never had that happen with Kiro, not sure why, maybe it's the checklist, which includes PBT tests as gates).
didn't have the strongest start, but the Kiro IDE is one of the best spec generators I've used, and it integrates extremely well with agent-driven workflows.
>* write text specifications
>* model transforms text into a formal specification
>* then the formal spec is translated into code which can be verified against the spec
This skill does just that: https://github.com/doubleuuser/rlm-workflow
Each stage produces its own output artifact (analysis, implementation plan, implementation summary, etc) and takes the previous phases' outputs as input. The artifact is locked after the stage is done, so there is no drift.
formal specification is no different from code: it will have bugs :)
There's no free lunch here: the informal-to-formal transition (be it words-to-code or words-to-formal-spec) comes through the non-deterministic models, period.
If we want to use the immense power of LLMs, we need to figure out a way to make this transition good enough
Slightly sarcastic but not sure this couldn't become a thing.
You're telling me that I should be doing the agonizing parts in order for the LLM to do the routine part (transforming a description of a program into a formal description of a program.) Your list of things that "make no sense" are exactly the things that I want the LLMs to do. I want to be able to run the same spec again and see the LLM add a feature that I never expected (and wasn't in the last version run from the same spec) or modify tactics to accomplish user goals based on changes in technology or availability of new standards/vendors.
I want to see specs that move away from describing the specific functionality of programs altogether, and more into describing a usefulness or the convenience of a program that doesn't exist. I want to be able to feed the LLM requirements of what I want a program to be able to accomplish, and let the LLM research and implement the how. I only want to have to describe constraints i.e. it must enable me to be able to do A, B, and C, it must prevent X,Y, and Z; I want it to feel free to solve those constraints in the way it sees fit; and when I find myself unsatisfied with the output, I'll deliver it more constraints and ask it to regenerate.
Be careful what you wish for. This sounds great in theory but in practice it will probably mean a migration path for the users (UX changes, small details changed, cost dynamics and a large etc.)
https://codespeak.dev/blog/greenfield-project-tutorial-20260...
It is a formal "way" aka like using json or xml like tons of people are already doing.
I'm not sure adding a more formal language interface makes sense, as these models are optimized for conversational fluency. It makes more sense to me for them to be given instructions for using more formal interfaces as needed.
I'm writing a language spec for an LLM runner that has the ability to chain prompts and hooks into workflows.
https://github.com/AlexChesser/ail
I'm writing the tool as proof of the spec. Still very much a pre-alpha phase, but I do have a working POC in that I can specify a series of prompts in my YAML language and execute the chain of commands in a local agent.
One of the "key steps" that I plan on designing is specifically an invocation interceptor. My underlying theory is that we would take whatever random series of prose that our human minds come up with and pass it through a prompt refinement engine:
> Clean up the following prompt in order to convert the user's intent
> into a structured prompt optimized for working with an LLM
> Be sure to follow appropriate modern standards based on current
> prompt engineering research. For example, limit the use of persona
> assignment in order to reduce hallucinations.
> If the user is asking for multiple actions, break the prompt
> into appropriate steps (**etc...)
That interceptor would then forward the well structured intent-parsed prompt to the LLM. I could really see a step where we say "take the crap I just said and turn it into CodeSpeak"
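The interceptor idea amounts to a chain of prompt-to-prompt steps with a refinement step placed first. A toy sketch follows, with stand-in functions instead of real model calls; none of these names come from the ail project, and the "refinement" here is a trivial split on semicolons purely so the example runs.

```python
from typing import Callable, List

Step = Callable[[str], str]


def refine_prompt(raw: str) -> str:
    """Stand-in for the refinement step: turn loose prose into numbered steps."""
    parts = [p.strip() for p in raw.split(";") if p.strip()]
    return "\n".join(f"{i}. {p}" for i, p in enumerate(parts, 1))


def fake_llm(prompt: str) -> str:
    """Stand-in for the model that receives the structured prompt."""
    return f"[model received {len(prompt.splitlines())} step(s)]"


def run_chain(prompt: str, steps: List[Step]) -> str:
    """Feed the output of each step into the next one."""
    for step in steps:
        prompt = step(prompt)
    return prompt


print(run_chain("add a login form; write tests; update the docs",
                [refine_prompt, fake_llm]))
```

In a real runner the first step would itself be a model call using a prompt like the one quoted above, so the chain stays non-deterministic end to end.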
What a fantastic tool. I'll definitely do a deep dive into this.
I presume this is temporary since the project is still in alpha, but I'm curious why this requires use of an API at all and what's special about it that it can't leverage injecting the prompt into a Claude Code or other LLM coding tool session.
[0]: https://codespeak.dev/blog/greenfield-project-tutorial-20260...
And whatever codespeak offers is like a weird VCS wrapper around this. I can already version and diff my skills, plans properly and following that my LLM generated features should be scoped properly and be worked on in their own branches. This imo will just give rise to a reason for people to make huge 8k-10k line changes in a commit.
I'm still getting used to the idea that modern programs are 30 lines of Markdown that get the magic LLM incantation loop just right. Seems like you're in the same boat.
* Yes, this is a language; no, it's not a programming language you are used to, but a restricted/embellished natural language that (might) make things easier to express to an LLM, and provides a framework for humans who want to write specifications to get the AI to write code.
* Models aren't deterministic, but they are persistent (never gonna give up!). If you generate tests from your specification as well as code, you can use differential testing to get some measure (although not perfect) of correctness. Never delete the code that was generated before, if you change the spec, have your model fix the existing code rather than generate new code.
* Specifications can actually be analyzed by models to determine if they are fully grounded or not. An ungrounded specification is going to not be a good experience, so ask the model if it thinks your specification is grounded.
* Use something like a build system if you have many specs in your code repository and you need to keep them in sync. Spec changes -> update the tests and code (for example).
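The build-system point in the last bullet can be sketched as content hashing: record a digest per spec at build time and treat any spec whose digest changed as stale. This is a minimal illustration with hypothetical names, not a feature of any tool mentioned here.

```python
import hashlib


def digest(text: str) -> str:
    """Content hash of a spec file."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_specs(specs: dict, manifest: dict) -> list:
    """Names of specs whose content differs from what was last built."""
    return sorted(
        name for name, text in specs.items() if manifest.get(name) != digest(text)
    )


def record_build(specs: dict, manifest: dict) -> None:
    """Remember the spec versions that tests and code were generated from."""
    for name, text in specs.items():
        manifest[name] = digest(text)


specs = {"auth.md": "Users log in with email.", "billing.md": "Invoices are monthly."}
manifest = {}
print(stale_specs(specs, manifest))   # both specs are new, so both are stale
record_build(specs, manifest)
specs["auth.md"] = "Users log in with email or SSO."
print(stale_specs(specs, manifest))   # only the edited spec needs regeneration
```

Only the stale specs would then be handed to the model, with the existing code and tests as input, in line with the advice above to fix code rather than regenerate it.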
That works great in practice, Gherkin even has a markdown dialect [1].
If you combine it with a tool like aico [2] you can have a really effective development workflow.
[1] https://github.com/cucumber/gherkin/blob/main/MARKDOWN_WITH_...
Yes, and the implementation... no one actually cares about that. This would be a good outcome in my view. What I see is people letting LLMs "fill in the tests", whereas I'd rather tests be the only thing humans write.
There has been a profession in place for many decades that specifically addresses that...Software Engineering.
The demo I've briefly seen was very very far from being impressive.
Got rejected, perhaps for some excessive scepticism/overly sharp questions.
My scepticism remains - so far it looks like an orchestrator to me and does not add enough formalism to actually call it a language.
I think that the idea of more formal approach to assisted coding is viable (think: you define data structures and interfaces but don't write function bodies, they are generated, pinned and covered by tests automatically, LLMs can even write TLA+/formal proofs), but I'm kinda sceptical about this particular thing. I think it can be made viable but I have a strong feeling that it won't be hard to reproduce that - I was able to bake something similar in a day with Claude.
Definitely won't use it for prod ofc but may try it out for a side-project.
It seems that this is more or less:
- instead of modules, write specs for your modules
- on the first go it generates the code (which you review)
- later, diffs in the spec are translated into diffs in the code (the code is *not* fully regenerated)
this actually sounds pretty usable, esp. if someone likes writing. And wherever you want to dive deep, you can delve down into the code and do "microoptimizations" by rolling something on your own (with what seems to be called here "mixed projects"). That said, not sure if I need a separate tool for this, tbh, instead of just having markdown files and telling Claude to see the md diff and adjust the code accordingly.
The other piece that has always struck me as a huge inefficiency with current usage of LLMs is the hoops they have to jump through to make sense of existing file formats - especially making sense of (or writing) complicated semi-proprietary formats like PDF, DOC(X), PPT(X), etc.
Long-term prediction: for text, we'll move away from these formats and towards alternatives that are designed to be optimal for LLMs to interact with. (This could look like variants of markdown or JSON, but could also be Base64 [0] or something we've not even imagined yet.)
https://www.zmescience.com/science/news-science/polish-effec...
Instead of imperatively letting the agents hammer your codebase into shape through a series of prompts, you declare your intent, observe the outcome and refine the spec.
The agents then serve as a control plane, carrying out the intent.
I'm hoping for a framework that expands upon Behavior Driven Development (BDD) or a similar project-management concept. Here's a promising example that is ripe for an Agentic AI implementation, https://behave.readthedocs.io/en/stable/philosophy/#the-gher...
i guess you can build a cli toolchain for it, but as a technique it’s a bit early to crystallize into a product imo, i fully expect overcoding to be a standard technique in a few years, it’s the only way i’ve been able to keep up with AI-coded files longer than 1500 lines
My quick understanding is that it isn't really trying to utilize any formal specification but is instead trying to more-clearly map the relationship between, say, an individual human-language requirement you have of your application, and the code which implements that requirement.
We are putting people out of work. Why not employ MORE people to do LESS, by sharing the responsibility? A group activity, perhaps?
Eg make room in this spec > program development workflow for, say, ... Tech Writers. Add them to the development team to ensure the language is right for the LLM ahead of time!
I've had good success getting LLMs to write complicated stuff in haskell, because at the end of the day I am less worried about a few errant LLM lines of code passing both the type checking and the test suite and causing damage.
It is both amazing and I guess also not surprising that most vibe coding is focused on python and javascript, where my experience has been that the models need so much oversight and handholding that it makes them a simple liability.
The ideal programming language is one where a program is nothing but a set of concise, extremely precise, yet composable specifications that the _compiler_ turns into efficient machine code. I don't think English is that programming language.
There you have it: Code laundering as a service. I guess we have to avoid Kotlin, too.
This is the same issue I've had with ORMs - I get that they make it easier to generate functionality at speed, but ultimately I want control over the biggest performance lever I have available to me.
Is it a code generator tool from specs? Ugh. Why not push for the development of the protocol itself then?
LLMs work on both translation steps. But you end up with a healthy amount of tests.
I tagged each test with the id of the spec so I do get spec-to-test coverage as well.
Besides the standard code coverage given by the tests.
For now, it's only about test coverage of the code, but the spec coverage is coming too.
Good enough that I don't review it.
Granted, it is a personal project that I care about only to the point that I want it to work. There is no money on the line. Nothing professional.
I believe that part of the secret is that I force CC to run the whole test suite after it changes ANY file, using hooks.
It makes iteration slower because it kinda forces it to go from green to green. Or better from red to less red (since we start in red).
But overall I am definitely happy with the results.
Again, personal projects. Not really professional code.
You write a markdown spec.
The script takes it and feeds it to an LLM API.
The API generates code.
Okay? Where is this "next-generation programming language" they talk about?
Or Lojban?
However, there is no case for more complicated, multi-file changes or architecture stuff.
Of course an expert would throw it out and design/write it properly so they know it works.
Also, the examples feel forced, as if you use external libraries, you don't have to write your own "Decode RFC 2047"
This feels wrong, as the spec doesn't consistently generate the same output.
But upon reflection, "source of truth" already refers to knowledge and intent, not machine code.
Actually, computers, being machines, do equate machine code and source of truth.
So for example, if you refactor a program, make the LLM do anything but keep the logic of the program intact.
> - Encoding auto-detection and normalization for beautifulsoup4
I was kinda expecting to see the name "chardet" pop up here. :-)
"[1] When computing LOC, we strip blank lines and break long lines into many"
I imagine this is before and after- not just after.
As in, they aren't just making lines long and removing whitespace (something models love to do when you ask it to remove lines of code)
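One way to read that footnote is as a line count that cannot be gamed by deleting whitespace or joining lines. The sketch below is a guess at such a metric, not the site's actual method; in particular the 80-column width is an assumption, since the footnote does not state one.

```python
import math


def normalized_loc(source: str, width: int = 80) -> int:
    """Count lines after dropping blank ones and wrapping long ones at `width`."""
    total = 0
    for line in source.splitlines():
        stripped = line.rstrip()
        if not stripped.strip():
            continue  # blank or whitespace-only lines do not count
        total += math.ceil(len(stripped) / width)
    return total


short = "a = 1\n\n\nb = 2\n"
squashed = "x = " + " + ".join(["value"] * 40)  # one very long line
print(normalized_loc(short))      # 2
print(normalized_loc(squashed))   # counted as several lines, not one
```

Applied to both the original code and the spec, a metric like this would penalise the "make lines long and strip whitespace" trick on either side.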
If we look at the history of programming languages, we see the idea of Templating occurring over and over again, in different contexts, i.e., C's macros, C++ Templates, embedding PHP code snippets into an otherwise mostly HTML file, etc., etc.
Templating can involve aspects of meta-code (code about the code), interpretation proxying (which engine/compiler/system/parser/program/subsystem/? is responsible for interpreting a given section of text), etc., etc.
Here we see this idea as another level of proxied/layered abstraction/indirection, in this case between an AI/LLM and the underlying source code...
Is this a good idea?
Will all code be written like this, using this pattern or a similar one, in the future?
I for one don't know (it's too early to tell!) but one thing is for sure, and that's that this new "layer" certainly contains an interesting set of ideas!
I will definitely be watching to see more about how this pattern plays out in future software development...
...and I obviously asked Gemini about it and it replied:
"A language optimized exclusively for Large Language Model (LLM) efficiency would prioritize Token Density, Context Window Management, and Architectural Alignment. It would not be binary, as standard LLM architectures (Transformers) process discrete tokens from a predefined vocabulary, not raw bits."
Example of it:
Feature      Human-Readable (Python/C++)          LLM-Native (Hypothetical)
--------------------------------------------------------------------------
Logic        if (x > 10) { return true; }         ¿x10†
Memory       int* ptr = malloc(sizeof(int));      §m4
Tokens Used  ~10-15                               2-3
...which seems to suggest that the authors themselves don't dogfood their own software. Please tell me that Codespeak was written entirely with Codespeak!
Instead of that json, which is so last year, why not use an agent to create an MD file to setup another agent, that will compile another MD file and feed it to the third agent, that... It is turtles, I mean agents, all the way down!
Instant tab close!
https://codespeak.dev/blog/greenfield-project-tutorial-20260...
Does this make it a 6th generation language?
`codespeak build` — takes the spec and turns it into code via LLM, like a non-deterministic compiler.
`codespeak takeover` — reads a file and creates a spec from it.
You can progressively opt in ("mixed mode") so it only touches files you allow it to (and makes new ones if needed).
Pros:
- Formalised version of the "agentic engineering" many are already doing, but might actually get people to store their specs and decisions in a concise way that seems more sane than committing your entire meandering chat session.
- Encouraging people to review spec and code side-by-side at a file level seems reasonable. Could even build an IDE/plugin around that concept to auto-load/navigate the spec and code side-by-side like their examples: https://codespeak.dev/shrink-factor/markitdown-eml. If tokens per second for popular models continues to improve, could even update the spec by hand and see the code regenerate live on the fly, perhaps via `codespeak watch`.
- Reduces the code you have to write by 5-10x. Largely by convincing you not to write it any more. Our graphics cards write the code for us in this timeline and many people are even happy about it.
- As models improve, could optionally re-run `build` against the same original spec. (Why do that if the output already produces the intended result and the test suite still passes? Presumably for simpler code. Or faster output. Or lower memory use. Or simply _different_ bugs.)
- Moves programming back toward structured thinking backed by a committed artifact and a solid two-word command you can run, instead of actively having conversations with far away GPUs like that's normal now.
- Could theoretically swap out the build target language if you grow to trust the build process to be your babelfish/specfish. Kind of Haxe with Markdown.
Cons:
- Seems to be gated by their login, can't bring your own model?
- Suspect the labs can all clone this concept very easily. `claude build` and `claude spec`?
The idea of a non-deterministic 'build' command had me cringing at first. But formalising a process many are using anyway that currently feels pretty sloppy perhaps isn't so terrible.
If nothing else, writing `build` is a lot quicker and maintains a whisker of self-respect. At least compared to typing, "please take this spec and adapt the Python accordingly" followed 2 minutes later by, "I updated the spec to deal with the edge-case you missed, try again but don't miss anything this time".
Programming is in the end math, the model is defined and, when done correctly follows common laws.
I know dark mode is really popular with the youngens but I regularly have to reach for reader mode for dark web pages, or else I simply cannot stand reading the contents.
Unfortunately, this site does not have an obvious way of reading it black-on-white, short of looking at the HTML source (CTRL+U), which - in fact - I sometimes do.
Sometimes a site will include a button or other UI element to choose a light theme but I find it odd that so many sites which are presumed to be designed by technically competent people, completely ignore accessibility concerns.
Definitely in the minority on this one as dark mode is really popular these days.
Really hard to describe how it is literally physically painful for my eyes. Very strange.
The site does describe it as a "programming language," which feels like a novel use of the term to me. The borders around a term like "programming language" are inherently fuzzy, but something like "code generation tool" better describes CodeSpeak IMHO.
Also, English is really too verbose and imprecise for coding, so we developed a programming language you can use instead.
Now, this gives me a business idea: are you tired of using CodeSpeak? Just explain your idea to our product in English and we'll generate CodeSpeak for you.
In the past maths were expressed using natural language, the math language exists because natural language isn't clear enough.
"In order to make machines significantly easier to use, it has been proposed (to try) to design machines that we could instruct in our native tongues. this would, admittedly, make the machines much more complicated, but, it was argued, by letting the machine carry a larger share of the burden, life would become easier for us. It sounds sensible provided you blame the obligation to use a formal symbolism as the source of your difficulties. But is the argument valid? I doubt."
"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."