What is the BENEFIT of all this?
Let's use Blockchain instead of a database - because we can.
Let's create a maze of microservices - because we can.
Let's make every function a lambda function - because we can.
Let's make AI write code, run it, verify it, fix it, then run it again - because we can.
Let's burn untold amounts of energy to do simple things - because we can.
What the article is proposing is making programming worse, for no apparent benefit for anyone except those who sell AI data center cycles.
Sure, every bit of f--ing around is research, but ROI is far from constant.
As an experiment, it's kind of cool. I'm kind of at a loss as to what useful software you'd build with it though. Surely once you've run the AI function once it would be much simpler to cache the resulting code than repeatedly re-generate it?
Can anyone think of any uses for this?
[0] e.g. something like the below which I expect to use maybe a dozen times total.
Main routine: In folder X are a bunch of ROM files (iso, bin, etc) and a JSON file with game metadata for each. Look for missing entries, and call [subroutine] once per file (can be called in parallel). When done, summarise the results (successes/failures) based on the now updated metadata.
Subroutine: (...) update XYZ, use metacritic to find metadata, fall back to Google.
Surely, you'll run a function that does an AI call to cache the resulting code.
(I'll admit that I've built a few "applications" exploring interaction descriptions with our Design team that do exactly this - but they were design explorations that, in effect, used the LLM to simulate a back-end. Glorious, but not shippable.)
1. Confirmable, predictable behavior (can we test it, can we make assurances to customers?).
2. Comparative performance (having an LLM call to extract from a list in 100s of ms instead of code in <10ms).
3. Operating costs. LLM calls are spendy. Just think of them as hyper-unoptimized lossy function executors (along with being lossy encyclopedias), and the work starts to approach bogo algorithm levels of execution cost for some small problems.
Buuuuuut.... I had working functional prototype explorations with almost no work on my end, in an hour.
We've now extended this thinking to some experience exploration builders, so it definitely has a place in the toolbox.
Eventually, perhaps. I've yet to see a use case for blockchains that isn't merely a worse facsimile of something already existing.
But the electron was useless when it was discovered, so maybe one day
nobody except for maybe NASA would make software in this scenario.
The reason it hasn’t taken off is that it’s a supremely bad and unmaintainable idea. It also just doesn’t work very well because the LLM doesn’t have access to the rest of the codebase without an agentic loop to ground it.
> You write a Python function with a natural language specification instead of implementation code. You attach post-conditions – plain Python assertions that define what correct output looks like.
Vs
> You write a Python function with ~~a natural language specification instead of~~ implementation code.
In many cases.
It might even be fun if the first call generates Python (or another language), and then subsequent calls go through it. This "optimized" or "compiled" natural language is "LLMJitted" into Python. With interesting tooling, you could then click on the implementation and see the generated code, a bit like looking at the generated assembly. Usually you'd just write in some hybrid Python + natural language, but have the ability to look deeper.
I can also imagine some additional tooling that keeps track of good implementations of ideas that have been validated. This could extend to the community. Package manager. Throw in TRL + web of trust and... this could be wild.
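To make the "LLMJitted" idea concrete, here's a minimal sketch. Everything in it is an assumption for illustration: `JitFunction` and `fake_generate` are made-up names, and `fake_generate` returns canned source instead of calling a model. The point is only the shape: generate on the first call, reuse afterwards, and keep the source around so you can look at it.

```python
from typing import Callable, Optional


def fake_generate(spec: str) -> str:
    """Hypothetical stand-in for a model call: returns Python source for a spec."""
    return "def impl(xs):\n    return sorted(set(xs))\n"


class JitFunction:
    """Generate code on the first call, then reuse the compiled result."""

    def __init__(self, spec: str, generate: Callable[[str], str]):
        self.spec = spec
        self.generate = generate
        self.source: Optional[str] = None  # generated code, kept for inspection
        self.generations = 0               # how many times the generator ran
        self._impl = None

    def __call__(self, *args, **kwargs):
        if self._impl is None:
            self.source = self.generate(self.spec)
            self.generations += 1
            namespace: dict = {}
            # NOTE: exec is unsandboxed here; real generated code needs isolation.
            exec(self.source, namespace)
            self._impl = namespace["impl"]
        return self._impl(*args, **kwargs)


unique_sorted = JitFunction(
    "Return the unique items of xs in ascending order.", fake_generate
)
first = unique_sorted([3, 1, 3, 2])   # triggers generation
second = unique_sorted([5, 5, 4])     # reuses the cached implementation
```

After both calls, `unique_sorted.source` holds the generated code (the "assembly" view) and `unique_sorted.generations` stays at 1.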
Really tricky functions that the LLM can't solve could be delegated back for human implementation.
In my experience it’s a huge leap in terms of the agent being able to test and debug functionality. It’ll often write small code snippets to test that individual functions work as expected.
I’m not just making this stuff up of course, got the idea yesterday after reading Karpathy’s tweet about Nanoclaws contribution model (don’t submit PRs with features, submit PRs that tell an LLM how to modify the program). Now I can’t concentrate on my day job. Can’t stop thinking about my little Elixir BEAM project.
However, I do resonate somewhat with the post if I think about some accounting processes.
Accounting is where I came from, and a lot of the data processing we do is mostly deterministic, with some "smartness" or judgement sprinkled in. Take bank reconciliation, for example: the basic process is to match bank statement lines with accounting entry lines. In practice, dates, descriptions, and amounts often mismatch between the two for various reasons (typos, grouped bookings, value date vs transaction date differences, truncated values). This impacts a lot of SMEs, and these basic accounting processes are still manual because they need eyeballing. Look at a typical back office Excel spreadsheet and you will understand this.
You can pre-program the matching rules up to a certain point until it becomes unmaintainable. Or you can use an LLM to generate data-dependent matching logic on the fly. I think there is a space for the latter approach, if we keep the scope tight and well contained. As with all engineering, it's about the trade-offs.
Useful targets for an LLM to generate can be subsets of SQL statements (create views and selects) or pure functions (Haskell?), where side effects are strictly limited and there is only data in - data out. I am toying with the SQL idea myself (GH: https://github.com/spoj/taskgraph).
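For a feel of where pre-programmed rules run out, here's a rough sketch of the deterministic part: match on exact amount within a small date window, and hand back whatever is left. The `Line` type, the three-day window, and the sample data are all made up for illustration; real reconciliation rules are messier.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class Line:
    ref: str
    day: date
    amount_cents: int  # integer cents avoid float rounding issues


def reconcile(
    bank: List[Line], ledger: List[Line], max_day_gap: int = 3
) -> Tuple[List[Tuple[Line, Line]], List[Line], List[Line]]:
    """Greedy match on exact amount and nearby date.

    Returns (matched pairs, unmatched bank lines, unmatched ledger lines).
    """
    remaining = list(ledger)
    pairs: List[Tuple[Line, Line]] = []
    unmatched_bank: List[Line] = []
    for b in bank:
        candidates = [
            entry for entry in remaining
            if entry.amount_cents == b.amount_cents
            and abs((entry.day - b.day).days) <= max_day_gap
        ]
        if candidates:
            # Prefer the ledger entry closest in date to the bank line.
            best = min(candidates, key=lambda entry: abs((entry.day - b.day).days))
            remaining.remove(best)
            pairs.append((b, best))
        else:
            unmatched_bank.append(b)
    return pairs, unmatched_bank, remaining


bank = [
    Line("B1", date(2024, 1, 5), 10000),
    Line("B2", date(2024, 1, 9), 2550),
    Line("B3", date(2024, 1, 12), 30000),  # one payment covering two invoices
]
ledger = [
    Line("L1", date(2024, 1, 4), 10000),
    Line("L2", date(2024, 1, 20), 2550),   # booked too late for the date window
    Line("L3", date(2024, 1, 12), 10000),
    Line("L4", date(2024, 1, 12), 20000),
]
pairs, unmatched_bank, unmatched_ledger = reconcile(bank, ledger)
```

Only B1/L1 pairs up. B2 (date drift) and B3 (a grouped booking of L3 + L4) fall through to the unmatched pile, which is exactly the part that needs judgement today.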
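One way to enforce "data in - data out" for generated SQL, sketched with Python's built-in `sqlite3` module: install an authorizer that only allows plain selects and column reads, so anything that writes is rejected when the statement is prepared. This is my own illustration, not how taskgraph works; the table, data, and function names are made up. Queries that call SQL functions or create views would need a wider allow-list.

```python
import sqlite3

# Only plain SELECT statements and column reads are permitted.
ALLOWED_ACTIONS = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ}


def read_only_authorizer(action, arg1, arg2, db_name, source):
    """Allow reads, deny everything else (INSERT, DELETE, DROP, ...)."""
    return sqlite3.SQLITE_OK if action in ALLOWED_ACTIONS else sqlite3.SQLITE_DENY


def run_generated_query(conn: sqlite3.Connection, sql: str):
    """Run untrusted SQL with the read-only authorizer installed."""
    conn.set_authorizer(read_only_authorizer)
    return conn.execute(sql).fetchall()


# Set up sample data before the authorizer is installed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, account TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO entries (id, account, amount) VALUES (?, ?, ?)",
    [(1, "bank", 100), (2, "cash", 20), (3, "bank", 50)],
)
conn.commit()

rows = run_generated_query(
    conn, "SELECT id, amount FROM entries WHERE account = 'bank' ORDER BY id"
)
```

A generated `DELETE FROM entries` passed to `run_generated_query` fails with a `sqlite3` error instead of touching the table.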
https://kylekukshtel.com/incremental-determinism-heisenfunct...
A lot of this was also inspired by Ian Bicking's work here:
With AI Functions and post-conditions, we want to make this process more robust, ergonomic and cheaper: you don't always need a frontier model for ambiguous tasks. Smaller/faster agents can do the work if you have robust correctness checks.
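A toy sketch of the post-condition idea, to show why cheap attempts plus strong checks can work. This is not the AI Functions API: `run_with_postconditions` is a hypothetical helper, and the two hand-written "candidates" stand in for successive model attempts.

```python
from typing import Any, Callable, List, Sequence


def run_with_postconditions(
    candidates: Sequence[Callable[..., Any]],
    checks: Sequence[Callable[..., None]],
    *args: Any,
) -> Any:
    """Try candidates in order; return the first result that passes every check."""
    failures: List[str] = []
    for candidate in candidates:
        try:
            result = candidate(*args)
            for check in checks:
                check(result, *args)  # plain assertions; raise on violation
            return result
        except Exception as exc:
            failures.append(f"{candidate.__name__}: {exc!r}")
    raise RuntimeError("no candidate satisfied the post-conditions: " + "; ".join(failures))


# Post-conditions: plain Python assertions about the output.
def is_sorted(result, xs):
    assert all(a <= b for a, b in zip(result, result[1:])), "output is not sorted"


def same_items(result, xs):
    assert sorted(result) == sorted(xs), "output lost or invented items"


# Hypothetical attempts: the first is wrong, the second is right.
def buggy_sort(xs):
    return list(xs)


def correct_sort(xs):
    return sorted(xs)


result = run_with_postconditions(
    [buggy_sort, correct_sort], [is_sorted, same_items], [3, 1, 2]
)
```

The wrong attempt is rejected by `is_sorted` and the second one is accepted. If every attempt fails, the caller gets a `RuntimeError` listing why.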
On the roadmap: JIT-compiled functions that reuse previously generated code to cut costs, LLM-based backprop for learning/memory/prompt tuning, and strong remote sandboxing for code execution. We're focused on getting the DevX right before shipping these — happy to answer questions.
I'm sure there's a lot of effort put into this, god knows why, but I pray I never have to have this in a production environment I'm on.
https://github.com/Gabriella439/grace
It's still probably not a great idea.
These attempts at generating code that adheres to whatever spec, in Python of all languages, are futile and serve only to please investors.
There is a reason that really proving adherence to a spec or making arguments that the spec is reasonable in the first place is hard.
But hey, thinking is hard, let's go AI shopping.
For example, connecting to endpoints, etc... then the logic of your script can run.
nah, I'm skipping this update.