The high-level, declarative nature and type-driven development style of languages like Haskell also make it really easy for an experienced developer to review and validate the output of the LLM.
Early on in the GPT era I had really bad experiences generating Haskell code with LLMs, but I think the combination of improved models, increased context size, and agentic tooling has allowed LLMs to really take advantage of functional programming.
I've also seen multiple startups get some pretty impressive results with Lean and Rocq.
My current theory is that as long as the LLM has sufficiently good baseline performance in a language, the scaffolding and tooling you can build around the pure code generation will have an outsize effect. Languages with expressive type systems have a pretty direct advantage there: types can constrain the output and give immediate feedback to your system, letting you iterate the LLM generation faster and at a higher level than you could otherwise.
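To make the "types constrain and give immediate feedback" point concrete, here's a minimal Haskell sketch (the names are made up for illustration, not from any real project). A precise signature shrinks the space of programs the model can emit, and a wrong generation fails to typecheck before anything runs:

    -- By parametricity there are very few total functions with this type,
    -- so an incorrect generation is rejected by the compiler immediately.
    mapPair :: (a -> b) -> (a, a) -> (b, b)
    mapPair f (x, y) = (f x, f y)

    -- Cheap newtype wrappers turn "swapped the arguments" from a latent
    -- runtime bug into an instant type error an agent loop can react to.
    newtype UserId  = UserId Int
    newtype OrderId = OrderId Int

    fetchOrder :: UserId -> OrderId -> IO (Maybe String)
    fetchOrder (UserId _uid) (OrderId _oid) =
      pure Nothing  -- stub body; the signature is the point

The compiler gives the loop a crisp pass/fail signal on every generation, which is exactly the kind of fast, high-level feedback scaffolding can exploit.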
I recently saw a paper[1] about using types to directly constrain LLM output. The paper used TypeScript, but it seems like the same approach would work with other typed languages as well. Approaches like that make generating typed code with LLMs even more promising.
Abstract:
> Language models (LMs) can generate code but cannot guarantee its correctness, often producing outputs that violate type safety, program invariants, or other semantic properties. Constrained decoding offers a solution by restricting generation to only produce programs that satisfy user-defined properties. However, existing methods are either limited to syntactic constraints or rely on brittle, ad hoc encodings of semantic properties over token sequences rather than program structure.
> We present ChopChop, the first programmable framework for constraining the output of LMs with respect to semantic properties. ChopChop introduces a principled way to construct constrained decoders based on analyzing the space of programs a prefix represents. It formulates this analysis as a realizability problem which is solved via coinduction, connecting token-level generation with structural reasoning over programs. We demonstrate ChopChop's generality by using it to enforce (1) equivalence to a reference program and (2) type safety. Across a range of models and tasks, ChopChop improves success rates while maintaining practical decoding latency.
Actually, Haskell was a bit too hard for me to use on my own for real projects. Now, with AI assistants, I think it could be a great pick.
I think the underlying reason is that functional programming is very conducive to keeping the context tight and focused. For instance, most logic relevant to a task tends to be concentrated in a few functions and data structures across a smallish set of files. That's all you need to feed into the context.
Contrast that with, say, Java, where the logic is often spread across a deep inheritance hierarchy located in a bunch of separate files. Add to that large frameworks that encapsulate a whole lot of boilerplate and bespoke logic, with magic being injected from arbitrary places via e.g. annotations. You'd need to load all of those files (or, more likely, simply the whole codebase) and the relevant documentation to get accurate results. And even then the additional context is not just extraneous and expensive, but also polluted with irrelevant data that actually reduces accuracy.
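As a made-up illustration (not from any real codebase), here's the shape this takes in Haskell: a business rule where the data types and every function that touches them fit in one self-contained chunk, which is roughly all the context an LLM would need:

    -- A toy discount rule: the types and all the logic live together,
    -- so this snippet alone is enough context to modify the behavior.
    data Customer = Customer { custName :: String, loyaltyYears :: Int }

    data Discount = NoDiscount | Percent Int
      deriving Show

    discountFor :: Customer -> Discount
    discountFor c
      | loyaltyYears c >= 5 = Percent 15
      | loyaltyYears c >= 1 = Percent 5
      | otherwise           = NoDiscount

The typical Java equivalent would pull in an interface, a couple of implementations, and whatever the framework wires in behind the scenes.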
A common refrain of mine is that for the best results, you have to invest a lot of time experimenting AND adapt yourself to figure out what works best with AI. In my case, that meant gradually shifting to a functional style after spending my whole career writing OO code.
Seeing all the C-family languages and JavaScript at the bottom like this makes me wonder if it's not just that curly brackets take a lot of tokens.
`for (int index = 0; index < size; ++index)` instead of `for index in 0...size` eats up a lot of tokens, especially in C, where you also need this construct for iterating over arrays. `public` might have a token by itself, even though you can have `pub` occurring in other contexts, too.
But I had never considered that a programming language might be created that's less human-readable/auditable in order to enable LLMs.
Scares me a bit.
We're not building a language for LLMs just yet.
Working on it, actually! I think it's a really interesting problem space - being efficient on tokens, readable by humans for review, strongly typed and static for reasoning purposes, and having extremely regular syntax. One of the biggest issues with symbols is that, to a human, matching parentheses is relatively easy, but the models struggle with it.
I expect a language like the one I'm playing with will mature enough over the next couple years that models with a knowledge cutoff around 1/2027 will probably know how to program in it well enough for it to start being more viable.
One of the things I plan to do is build evals so that I can validate the performance of various models on my as-yet only partially baked language. I'm also using only LLMs to build out the entire infrastructure, mostly to see if it's possible.
If you’re going to write an article, at least do the basic research yourself, man.
So I'm not convinced this is the right metric, or, even if it is, that it's a metric you want to minimize.
For a very imperfect human analogy, it feels like saying "a student can spend as much time thinking about the text as they want, so the textbook can be extremely terse".
Definitely just gut feelings though - not well tested or anything. I could be wrong.
I am not sure token efficiency is an interesting problem in the long term, though.
And in the short term I wonder if prompts could be pre-compiled into "compressed tokens": the idea would be to use a smaller number of tokens to represent a frequently needed concept, kind of like LZ compression. Or maybe token compression becomes a feature of future models optimized for specific tasks.
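Purely as a sketch of that idea (this is not how real tokenizers work; they use learned subword vocabularies, and the phrase table here is invented), a dictionary pass that swaps frequent phrases for single placeholder tokens might look like:

    import Data.List (isPrefixOf)

    -- Invented macro table: frequent boilerplate phrases mapped to
    -- short placeholder tokens, LZ-dictionary style.
    macroTable :: [(String, String)]
    macroTable =
      [ ("for (int i = 0; i < n; ++i)", "<FOR_I_N>")
      , ("public static void",          "<PSV>")
      ]

    -- Replace every occurrence of each phrase with its placeholder.
    compress :: String -> String
    compress s = foldr replaceAll s macroTable
      where
        replaceAll (pat, rep) = go
          where
            go [] = []
            go str@(c:cs)
              | pat `isPrefixOf` str = rep ++ go (drop (length pat) str)
              | otherwise            = c : go cs

Of course, the model would have to be trained (or at least prompted) to understand the placeholders, so this only pays off for genuinely frequent constructs.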
I was wondering last year if it would be worthwhile trying to create a language that was especially LLM-friendly, e.g. one that embedded more context in the language structure. The idea is to make more of the program, and the thinking behind it, explicit to the LLM, but in a programming-language style to eliminate the ambiguity of natural language (one could just use comments).
Then it occurred to me that, with current LLM training methodology, there's a chicken-and-egg problem: it doesn't start to show rewards until there is a critical mass of good code in the language for LLMs to train on.
On https://danuker.go.ro/programming-languages.html you can find charts of popularity (TIOBE) vs code density for various programming languages together with which programming languages are Pareto-optimal regarding these two criteria.
Update: I noticed that the author mentions that "APL's famous terseness isn't a plus for LLMs." Isn't that just a design limitation of the LLM tokenizers?
[1]: https://github.com/ETHproductions/japt
Plus, they will strongly "pull" the context when the LLM parses it back, to the point of overriding your instructions (true story).
C is surprisingly efficient as well. Minimal keywords, terse syntax, single-character operators. Not much boilerplate, and the core logic is dense.
I think the worst languages are Java, C#, and Rust (lifetime annotations, verbose generics).
In my opinion, C or Go for imperative code, Factor / Forth if the model knows them well.
So: C tokenizes efficiently for equivalent logic, but its stdlib poverty (you end up hand-rolling basic functionality) makes it expensive for typical benchmark tasks. The same applies to Factor/Forth, arguably worse.
I cannot speak much for C#, but you may be right. Claude Opus is really good.
C# often has a 'nice' way and a 'performant' way of doing things. For example, strings are nice, but they allocate and are UTF-16, whereas ReadOnlySpan<byte> is faster for UTF-8 and can reuse buffers. The performant syntax often ends up being very verbose, while the nice syntax is barely shorter than Go's. Go also does the right thing by default: its strings are basically array slices into UTF-8 byte arrays.
re: tokens and session length, there are other ways to manage this than language choice. Summarization is one; something I do is to not put read_file content in the messages, but rather in the system prompt. This means that when the model tries to reread a file after an edit, we don't have two copies of the file in context.
Going to 10M-token sessions, keeping per-turn context under 100k, working in Golang... token usage doesn't seem like a good basis for choosing a language.
Claude Code makes some efforts to reduce context size, but at the end of the day it loads entire source files into context (then keeps them there until told to remove them, or until the context is compacted). One of the major wins is to run subagents for some tasks, which use their own context rather than loading more into CC's own context.
Cursor makes more efficient use of context by building a vector database of code chunks, then only loading matching chunks into context (I believe it does this for Composer/agentic use as well as for tab/autocomplete).
One of the more obvious ways to reduce context use in a larger multi-module codebase would be to take advantage of the split between small module definitions (e.g. C++ .h files) and large module implementations (.cpp files). Generally you'd only need to load module interfaces/definitions into context if you are working on code that uses the module, and Cursor's chunked approach can reduce that further.
For a whole-codebase overview, a language server can help locate things, and one could have the AI itself generate shortish summaries/overviews of source files and the codebase structure, similar to what a human developer might keep in their head, rather than repeatedly reading entire source files for code that isn't actually being modified.
It seems we're really in the early days of agentic coding tools, and they have a lot of room to get better and more efficient.
If you're interested in learning more, https://github.com/sibyllinesoft/scribe
> Smart code bundler that turns repositories into optimized code bundles meeting a token budget in milliseconds
Ok, so it's a tool. Do I use it on my repo once? Then what? Do I use it as I go? Does it sit somewhere accessible to something like Claude Code, with the onus on me to direct Claude to use it to search files instead of his out-of-the-box workflow? I can see some CLI examples, but what should I do with them? Where does this fit into what people are using with Cursor / Claude / Gemini, etc.?
This is the part I've been trying to hammer home about LLM-created stuff. It leaves us with vague, not-well-understood outcomes that might do something. People are shipping/delivering things they don't even understand now, and they often can't speak to what their thing does with an acceptable level of authority. I'm not against creating tools with LLMs, but I'm actually pretty against people creating the basic readme with LLMs. Wanna make a tool with an LLM? More power to you. But make sure you understand what was made, because we need humans in here telling other humans how to use it. LLMs flat out lose the plot over the course of a large project, and I think a big issue is that LLMs can sometimes be more eloquent at writing than a lot of people, so people opt for the LLM-generated readme.
But as someone who would maybe consider using something like this, I see that readme and it just looks like every Claude Code thing I've put together to date. Which is to say: I've done some seemingly impossible things with Claude, only to find that his ability to recap the entirety of it ended up as a whole lot of seemingly meaningful words and phrases and sentences that actually paint a super disjointed picture of what exactly a repo is about.
Because that’s what happened in the real world when generating a bunch of untyped Python code.
But I would love for more expressive and compact languages to do better, selfish as I am. I think training data size is more of a factor, though, and we won't all be moving to Clojure any time soon.
E.g. when it comes to authoring code, C is by far one of the languages that LLMs excel most at.
Those are pretty terse.
For example, I shared some Model code with Claude and Gemini (both via their web interfaces) and they both tried to put Controller code into the Model, despite me telling them multiple times that the code wasn't wanted nor needed in there.
I had to (eventually) share the entire project with the models (despite them having been working with the code all along) before they would comply with my request (whilst also congratulating me on my far superior architecture...).
That costs more tokens for each problem than just saying "here, look at this section and work toward this goal".