The high-level, declarative nature and type-driven development style of languages like Haskell also make it really easy for an experienced developer to review and validate the output of the LLM.
Early on in the GPT era I had really bad experiences generating Haskell code with LLMs, but I think the combination of improved models, increased context size, and agentic tooling has allowed LLMs to really take advantage of functional programming.
I've also seen multiple startups get some pretty impressive results with Lean and Rocq.
My current theory is that as long as the LLM has sufficiently good baseline performance in a language, the scaffolding and tooling you can build around the pure code generation will have an outsize effect. Languages with expressive type systems have a pretty direct advantage there: types can constrain the output and give immediate feedback to your system, letting you iterate the LLM generation faster and at a higher level than you could otherwise.
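To make the "types constrain and give immediate feedback" point concrete, here's a minimal Haskell sketch (the names are made up for illustration, not from any real project). A precise signature shrinks the space of programs the model can emit, and a wrong generation fails to typecheck before anything runs:

    -- By parametricity there are very few total functions with this type,
    -- so an incorrect generation is rejected by the compiler immediately.
    mapPair :: (a -> b) -> (a, a) -> (b, b)
    mapPair f (x, y) = (f x, f y)

    -- Cheap newtype wrappers turn "swapped the arguments" from a latent
    -- runtime bug into an instant type error an agent loop can react to.
    newtype UserId  = UserId Int
    newtype OrderId = OrderId Int

    fetchOrder :: UserId -> OrderId -> IO (Maybe String)
    fetchOrder (UserId _uid) (OrderId _oid) =
      pure Nothing  -- stub body; the signature is the point

The compiler gives the loop a crisp pass/fail signal on every generation, which is exactly the kind of fast, high-level feedback scaffolding can exploit.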
I recently saw a paper[1] about using types to directly constrain LLM output. The paper used TypeScript, but it seems like the same approach would work with other typed languages as well. Approaches like that make generating typed code with LLMs even more promising.
Abstract:
> Language models (LMs) can generate code but cannot guarantee its correctness, often producing outputs that violate type safety, program invariants, or other semantic properties. Constrained decoding offers a solution by restricting generation to only produce programs that satisfy user-defined properties. However, existing methods are either limited to syntactic constraints or rely on brittle, ad hoc encodings of semantic properties over token sequences rather than program structure.
> We present ChopChop, the first programmable framework for constraining the output of LMs with respect to semantic properties. ChopChop introduces a principled way to construct constrained decoders based on analyzing the space of programs a prefix represents. It formulates this analysis as a realizability problem which is solved via coinduction, connecting token-level generation with structural reasoning over programs. We demonstrate ChopChop's generality by using it to enforce (1) equivalence to a reference program and (2) type safety. Across a range of models and tasks, ChopChop improves success rates while maintaining practical decoding latency.
Actually, Haskell was a bit too hard for me to use on my own for real projects. Now, with AI assistants, I think it could be a great pick.
I think the underlying reason is that functional programming is very conducive to keeping the context tight and focused. For instance, most logic relevant to a task tends to be concentrated in a few functions and data structures across a smallish set of files. That's all you need to feed into the context.
Contrast that with, say, Java, where the logic is often spread across a deep inheritance hierarchy located in a bunch of separate files. Add to that large frameworks that encapsulate a whole lot of boilerplate and bespoke logic, with magic being injected from arbitrary places via e.g. annotations. You'd need to load all of those files (or, more likely, simply the whole codebase) and the relevant documentation to get accurate results. And even then the additional context is not just extraneous and expensive, but also polluted with irrelevant data that actually reduces accuracy.
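As a made-up illustration (not from any real codebase), here's the shape this takes in Haskell: a business rule where the data types and every function that touches them fit in one self-contained chunk, which is roughly all the context an LLM would need:

    -- A toy discount rule: the types and all the logic live together,
    -- so this snippet alone is enough context to modify the behavior.
    data Customer = Customer { custName :: String, loyaltyYears :: Int }

    data Discount = NoDiscount | Percent Int
      deriving Show

    discountFor :: Customer -> Discount
    discountFor c
      | loyaltyYears c >= 5 = Percent 15
      | loyaltyYears c >= 1 = Percent 5
      | otherwise           = NoDiscount

The typical Java equivalent would pull in an interface, a couple of implementations, and whatever the framework wires in behind the scenes.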
A common refrain of mine is that for the best results, you have to invest a lot of time experimenting AND adapt yourself to figure out what works best with AI. In my case, that meant gradually shifting to a functional style after spending my whole career writing OO code.
Seeing all the C-family languages and JavaScript at the bottom like this makes me wonder if it's not just that curly brackets take a lot of tokens.
`for (int index = 0; index < size; ++index)` instead of `for index in 0...size` eats up a lot of tokens, especially in C, where you also need this construct for iterating over arrays. `public` might have a token by itself, even though you can have `pub` occurring in other contexts, too.
But I had never considered that a programming language might be created that's less human-readable/auditable in order to enable LLMs.
Scares me a bit.
We're not building a language for LLMs just yet.
Working on it, actually! I think it's a really interesting problem space - being efficient on tokens, readable by humans for review, strongly typed and static for reasoning purposes, and having extremely regular syntax. One of the biggest issues with symbols is that, to a human, matching parentheses is relatively easy, but the models struggle with it.
I expect a language like the one I'm playing with will mature enough over the next couple years that models with a knowledge cutoff around 1/2027 will probably know how to program in it well enough for it to start being more viable.
One of the things I plan to do is build evals so that I can validate the performance of various models on my as-yet only partially baked language. I'm also using only LLMs to build out the entire infrastructure, mostly to see if it's possible.
If you’re going to write an article, at least do the basic research yourself, man.
So I'm not convinced this is the right metric, or, even if it is, that it's a metric you want to minimize.
For a very imperfect human analogy, it feels like saying "a student can spend as much time thinking about the text as they want, so the textbook can be extremely terse".
Definitely just gut feelings though - not well tested or anything. I could be wrong.
I am not sure token efficiency is an interesting problem in the long term, though.
And in the short term I wonder if prompts could be pre-compiled into "compressed tokens": the idea would be to use a smaller number of tokens to represent a frequently needed concept, kind of like LZ compression. Or maybe token compression becomes a feature of future models optimized for specific tasks.
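Purely as a sketch of that idea (this is not how real tokenizers work; they use learned subword vocabularies, and the phrase table here is invented), a dictionary pass that swaps frequent phrases for single placeholder tokens might look like:

    import Data.List (isPrefixOf)

    -- Invented macro table: frequent boilerplate phrases mapped to
    -- short placeholder tokens, LZ-dictionary style.
    macroTable :: [(String, String)]
    macroTable =
      [ ("for (int i = 0; i < n; ++i)", "<FOR_I_N>")
      , ("public static void",          "<PSV>")
      ]

    -- Replace every occurrence of each phrase with its placeholder.
    compress :: String -> String
    compress s = foldr replaceAll s macroTable
      where
        replaceAll (pat, rep) = go
          where
            go [] = []
            go str@(c:cs)
              | pat `isPrefixOf` str = rep ++ go (drop (length pat) str)
              | otherwise            = c : go cs

Of course, the model would have to be trained (or at least prompted) to understand the placeholders, so this only pays off for genuinely frequent constructs.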
I was wondering last year if it would be worthwhile trying to create a language that was especially LLM-friendly, e.g. one that embedded more context in the language structure. The idea is to make more of the program, and the thinking behind it, explicit to the LLM, but in a programming-language style to eliminate the ambiguity of natural language (one could just use comments).
Then it occurred to me that, with current LLM training methodology, there's a chicken-and-egg problem: it doesn't start to show rewards until there is a critical mass of good code in the language for LLMs to train on.
On https://danuker.go.ro/programming-languages.html you can find charts of popularity (TIOBE) vs code density for various programming languages together with which programming languages are Pareto-optimal regarding these two criteria.
Update: I noticed that the author mentions that "APL's famous terseness isn't a plus for LLMs." Isn't that just a design limitation of the LLM tokenizers?
[1]: https://github.com/ETHproductions/japt
Plus, they will strongly "pull" the context when the LLM parses it back, to the point of overriding your instructions (true story).
C is surprisingly efficient as well. Minimal keywords, terse syntax, single-character operators. Not much boilerplate, and the core logic is dense.
I think the worst languages are Java, C#, and Rust (lifetime annotations, verbose generics).
In my opinion, C or Go for imperative code, Factor / Forth if the model knows them well.
So: C tokenizes efficiently for equivalent logic, but its stdlib poverty (you end up hand-rolling basic functionality) makes it expensive for typical benchmark tasks. The same applies to Factor/Forth, arguably worse.
I cannot speak much for C#, but you may be right. Claude Opus is really good.
C# often has a 'nice' way and a 'performant' way of doing things. For example, strings are nice, but they allocate and are UTF-16, whereas ReadOnlySpan<byte> is faster for UTF-8 and can reuse buffers. The performant syntax often ends up being very verbose, while the nice syntax is barely shorter than Go's. Go also does the right thing by default: its strings are basically array slices into UTF-8 byte arrays.
re: tokens and session length, there are other ways to manage this than language choice. Summarization is one; something I do is to not put read_file content in the messages, but rather in the system prompt. This means that when the model tries to reread a file after an edit, we don't have two copies of the file in context.
Going to 10M-token sessions, keeping per-turn context under 100k, working in Golang... token usage doesn't seem like a good basis for choosing a language.
Claude Code makes some efforts to reduce context size, but at the end of the day it loads entire source files into context (then keeps them there until told to remove them, or until the context is compacted). One of the major wins is to run subagents for some tasks, which use their own context rather than loading more into CC's own context.
Cursor makes more efficient use of context by building a vector database of code chunks, then only loading matching chunks into context (I believe it does this for Composer/agentic use as well as for tab/autocomplete).
One of the more obvious ways to reduce context use in a larger multi-module codebase would be to take advantage of the split between small module definitions (e.g. C++ .h files) and large module implementations (.cpp files). Generally you'd only need to load module interfaces/definitions into context if you are working on code that uses the module, and Cursor's chunked approach can reduce that further.
For a whole-codebase overview, a language server can help locate things, and one could have the AI itself generate shortish summaries/overviews of source files and the codebase structure, similar to what a human developer might keep in their head, rather than repeatedly reading entire source files for code that isn't actually being modified.
It seems we're really in the early days of agentic coding tools, and they have a lot of room to get better and more efficient.
If you're interested in learning more, https://github.com/sibyllinesoft/scribe
> Smart code bundler that turns repositories into optimized code bundles meeting a token budget in milliseconds
Ok, so it's a tool. Do I use it on my repo once? Then what? Do I use it as I go? Does it sit somewhere accessible to something like Claude Code, with the onus on me to direct Claude to use it to search files instead of his out-of-the-box workflow? I can see some CLI examples, but what should I do with them? Where does this fit into what people are using with Cursor / Claude / Gemini, etc.?
This is the part I've been trying to hammer home about LLM-created stuff. It leaves us with vague, not-well-understood outcomes that might do something. People are shipping/delivering things they don't even understand now, and they often can't speak to what their thing does with an acceptable level of authority. I'm not against creating tools with LLMs, but I'm actually pretty against people creating the basic readme with LLMs. Wanna make a tool with an LLM? More power to you. But make sure you understand what was made, because we need humans in here telling other humans how to use it. LLMs flat out lose the plot over the course of a large project, and I think a big issue is that LLMs can sometimes be more eloquent at writing than a lot of people, so people opt for the LLM-generated readme.
But as someone who would maybe consider using something like this, I see that readme and it just looks like every Claude Code thing I've put together to date. Which is to say: I've done some seemingly impossible things with Claude, only to find that his ability to recap the entirety of it ended up as a whole lot of seemingly meaningful words and phrases and sentences that actually paint a super disjointed picture of what exactly a repo is about.
Because that’s what happened in the real world when generating a bunch of untyped Python code.
But I would love for more expressive and compact languages to do better, selfish as I am. I think training data size is more of a factor, though, and we won't all be moving to Clojure any time soon.
E.g. when it comes to authoring code, C is by far one of the languages that LLMs excel most at.
Those are pretty terse.
For example, I shared some Model code with Claude and Gemini (both via their web interfaces) and they both tried to put Controller code into the Model, despite me telling them multiple times that the code wasn't wanted nor needed in there.
I had to (eventually) share the entire project with the models (despite them having been working with the code all along) before they would comply with my request (whilst also congratulating me on my far superior architecture...).
That costs more tokens for each problem than just saying "here, look at this section and work toward this goal".