Because code LLMs are trained on the syntactic form of programs, not their execution, it isn't correct to say the model "understands nullability", even if the correlation between variable annotations and requested completions were perfect (it isn't). Nullability means that under execution the variable in question can become null, which is not a state a model trained only on the syntax of a million programs can "understand". You could get the same result if, say, "Optional" meant the variable becomes poisonous, checking "> 0" were eating it, and "!= None" were the antidote. Human programmers can understand nullability because they have, hopefully, run programs and grasp the semantics of making something null.
The paper could use precise, scientific language (e.g. "the presence of nullable annotation tokens correlates with activation of vectors corresponding to, and emission of, null-check tokens with high precision and accuracy"), which would help us understand what we can rely on the LLM to do and what we can't. But it seems there is some subconscious incentive to muddy how people see these models, in the hope that we start ascribing capabilities to them that they don't have.
The critiques of mental states applied to LLMs are increasingly applicable to us biologicals, and that's the philosophical abyss we're staring down.
I would go much further than this; but this is a de minimis criterion that the LLM already fails.
What zealots eventually discover is that they can hold their "fanatical proposition" fixed in the face of all opposition to the contrary, by tearing down the whole edifice of science, knowledge, and reality itself.
If you wish to assert, against any reasonable thought, that the sky is a pink dome, you can do so -- by asserting first that our eyes are broken, and then, eventually, that we live in some paranoid "philosophical abyss" carefully constructed to permit your paranoia.
This absurdity is exhausting, and I wish one day to find fanatics who would realise it quickly and abandon it -- but alas, I never have.
If you find yourself hollowing out the meaning of words to the point of making no distinctions, denying reality to reality itself, and otherwise arriving at a "philosophical abyss", be aware that it is your cherished propositions which are the madness and nothing else.
Here: no, the LLM does not understand. Yes, we do. It is your job to begin from reasonable premises and abduce reasonable theories. If you do not, you will not.
Of course, now you can say "how do you know that our brains are not just efficient computers that run LLMs", but I feel like the onus of proof lies on the makers of this claim, not on the other side.
It is very likely that human intelligence is not just autocomplete on crack, given all we know about neuroscience so far.
Certainly many ancient people worshiped celestial objects or crafted idols by their own hands and ascribed to them powers greater than themselves. That doesn't really help in the long run compared to taking personal responsibility for one's own actions and motives, the best interests of their tribe or community, and taking initiative to understand the underlying cause of mysterious phenomena.
One of the very first tests I did of ChatGPT way back when it was new was give it a relatively complex string manipulation function from our code base, strip all identifying materials from the code (variable names, the function name itself, etc), and then provide it with inputs and ask it for the outputs. I was surprised that it could correctly generate the output from the input.
So it does have some idea of what the code actually does, not just its syntax.
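A toy illustration of that kind of test (a made-up stand-in, not the actual function from the anecdote): strip every meaningful identifier from a small string routine and ask the model to predict the output for a given input.

```python
# All identifiers stripped to meaningless names, as described above.
# The test: give the model this code plus an input, and ask for the output.
def f(a):
    b = a.split()                # split on whitespace
    c = [x[::-1] for x in b]     # reverse each word
    return " ".join(c)           # rejoin with single spaces

print(f("hello world"))          # a model "executing" this should say: olleh dlrow
```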
What makes you think this? It has been trained on plenty of logs and traces, discussion of the behavior of various code, REPL sessions, etc. Code LLMs are trained on all human language and wide swaths of whatever machine-generated text is available, they are not restricted to just code.
So, what word would you propose we use to mean "an LLM's ability (or lack thereof) to output generally correct sentences about the topic at hand"?
At least, it's likely that they've been trained on undergrad textbooks that explain program behaviors and contain exercises.
Here's the Stanford Encyclopedia of Philosophy's superb write-up: https://plato.stanford.edu/entries/chinese-room/ It covers, engagingly and with playful wit, not just Searle's original argument but its evolution in his writing, other philosophers' responses and criticisms, and Searle's counter-responses.
It would be good to list a few possible ways of interpreting 'understanding of code'. It could possibly include: 1) type inference for the result, 2) nullability, 3) runtime asymptotics, 4) what the code does.
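For concreteness, here is a hypothetical snippet (my own example, not from the paper) that could be "understood" in each of those four senses:

```python
from typing import Optional

def first_positive(xs: list[int]) -> Optional[int]:
    # (3) runtime asymptotics: a single O(len(xs)) scan
    for x in xs:
        if x > 0:
            return x      # (1) inferred result type: int on this path
    return None           # (2) nullability: None when no positive exists

# (4) "what the code does": returns the first positive element of xs, or None
```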
Nobody interrogates each other's internal states when judging whether someone understands a topic. All we can judge it based on are the words they produce or the actions they take in response to a situation.
The way that systems or people arrive at a response is sort of an implementation detail that isn't that important when judging whether a system does or doesn't understand something. Some people understand a topic on an intuitive, almost unthinking level, and other people need to carefully reason about it, but they both demonstrate understanding by how they respond to questions about it in the exact same way.
To not do that is commonly associated with things like being on the spectrum or cognitive deficiencies.
I'd be really curious to see where the "attention" heads of the LLM look when evaluating the nullability of any given token. Does it just trust the Optional[int] return type signature of the function, or does it also skim through the function contents to check whether that's correct?
It's fascinating to me to think that the senior developer skillset of being able to skim through complicated code, mentally make note of different tokens of interest where assumptions may need to be double-checked, and unravel that cascade of assumptions to track down a bug, is something that LLMs already excel at.
Sure, nullability is an example where static type checkers do well, and it makes the article a bit silly on its own... but there are all sorts of assumptions that aren't captured well by type systems. There's been a ton of focus on LLMs for code generation; I think that LLMs for debugging makes for a fascinating frontier.
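One concrete case where the signature alone misleads (a hypothetical example to illustrate the signature-vs-body question): the annotation promises Optional[int], but the body can never actually return None, so anything that only trusts the signature would demand a pointless None check.

```python
from typing import Optional

# The annotation is over-broad: d.get(key, 0) always yields an int,
# so the None branch of Optional[int] is unreachable here. A reader
# (human, probe, or LLM) that skims the body can see this; one that
# only trusts the signature cannot.
def lookup(d: dict[str, int], key: str) -> Optional[int]:
    return d.get(key, 0)
```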
I'm curious if this probing of nullability could be composed with other LLM/ML-based python-typing tools to improve their accuracy.
Maybe even focusing on interfaces such as nullability rather than precise types would work better with a duck-typed language like Python than inferring types directly (i.e. we don't really care that a variable is an int specifically, but rather that it supports __add__ or __sub__ etc., that it is numeric).
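A sketch of what that interface-level view might look like, using typing.Protocol to describe "supports arithmetic" rather than "is an int" (the protocol name and members here are illustrative, not from any existing tool):

```python
from typing import Protocol, runtime_checkable

# Structural ("duck") typing: membership is decided by which dunder
# methods an object actually has, not by its nominal class.
@runtime_checkable
class SupportsArith(Protocol):
    def __add__(self, other): ...
    def __sub__(self, other): ...

# int and float satisfy the protocol; str does not (it has __add__
# for concatenation but no __sub__).
```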
my brother in christ, you invented Typescript.
(I agree on the visualization, it's very cool!)
The LLM will not understand, and is incapable of developing an understanding, of a concept not present in its training data.
If you try to teach it the basics of the misunderstood concept in your chat, it will reflect back a verbal acknowledgement, restated in different words, with some smoothly worded embellishments which look like the external trappings of understanding. It's only a mirage, though.
The LLM will code anything, no matter how novel, if you give it detailed enough instructions and clarifications. That's just a language translation task from pseudo-code to code. Being a language model, it's designed for that.
LLM is like the bar waiter who has picked up on economics and politics talk, and is able to interject with something clever sounding, to the surprise of the patrons. Gee, how does he or she understand the workings of the international monetary fund, and what the hell are they doing working in this bar?
Tony Hoare called it "a billion-dollar mistake" (https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retra...), and Rust made core design choices precisely to avoid this mistake.
In practical AI-assisted coding in TypeScript I have found that it is good to add in Cursor Rules to avoid anything nullable, unless it is a well-designed choice. In my experience, it makes code much better.
But in Typescript, who cares? You’d be forced to handle null the same way you’d be forced to handle Maybe<T> = None | Just<T> except with extra, unidiomatic ceremony in the latter case.
If a language has Maybe<T> = None | Just<T> as a core concept, then it's idiomatic by definition.

Double descent?
"LLM" is a valid category of thing in the world, but it's not a thing like Microsoft Outlook that has well-defined capabilities and limitations. It's frustrating reading these discussions that constantly devolve into one person saying they tried something that either worked or didn't, then 40 replies from other people saying they got the opposite result, possibly with a different model, different version, slight prompt altering, whatever it is.
LLMs possibly have the capability to understand nullability, but that doesn't mean every instance of every model will consistently understand that or anything else. This is the same way humans operate.

Humans can run a 4-minute mile. Humans can run a 10-second 100 meter dash. Humans can develop and prove novel math theorems. But not all humans, not all the time; performance depends upon conditions, timing, luck, and there has probably never been a single human who can do all three. It takes practice in one specific discipline to get really good at it, and this practice competes with or even limits other abilities.

For LLMs, this manifests in differences in the way they get fine-tuned and respond to specific prompt sequences that should all be different ways of expressing the same command or query but nonetheless produce different results. This is very different from the way we are used to machines and software behaving.
Would be fun if they also "cancelled the nullability direction"... the LLMs would probably start hallucinating new explanations for what is happening in the code.
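Mechanically, "cancelling a direction" usually means projecting it out of the hidden state. A minimal sketch, assuming a linear probe has already given us a nullability direction (hypothetical setup, pure Python for clarity):

```python
import math

# Remove the component of `hidden` along `direction`, leaving the
# rest of the representation untouched. After this, a linear probe
# along `direction` reads zero.
def ablate(hidden: list[float], direction: list[float]) -> list[float]:
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    proj = sum(h * u for h, u in zip(hidden, unit))
    return [h - proj * u for h, u in zip(hidden, unit)]
```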
vec(King) - vec(Man) + vec(Woman) = vec(Queen)

The code is entirely wrong. It validates something that's close to a NANP number but isn't actually a NANP number. In particular, the area code cannot start with 0 or 1, nor can the central office code. There are several numbers, like 911, which have special meaning and cannot appear in either position.
You'd get better results if you went to Stack Overflow and stole the correct answer yourself. Would probably be faster too.
This is why "non technical code writing" is a terrible idea. The underlying concept is explicitly technical. What are we even doing?
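For reference, a stricter sketch reflecting the rules mentioned above (area code and central office code start with 2-9, and neither may be an N11 service code). Real NANP rules include further reserved ranges this deliberately does not cover:

```python
import re

# 10-digit US/Canada numbers, optional parens/space/dot/dash separators.
# Both the area code and the central office (exchange) code must begin
# with 2-9, which the regex enforces.
NANP_RE = re.compile(r"^\(?([2-9]\d{2})\)?[ .-]?([2-9]\d{2})[ .-]?(\d{4})$")

def is_valid_nanp(number: str) -> bool:
    m = NANP_RE.match(number.strip())
    if not m:
        return False
    area, exchange, _ = m.groups()
    # N11 service codes (211, 311, ..., 911) cannot appear in either position.
    if area[1:] == "11" or exchange[1:] == "11":
        return False
    return True
```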
From: ‘Keep training it, though, and eventually it will learn to insert the None test’
To: ‘Keep training it, though, and eventually the probability of inserting the None test goes up to xx%’
The former is just horse poop; we all know LLMs generate output with high variance.
When challenged, everybody becomes an eliminative materialist even if it's inconsistent with their other views. It's very weird.
Any comparison to the human brain is missing the point that an LLM only simulates one small part, and that's notably not the frontal lobe. That's required for intelligence, reasoning, self-awareness, etc.
So, no, it's not a question of philosophy. For an AI to enter that realm, it would need to be more than just an LLM with some bells and whistles; an LLM plus something else, perhaps, something fundamentally different which does not yet currently exist.
You can say this is not 'real' understanding, but you, like many others, will be unable to clearly distinguish this 'fake' understanding from 'real' understanding in a verifiable fashion, so you are just playing a game of meaningless semantics.
You really should think about what kind of difference is supposedly so important yet will not manifest itself in any testable way - an invented one.
Let's talk about what they can do and where that's trending.
They are not reasoning.