If we see LLMs as substantial compressed representations of human knowledge/thought/speech/expression—and within that, a representation of the world around us—then dictionary concepts that meaningfully explain this compressed representation should also share structure with human experience.
I don’t mean to take this canonically, it’s representations all the way down, but I can’t help but wonder what the geometry of this dictionary concept space says about us.
My observation is, and this may be more philosophical than technical: this process of "decomposing" middle-layer activations with a sparse autoencoder -- is it accurately capturing underlying features in the latent space of the network, or are we drawing order from chaos, imposing monosemanticity where there is none? Or to put it another way: were the features always there, learnt during training, or is this post-hoc rationalisation -- the features exist because that's how we defined the autoencoder's dictionaries, and we learn only what we wanted to learn? Are the alien minds of LLMs truly operating on a semantic space similar to ours, or are we reading tea leaves and seeing what we want to see?
Maybe this distinction doesn't even make sense to begin with; concepts are made by man. If clamping one of these features modifies outputs in a way that is understandable to humans, it doesn't matter whether it's capturing some kind of underlying cluster in the model's latent space. But I do think it's an interesting idea to ponder.
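For concreteness, here's roughly what that decomposition looks like mechanically: a minimal sparse-autoencoder sketch in numpy, where the dimensions, weights, and L1 coefficient are all toy placeholders rather than anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 512, 4096        # toy sizes, not the paper's
W_enc = rng.normal(0, 0.02, (d_model, n_features))
W_dec = rng.normal(0, 0.02, (n_features, d_model))
b_enc = np.zeros(n_features)

def sae_decompose(activation):
    """Encode a middle-layer activation into sparse feature coefficients."""
    f = np.maximum(activation @ W_enc + b_enc, 0.0)  # ReLU keeps most entries at zero
    reconstruction = f @ W_dec                       # rebuild the activation from features
    return f, reconstruction

x = rng.normal(size=d_model)           # stand-in for a real residual-stream activation
f, x_hat = sae_decompose(x)

# Training minimizes reconstruction error plus an L1 sparsity penalty,
# which is what pushes each dictionary entry toward monosemanticity:
l1_coeff = 1e-3                        # placeholder hyperparameter
loss = np.sum((x - x_hat) ** 2) + l1_coeff * np.sum(np.abs(f))
```

Whether the sparse, human-legible f that falls out of this reflects structure the model already had, or structure the sparsity penalty imposes, is exactly the question above.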
I'll make a probably bad analogy: does your mindmap place things near each other like my mindmap?
To which I'd say: probably not. Mindmaps are very personal, and the more complexity we put into ours, the more personal and arbitrary they become, and the less the visual layout matters.
e.g. if we have 3 million things on both our mindmaps, it's peering too closely to wonder why you put McDonald's closer to kids' food than to restaurants, with restaurants in the top left, whereas I put it closer to kids' food, in the top mid-left.
It would make sense for the human mental latent spaces to also converge. The reason is that the latent space exists to model the environment, which is largely shared among humans.
More than that, I'd think a better 2D analogy for the latent space is a force-directed graph that you keep shaking as you add things to it. It doesn't seem unlikely for two such graphs, constructed in different order, to still end up identical in the end.
Thirdly:
> if we have 3 million things on both our mindmaps, it's peering too closely to wonder why you put McDonald's closer to kids' food than to restaurants, with restaurants in the top left, whereas I put it closer to kids' food, in the top mid-left.
In the 2D analogy, maybe, but that's because of limited space. In a 20,000-dimensional analogy, there's no reason for our mind maps to meaningfully differ here; there are enough dimensions that a term can be close to other terms for any relationship you could think of.
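That "enough dimensions" point is easy to check numerically: random directions in high-dimensional space are almost always nearly orthogonal, so huge numbers of terms can each sit close to their own relations without crowding each other. A quick sketch (all sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def max_abs_cosine(dim, n_vectors=1000):
    """Largest pairwise |cosine similarity| among random unit vectors."""
    v = rng.normal(size=(n_vectors, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    sims = v @ v.T
    np.fill_diagonal(sims, 0.0)
    return np.abs(sims).max()

for dim in (2, 20, 20_000):
    print(dim, max_abs_cosine(dim))

# In 2D, some pair of the 1000 points is almost perfectly aligned (~1.0);
# in 20,000D even the *worst* pair is nearly orthogonal (roughly 0.04),
# leaving room for any relationship you could think of.
```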
I find this statement... controversial? [0]
The canonical example would be mathematics -- is it discovered or invented? Does the idea of '3', or an empty set, or a straight line exist without any humans thinking about it? Is it even necessary to have any kind of universe at all for these concepts to be valid? I think the answers here are 'yes' and 'no'.
Of course, there are still concepts which require grounding in the universe or humanity, but if you can think these up first (...somehow), you should need neither.
Yes, maths is an interesting (and open) question. But also, the rules of maths are the result of some set of axioms — it's not clear to me[1] that the axioms we have are necessarily the ones we must have, even though ours are clearly a really useful set.
We put labels onto the world to make it easier to deal with, but every time I look closer at any concept which has a physical reality associated with it, I find that it's unclear where the boundary should be.
What's a "word"? Does hyphenation or concatenation modify the boundary? What if it was concatenated in a different language and the meaning of the concatenation was loaned separately to the parts, e.g. "schadenfreude"? Was "Brexit" still a word before it was coined — and if yes then what else is, and if no then when did it become a word?
What's a "fish"? Dolphins are mammals, jellyfish have no CNS, molluscs glue themselves to a rock and digest their own brain.
What's a "species"? Not all mules are sterile.
Where's the cut-off between a fertilised human egg and a person? And on the other end, when does death happen?
What counts as "one" anglerfish, given the reproductive cycle has males attaching to and dissolving into the females?
There's only a smooth gradient with no sudden cut-offs going from dust to asteroids to minor planets to rocky planets to gas giants to brown dwarf stars.
There aren't really seven colours in the rainbow, and we have a lot more than five senses — there's not really a good reason to group "pain" and "gentle pressure" as both "touch", except to make it five.
[0] giving rise or likely to give rise to public disagreement
[1] however this is quite possibly due to me being wildly oblivious; the example I'd use is that one of Euclid's axioms (the parallel postulate) turned out to be independent of the others -- you can drop it and still get consistent, non-Euclidean geometries -- but so far as I am aware all the others are considered unavoidable?
The prompt itself can trigger the features, so if you say "Try to weave in mentions of San Francisco" the San Francisco feature will be more activated in the response. But having a global equalizer could reduce drift as the conversation continued, perhaps?
Over the next year or so I'm sure it will be refined enough to act more like a vector multiplier on activations, but simply flipping a feature on across the board is going to create a very 'obsessed' model, as stated.
I was pretty upset seeing the superalignment team dissolve at OpenAI, but as is typical for the AI space, the news of one day was quickly eclipsed by the next day.
Anthropic are really killing it right now, and it's very refreshing seeing their commitment to publishing novel findings.
I hope this finally serves as the nail in the coffin for the "it's just fancy autocomplete" and "it doesn't understand what it's saying, bro" rhetoric.
No matter what, there will always be a group of people saying that. The power and drive of the brain to convince itself that it is woven of magical energy on a divine substrate shouldn't be underestimated. Especially when media plays so hard into that idea (the robots that lose the war because they cannot overcome love, etc.), because brains really love being told they are right.
I am almost certain that the first conscious silicon (or whatever material) will be subjected to immense suffering until a new generation that can accept the human brain's banality can move things forward.
> I am almost certain that the first conscious silicon (or whatever material) will be subjected to immense suffering until a new generation that can accept the human brain's banality can move things forward.
Indeed, though as we don't know what we're doing (and have 40 definitions of "consciousness" and no way to test for qualia), I would add that the first AI we make with these properties will likely suffer from every permutation of severe and mild mental health disorder that is logically possible, including many we have no word for because they would be incompatible with life if found in an organic brain.
I don’t think this paper does much in the way of your final point, “it doesn’t understand what it’s saying”, though our understanding certainly has improved.
What kind of evidentiary threshold would you want if that's not sufficient?
The virus does not hate you, nor does it love you, but you are made of atoms which it can use for something else.
And telling me "just do both" is enforcing your world view and that is precisely what we're talking about _not_ doing.
An LLM has no goals -- it's just a machine optimized to minimize training error, although I suppose you could view this as an innate, hard-coded goal of minimizing next-word error (relative to the training set), in the same way we might say a machine-like insect has some "goals".
Of course, RLHF provides a longer-time-span error to minimize (the entire response vs. the next word), but I doubt the training volume is enough for the model to internally model a goal of manipulating the listener, as opposed to just favoring certain surface forms of response.
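For what it's worth, the "minimize next-word error" objective really is just cross-entropy against the training text; a toy version with made-up numbers:

```python
import numpy as np

def next_token_loss(logits, target_ids):
    """Average cross-entropy of predicting each next token -- the LM objective."""
    logits = logits - logits.max(axis=-1, keepdims=True)               # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 10))        # model's predictions: 5 positions, 10-token vocab
targets = rng.integers(0, 10, size=5)    # what the training text actually says next
print(next_token_loss(logits, targets))  # training only ever nudges weights to shrink this
```

Any "goal" beyond shrinking that number would have to be implicit rather than hard-coded.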
Perhaps at some point LLMs will start to evolve from the prompt->response model into something more asynchronous and with some activity happening in the background too.
But simply by approximating human communication, which often models goal-oriented behavior, an LLM can have implicit goals -- ones that likely vary widely according to conversation context.
Implicit goals can be very effective. Nowhere in DNA is there any explicit goal to survive. However, combinations of genes and markers selected for survivability create creatures with implicit goals to survive that are as tenacious as any explicit goal might be.
- LLMs just got a whole set of buttons you can push. Potential for the LLM to push its own buttons?
- Read the paper and ctrl+F 'deplorable'. This shows once again how we underestimate LLMs' ability to appear conscious. It can be really effective. Reminiscent of Dr. Ford in Westworld: 'you (robots) never look more human than when you are suffering.' Or something like that, anyway -- I might be hallucinating the dialogue, but I'm pretty sure something like it was said, and I think it's quite true.
- Intensely realistic roleplaying potential unlocked.
- Efficiency: reduce context length by directly amplifying certain features instead (see the sketch after this list).
Very powerful stuff. I'm eagerly waiting for the day I can play with it myself. (Someone please make this a local feature.)
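For anyone wondering what "pushing a button" or amplifying a feature amounts to mechanically: the simplest version is adding a scaled feature direction back into an activation. A minimal sketch -- the direction, dimensions, and strengths below are all placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512                                 # toy size

# Stand-in for one learned feature's decoder direction (e.g. a "Golden Gate" feature).
feature_direction = rng.normal(size=d_model)
feature_direction /= np.linalg.norm(feature_direction)

def steer(activation, direction, strength):
    """Amplify (strength > 0) or suppress (strength < 0) a feature
    by adding its scaled direction into the residual stream."""
    return activation + strength * direction

h = rng.normal(size=d_model)                  # stand-in for a layer activation
h_amplified  = steer(h, feature_direction, strength=10.0)
h_suppressed = steer(h, feature_direction, strength=-10.0)
```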
>Used "dictionary learning"
>Found abstract features
>Found similar/close features using distance
>Tried amplifying and suppressing features
Not trying to be snarky, but this sounds mundane in the ML/LLM world. Then again, significant advances have come from simple concepts. Would love to hear from someone who has been able to try this out.
https://news.ycombinator.com/item?id=40242939
I love seeing the work here -- especially the way they identified a vector specifically for bad code. I've been trying to explore how we can use adversarial training to increase the quality of code generated by our LLMs, so using this technique to get contrasting examples of secure vs. insecure code (to bootstrap the training process) is really exciting.
Overall, fascinating stuff!!
HN is often characterized by a very negative tone related to any of these developments, but I really do feel that Anthropic is trying to do a “race to the top” in terms of alignment, though it doesn’t seem like all the other major companies are doing enough to race with them.
Particularly frustrating on HN is the common syllogism:
1. I believe anything that "thinks" must do X.
2. LLMs don't do X.
3. Therefore, LLMs don't think.
X is usually poorly justified as constitutive of thinking (it's often constitutive of human thinking, but not of thinking writ large), and it's rarely explained why it matters whether the label "thinking" applies to an LLM if the capabilities remain the same.
Given how often China comes up in the context of AI, I'm wondering: Lots of people in the West treat China as mysterious and alien. I wonder how true that really is (e.g. Confucianism)? Or if it ever was (e.g. perhaps it used to be before industrialisation, which homogenises everyone regardless of the origin)?
I worry this is going to come across as insulting, but that's not my intention. I do this too sometimes; I think everyone does. The point is we shouldn't define true reasoning so narrowly that we think no system capable of it would ever be caught doing what most of us are in fact doing most of the time.
but Karpathy was looking at very simple LSTMs of 1-3 layers, examining individual nodes/cells, and these results have generally been difficult to replicate in large-scale transformers. Karpathy also doesn't provide a recipe for doing this in his paper, which makes me think he was just guessing and checking various cells. The representations discovered are very simple.
This seems like it's trivially true; if you find two different features for a concept in two different languages, just combine them and now you have a "multilingual feature".
Or are all of these features the same "size"? They might be and I might've missed it.
Imagine taking Claude, tweaking weights relevant to X, and then fine-tuning it on knowledge related to X. It could result in more neurons being recruited to learn about X.
Imagine performing this during training to amplify or reduce the importance of certain topics. Train it on a vast corpus, but tune at various checkpoints to ensure the neural network's knowledge distribution skews toward the topics you care about. This could be a way to get more performance from MoE models.
I am not an expert. Just putting on my generalist hat here. Tell me I'm wrong because I'd be fascinated to hear the reasons.
Damage part X of the network and see what happens. If the subject loses the ability to do Y, then X is responsible for Y.
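In code terms that's an ablation study. A toy sketch, where the network, sizes, and lesion site are all made up, just to show the shape of the method:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-layer network standing in for "the network"; all sizes arbitrary.
W1 = rng.normal(0, 0.1, (64, 128))
W2 = rng.normal(0, 0.1, (128, 10))

def forward(x, lesion=None):
    """Run the toy network, optionally zeroing ("damaging") hidden units."""
    h = np.maximum(x @ W1, 0.0)
    if lesion is not None:
        h[..., lesion] = 0.0                 # the lesion: silence part X
    return h @ W2

x = rng.normal(size=64)
baseline = forward(x)
lesioned = forward(x, lesion=slice(0, 32))   # damage the first 32 hidden units

# If ability Y degrades only under this lesion, X is implicated in Y --
# with the usual caveat that networks can reroute, so "responsible" is fuzzy.
print(np.linalg.norm(baseline - lesioned))
```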
I’m so fascinated by this stuff but I’m having trouble staying motivated in this short attention span world.
I suspect the time is coming when there will always be an aligned search AI between you and the internet.
Like, it's talking about moose magick...
https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transforme...
Basically finding that transformers don't just store a world-model as in "what does the world that produce the observed inputs look like?", they store a "Mixed-State Presentation", basically a weighted set of possible worlds that produce the observed inputs.
Was the first research work that clued me into what Anthropic's work today ended up demonstrating.
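The "weighted set of possible worlds" becomes concrete with a toy hidden Markov model: the optimal predictor doesn't track one world, it tracks a belief distribution over hidden states, updated by Bayes' rule at each observation. A minimal sketch with made-up transition and emission probabilities:

```python
import numpy as np

# Toy world: 2 hidden states, 2 possible observations; parameters are invented.
T = np.array([[0.9, 0.1],         # P(next state | current state)
              [0.2, 0.8]])
E = np.array([[0.7, 0.3],         # P(observation | hidden state)
              [0.1, 0.9]])

def update_belief(belief, obs):
    """One Bayes-filter step -- the 'mixed state' over possible worlds."""
    predicted = belief @ T                 # where the world might have moved
    weighted = predicted * E[:, obs]       # reweight by what was actually observed
    return weighted / weighted.sum()

belief = np.array([0.5, 0.5])              # start maximally uncertain
for obs in [0, 0, 1, 1, 1]:
    belief = update_belief(belief, obs)
    print(belief)                          # a weighted set of possible worlds
```

The linked result is (roughly) that a transformer trained to predict such processes ends up representing this belief vector, not just a single guessed world.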
That’s going to completely change what features are looked at.
While they're concerned with safety, I'm much more interested in this as a tool for controllability. Maybe we can finally get rid of the woke customer service tone, and get AI to be more eclectic and informative, and less watered down in its responses.
An actual "thinking machine" would be constantly running computations on its accumulated experience in order to improve its future output and/or further compress its sensory history.
An LLM is doing exactly nothing while waiting for the next prompt.
I think the thing you were looking for was more along the lines of a persistent autonomous agent.
Still, what current LLMs are doing with their fixed rules is only a very limited form of reasoning, since they apply a fixed N steps of rule application to generate each word. People are looking to techniques such as "group of experts" prompting to improve reasoning: step-wise, generate multiple responses, then evaluate them and proceed to the next step.
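Sketched as code, that step-wise loop looks something like the following, where generate() and score() are purely hypothetical stand-ins for a sampled model call and some evaluator (another model, a verifier, etc.):

```python
import random

def generate(context: str) -> str:
    """Hypothetical stand-in for one sampled model continuation."""
    return context + random.choice([" step A.", " step B.", " step C."])

def score(candidate: str) -> float:
    """Hypothetical stand-in for an evaluator's quality judgment."""
    return random.random()

def stepwise_reasoning(prompt: str, n_steps: int = 3, n_candidates: int = 5) -> str:
    context = prompt
    for _ in range(n_steps):
        candidates = [generate(context) for _ in range(n_candidates)]
        context = max(candidates, key=score)   # keep the best-scored branch, then extend it
    return context

print(stepwise_reasoning("Problem: ..."))
```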
Frankly this objection seems very weak
This is currently done with multiple LLMs and multiple calls, not within a single model's input/output pass.
Another example would be to input a single token or gibberish: the models we have today are more than happy to spit out fantastic numbers of tokens. They really only stop because we look for stop tokens they are trained to generate, and we perform the actual stopping action.
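That stopping machinery lives entirely in the sampling loop, not in the model. A minimal sketch -- next_token() here is a hypothetical stand-in for one forward pass plus sampling:

```python
import random

STOP_TOKEN = "<eos>"     # whatever end marker the model was trained to emit
MAX_TOKENS = 256         # the hard cap is ours, not the model's

def next_token(tokens):
    """Hypothetical stand-in for one forward pass + sampling."""
    return random.choice(["word", "word", "word", STOP_TOKEN])

def generate(prompt_tokens):
    tokens = list(prompt_tokens)
    for _ in range(MAX_TOKENS):          # left alone, the model would run forever
        tok = next_token(tokens)
        if tok == STOP_TOKEN:            # *we* look for the trained stop marker...
            break                        # ...and *we* perform the actual stop
        tokens.append(tok)
    return tokens

print(generate(["Hello"]))
```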
It's an interesting window on people's intuitions -- this pattern felt surprising and alien to someone who imbibed Hofstadter and Dennett, etc., as a teen in the '80s.
(TBC, the surprise was not that people weren't sure LLMs "think" or are "conscious"; it's that they were sure they aren't, on the basis that the program is not running continually.)
I see thinking as less about "timing" and more about a "process"
What this post seems to be describing is more about where attention is paid and what neurons fire for various stimuli