Scaling Latent Reasoning via Looped Language Models

(arxiv.org)

84 points · remexre · 5mo ago · 15 comments

12 comments · 3 top-level

kelseyfrog · 5mo ago · 4 in thread

If you squint, it's a fixed-iteration ODE solver. I'd love to see a generalization of this and the Universal Transformer mentioned, re-envisioned as flow-matching/optimal transport models.

kevmo314 · 5mo ago

How would flow matching work? In language we have inputs and outputs but it's not clear what the intermediate points are since it's a discrete space.

Etheryte · 5mo ago

One of the core ideas behind LLMs is that language is not treated as a discrete space, but as a continuous, multidimensional vector space where you can interpolate as needed. It's one of the reasons LLMs readily make up words that don't exist when translating text, for example.

kevmo314 · 5mo ago

The input and output are still discrete though, which is the important part for flow-matching modeling. Unless you're proposing flow matching over the latent space?
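One way to make "flow matching over the latent space" concrete: between two latent vectors the intermediate points do exist, even though they don't between two tokens. Below is a minimal sketch of the linear-path variant of flow matching on toy vectors. It is not from the paper, and every name in it (`interpolate`, `target_velocity`, `flow_matching_loss`) is made up for illustration.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t for a linear path."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(v_pred, x0, x1):
    """Mean squared error between a predicted velocity and the path velocity."""
    v = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v)) / len(v)

# Toy latents (assumed values): a "source" state and a "target" state.
x0 = [0.0, 0.0]
x1 = [2.0, 4.0]
midpoint = interpolate(x0, x1, 0.5)               # a well-defined intermediate point
perfect = flow_matching_loss([2.0, 4.0], x0, x1)  # a model predicting x1 - x0 exactly
```

A real model would be trained to predict the velocity from `interpolate(x0, x1, t)` and `t`; the open question in the thread is what `x0` and `x1` should be for language.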

cfcf14 · 5mo ago

This makes me think it would be nice to see some kinda child of the modern transformer architecture and neural ODEs. There was such interesting work a few years ago on how neural ODEs/PDEs could be seen as a sort of continuous limit of layer depth. Maybe models could learn cool stuff if the embeddings were somehow solutions of a dynamical model or something.
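A rough sketch of the ODE-solver reading, for anyone who wants to poke at it. This is an analogy only: it assumes the shared block acts as a residual update `x <- x + step * f(x)`, which is exactly a forward Euler step for dx/dt = f(x). The function `f` here is a hypothetical stand-in, not a transformer block.

```python
import math

def f(x):
    """Hypothetical stand-in for the shared block's residual branch.

    Uses dx/dt = -x, whose exact solution is x(0) * exp(-t).
    """
    return [-xi for xi in x]

def looped_forward(x, n_loops, step):
    """Apply the same residual update n_loops times (forward Euler)."""
    for _ in range(n_loops):
        x = [xi + step * fi for xi, fi in zip(x, f(x))]
    return x

# Integrate to t = 1 with 4 loops and with 64 loops of the same "layer".
coarse = looped_forward([1.0], n_loops=4, step=1 / 4)
fine = looped_forward([1.0], n_loops=64, step=1 / 64)
exact = math.exp(-1.0)
```

With more, smaller iterations the looped update tracks the continuous solution more closely, which is the "continuous limit of layer depth" idea in miniature.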

lukebechtel · 5mo ago · 3 in thread

so it's:

output = layers(layers(layers(layers(input))))

instead of the classical:

output = layer4(layer3(layer2(layer1(input))))

oofbey · 5mo ago

Yeah, if layers() is shorthand for layer4(layer3(layer2(layer1(input)))). But sometimes it’s only one of these:

output = layers(input)

output = layers(layers(input))

Depends on how difficult the token is.

remexre (OP) · 5mo ago

Or more like,

    x = tokenize(input)
    i = 0
    do {
      finish, x = layers(x)
    } while(!finish && i++ < t_max);
    output = lm_head(x)

oofbey · 5mo ago

That’s closer still. But even closer would be:

    x = tokenize(input)
    i = 0
    finish = 0
    do {
      p, x = layers(x)
      finish += p
    } while(finish < 0.95 && i++ < t_max);
    output = lm_head(x)

Except the accumulation of the stop probabilities isn’t linear like that - it’s more like a weighted coin model.
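For reference, here is a runnable Python sketch of that last point. It assumes "weighted coin" means each pass flips a coin that stops with probability `p`, so the cumulative stop probability is 1 minus the product of the per-step "keep going" probabilities rather than a plain sum. That reading, the toy `layers` function, and the names `run_with_early_exit` and `threshold` are all assumptions for illustration, not the paper's code.

```python
def run_with_early_exit(x, layers, t_max, threshold=0.95):
    """Loop a shared block until the cumulative stop probability passes threshold.

    layers(x) must return (p, new_x), where p is the stop probability
    for this pass. At most t_max passes are made.
    """
    survive = 1.0  # probability that no coin flip has said "stop" so far
    steps = 0
    for _ in range(t_max):
        p, x = layers(x)
        steps += 1
        survive *= 1.0 - p
        if 1.0 - survive >= threshold:
            break
    return x, steps

def toy_layers(x):
    """Hypothetical block: always reports p = 0.5 and increments the state."""
    return 0.5, x + 1

x_out, n_steps = run_with_early_exit(0, toy_layers, t_max=8)
x_capped, n_capped = run_with_early_exit(0, toy_layers, t_max=3)
```

With `p = 0.5` on every pass, a linear sum would cross 0.95 after two passes, while the coin-flip accumulation gives 0.5, 0.75, 0.875, 0.9375, 0.96875 and stops on the fifth.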

the8472 · 5mo ago · 2 in thread

Does the training process ensure that all the intermediate steps remain interpretable, even on larger models? That is, that we don't end up with some alien gibberish in all but the final step?

oofbey · 5mo ago

Training doesn’t encourage the intermediate steps to be interpretable. But they are still in the same token vocabulary space, so you could decode them. But they’ll probably be wrong.
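To illustrate what "you could decode them" might look like: project each intermediate state through the same output head used for the final step and read off the top token. The sketch below uses a toy two-dimensional state and a three-word vocabulary; `decode`, `unembed`, `vocab`, and the state values are all hypothetical, and nothing here reflects the paper's actual model.

```python
def decode(hidden, unembed, vocab):
    """Return the vocabulary item whose output row scores highest for hidden."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    best = max(range(len(logits)), key=logits.__getitem__)
    return vocab[best]

# Toy output head: one row of weights per vocabulary item (assumed values).
vocab = ["cat", "dog", "fish"]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

# Hypothetical hidden states after loop 1, loop 2, and the final loop.
states = [[0.9, 0.1], [0.4, 0.5], [0.2, 0.9]]
decoded = [decode(h, unembed, vocab) for h in states]
```

In this toy run the first intermediate state decodes to a different token than the final one, which is the "decodable but probably wrong" situation described above.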

the8472 · 5mo ago

Token vocabulary space is a hull around human communication (emoji, mathematical symbols, Unicode scripts, ...). Inside that hull there's lots of unused representation space that an AI could use to represent internal state. So this seems to be a bad idea from a safety/oversight perspective.

https://openai.com/index/chain-of-thought-monitoring/
