DenseFormer: Enhancing Information Flow in Transformers

(arxiv.org)

123 points · tipsytoad · 2y ago · 33 comments

p1esk · 2y ago · 6 in thread

This method has only been tested on tiny models (<1B parameters) and a tiny dataset (17B tokens). It’s not clear whether it scales.

ml_basics · 2y ago

To be fair to the authors, they are affiliated with a university and not a big industrial lab, so they may be working with significantly constrained resources. I'm not sure what the best solution is here, given that this affects almost everyone outside of a very select few.

p1esk · 2y ago

They could partner with big industrial labs.

Buttons840 · 2y ago

If a genie appeared and granted one wish, I would wish that we find an extremely powerful machine learning technique that doesn't scale. Imagine if an average desktop computer were almost as good as a billion-dollar supercomputer.

In other words, I don't really care if it scales. I almost hope it doesn't.

p1esk · 2y ago

Not sure I understand what you mean by “doesn’t scale”. Are you trying to say you would like to see a tiny model performing as well as a large model?

MacsHeadroom · 2y ago

Even pocket computers (smartphones) are already better than the billion-dollar supercomputers of decades past.

What is your point?

jal278 · 2y ago

But it may scale -- that's science in progress

valine · 2y ago · 4 in thread

The architecture changes are very straightforward. Model merging has shown that pre-trained transformer layers are very robust. I’ll bet it’s possible to fine-tune a pre-trained model like Mistral to use this architecture. That would enable someone to test it with more parameters without training a whole new base model.

numeri · 2y ago

They try this in the appendix without success, unfortunately. It seems having this enabled early on in training is important.

matteopagli · 2y ago

We're still working on training the DWA weights on top of a pretrained model. We're hopeful that this is feasible. The experiments you're mentioning in the appendix do not change the learning rate scheduler. For example, when we start training the DWA weights after 20k iterations, the learning rate is already quite small. To some extent, this might explain the diminishing returns. Maybe this could work with a completely different learning rate scheduler.

bilsbie · 2y ago

I haven’t been able to make sense of model merging. Any insights?

Wouldn’t weights between models be completely different? And then there are architecture differences on top of that.

valine · 2y ago

Model merging is usually done with different fine-tunes of the same model. It doesn’t work if the base models are different.

One of the more surprising things is that you can actually repeat layers to improve model performance, i.e., 1-1-2-2 instead of 1-2. That’s how you get models with higher parameter counts than the original.
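
A minimal sketch of the two ideas in the comment above, under loud assumptions: the comment does not say how the merge is computed, so plain linear interpolation of weights is used here as one common choice, and the checkpoints are hypothetical toy dictionaries rather than real model files. `merge_finetunes` and `repeat_layers` are illustrative names, not functions from any merging library.

```python
# Toy "checkpoints": layer name -> flat list of weights (hypothetical data).

def merge_finetunes(a, b, t=0.5):
    """Linearly interpolate two fine-tunes of the SAME base model.

    Assumption: simple weight averaging; real merging tools offer other methods.
    """
    if a.keys() != b.keys():
        # Different layer names imply different architectures or base models.
        raise ValueError("checkpoints must have identical layer names")
    merged = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch in {name}")
        merged[name] = [(1 - t) * x + t * y for x, y in zip(a[name], b[name])]
    return merged


def repeat_layers(layer_names, times=2):
    """Turn [1, 2] into [1, 1, 2, 2]: each layer appears `times` in a row."""
    return [name for name in layer_names for _ in range(times)]


finetune_a = {"layer1": [0.0, 2.0], "layer2": [1.0, 1.0]}
finetune_b = {"layer1": [2.0, 4.0], "layer2": [3.0, -1.0]}

merged = merge_finetunes(finetune_a, finetune_b)
stacked = repeat_layers(["layer1", "layer2"])
```

The key check is that both checkpoints expose the same layer names and shapes, which is the toy version of "same base model". Averaging position by position only makes sense when the weights share a common starting point.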

ml_basics · 2y ago · 3 in thread

Cool paper. Really interesting to see that even quite straightforward architectural modifications haven't all been exhausted yet, despite all the resources being poured into LLMs.

samus · 2y ago

The problem is that they have to be tested for 7B models at least to show promise for larger models. And that requires significant compute resources.

tbalsam · 2y ago

Due to some of my personal experiences over the years w/ model development, I believe that this is more due to a failure of the current mainline version of Transformers (the ++ version I believe) not scaling properly, vs an indicator of scale.

If that is the case, then it may well be possible to fix some of the scaling issues more apparent with smaller transformer models (maybe not, though). This is at least some of the reasoning that I've been applying when developing hlb-gpt, for example. It's partially also why I think changing how we use nonlinearities within the network might impact scaling, due to some of the activation spikes used in more linear regions of the network to control network behavior in a way not originally intended.

Agreed that it does require a ton of resources though. But I do think that the problem can be solved on a smaller scale. If we don't have a cleanly logarithmic curve, then I think that something is dearly wrong with our base architecture. (However, of course, I may entirely be missing something here).

quotemstr · 2y ago

I wonder whether we're missing out on techniques that work well on large models but that don't show promise on small ones

matteopagli · 2y ago · 1 in thread

I'm one of the authors, happy to answer questions.

EvkoGS · 2y ago

Is it possible to combine your approach with NATTEN? It seems that the two approaches optimize from different directions and could be combined for a significant throughput gain and a small performance improvement?

aoeusnth1 · 2y ago · 1 in thread

> Impact statement:

> This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

I found this particularly charming.

polygamous_bat · 2y ago

AFAIK this was the default, copy-paste impact statement from the ICML template.

tbalsam · 2y ago

This is a very interesting idea. With DenseNets there are oftentimes some terrible memory gotchas that have gotten me over the past 7-8 years or so, so a part of me is sorta leaning back, waiting for some memory-usage shoe to drop that isn't specified in the paper (even with the activation patterns!).

However, maybe this is not the case. I have a bit of a history of messing with residuals in neural networks, so seeing more work on it is good. Fast-training networks are of course a very slightly mild obsession of mine as well, and very useful to the field. Here's hoping it pans out as a motif; curious to see where it goes.

sp332 · 2y ago

Even better is the result on page 7 that perplexity drops faster by wall-clock time. Even if you're getting fewer iterations per hour of rented GPU time, you're still coming out ahead in model performance.

microtonal · 2y ago

Nice finding and makes a lot of sense! It is somewhat related to classification heads using their own weighted representation of all transformer layer outputs.

I only glanced at the paper, but they don't seem to softmax α_i for normalization?
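
To make the question above concrete, here is a small sketch that contrasts raw weights with softmax-normalized weights in a depth-weighted average. It is not the authors' implementation: the one-scalar-per-earlier-representation layout, the identity-style initialization, and the names `depth_weighted_average`, `history`, and `identity_alphas` are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def depth_weighted_average(history, alphas, normalize=False):
    """Combine every representation so far using one scalar weight each.

    history: list of vectors (embedding output, then each block's output).
    alphas:  one scalar per entry of history (assumed layout, see lead-in).
    """
    if len(history) != len(alphas):
        raise ValueError("need exactly one weight per stored representation")
    weights = softmax(alphas) if normalize else alphas
    dim = len(history[0])
    return [sum(w * h[d] for w, h in zip(weights, history)) for d in range(dim)]


# Hypothetical activations: embedding, block 1 output, block 2 output.
history = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# Assumed identity-style init: all weight on the newest block output.
identity_alphas = [0.0, 0.0, 1.0]

raw = depth_weighted_average(history, identity_alphas)
normed = depth_weighted_average(history, identity_alphas, normalize=True)
```

With raw weights, this initialization returns the latest block output unchanged, so the network starts out behaving like a plain transformer, and the weights are free to become negative later. With softmax, every weight is strictly positive and the weights sum to 1, so the same initialization already mixes in earlier representations. Whether that trade-off motivated the paper's choice is not stated in the thread.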

zwaps · 2y ago

1. They compare with an older, fairly standard implementation of a transformer. Unsure whether the results would be equally significant compared to models with gated units, multi-query attention, etc.

2. The difference seems to diminish with scale. Real-life transformers are obviously much larger and train on many more tokens.

3. A very significant part of training transformer models is the throughput and memory optimizations. I wonder how their model would work with such fused kernels or specialized paged KV cache schemes. Or activation checkpointing, if run locally.

4. Indeed they claim no memory impact, but their code shows that their experiments are conducted with a special optimized version which requires all activations to reside in a single tensor at all times. Not sure this would work with 3d parallelism on multiple nodes etc.

efrank3 · 2y ago

Can't believe nobody thought of this yet
