Modelling long sequences has always been hard for transformer-based models. This paper (Shortformer) proposes a clever way for the transformer to cache and reuse previously processed token representations: position embeddings are added to the queries and keys at attention time instead of to the token embeddings at the input, so the cached representations are position-independent and can be attended to from later subsequences. The authors report generation up to 9x faster. Really impressive work!
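Here's a minimal numpy sketch of the caching idea as I understand it (position-infused attention); the variable names and random embeddings are my own illustration, not the paper's code:

```python
import numpy as np

def scaled_dot_attention(q, k, v):
    """Plain scaled dot-product attention; returns outputs and weights."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v, w

rng = np.random.default_rng(0)
d = 8
tok = rng.normal(size=(6, d))   # hypothetical token representations
pos = rng.normal(size=(6, d))   # hypothetical position embeddings

# Subsequence 1: tokens 0..2. Positions are infused into queries and
# keys only; the values stay position-free.
q1 = tok[:3] + pos[:3]
k1 = tok[:3] + pos[:3]
out1, w1 = scaled_dot_attention(q1, k1, tok[:3])

# Cache the position-free representations, not the position-infused keys.
cache = tok[:3]

# Subsequence 2: tokens 3..5 attend over the cache plus themselves.
# Positions are re-added at attention time, which is what makes the
# cached representations reusable at their new relative offsets.
ctx = np.concatenate([cache, tok[3:6]])
q2 = tok[3:6] + pos[3:6]
k2 = ctx + pos[:6]
out2, w2 = scaled_dot_attention(q2, k2, ctx)
```

Because nothing position-specific is baked into the cached vectors, the second subsequence can reuse them directly instead of recomputing them, which is where the generation speedup comes from.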
Paper
https://arxiv.org/abs/2012.15832
Code
https://github.com/ofirpress/shortformer