The problem could be some kind of attention instability as the context scales above 10k tokens. A recent paper suggests the attention mechanism needs a default place to put its weight (a "sink"), and that the absence of one produces instability.
https://arxiv.org/abs/2309.17453
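To make the "sink" idea concrete, here is a toy sketch of my own, not code from the paper. It assumes the usual softmax attention, where the weights must sum to 1 even when no key is relevant to the query. The helper `softmax_with_sink` and the `sink_logit` parameter are hypothetical names used only for illustration.

```python
import math

def softmax(scores):
    """Standard softmax: the weights always sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_with_sink(scores, sink_logit=0.0):
    """Softmax with one extra "sink" slot (hypothetical helper).

    Returns (weights over the real tokens, weight absorbed by the sink).
    """
    weights = softmax([sink_logit] + list(scores))
    return weights[1:], weights[0]

# A query that matches none of the keys: every score is strongly negative.
scores = [-8.0, -9.0, -8.5]

plain = softmax(scores)
real, sink = softmax_with_sink(scores)

# Plain softmax still hands out a total weight of 1 across irrelevant tokens.
print(sum(plain))
# With a sink slot, the real tokens receive almost nothing and the sink takes the rest.
print(sum(real), sink)
```

If I read the paper correctly, a model without an explicit sink learns to use the first few tokens for this purpose, so dropping them from the cache is what hurts.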
Another paper reports that information in the middle of a long context is recalled poorly, while the beginning and end are attended to much better.