I'm always glad to see more multi-page work in VLM-based OCR. Especially single-pass. One of the few other recent multi-page papers, MinerU-Popo, treats fixing up multi-page outputs as a post-processing correction step (https://arxiv.org/abs/2605.24973).
Interesting to see the drop-off in quality as you increase the page count, though.
I also think the attention approach (always attend to the image/prefix, with a sliding window for local context) is neat!
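For anyone trying to picture it, here is a rough sketch of how I read that mask. It is not taken from the paper: the function name, the toy sizes, and the choice to keep everything causal (including inside the prefix) are my own assumptions.

```python
def prefix_sliding_window_mask(seq_len, prefix_len, window):
    """Return mask[i][j] == True when query position i may attend to key j.

    Assumptions (mine, not the paper's): positions 0..prefix_len-1 hold the
    image/prefix tokens, attention is causal everywhere, and `window` counts
    the current token plus the tokens immediately before it.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i            # never look at future tokens
            in_prefix = j < prefix_len  # image/prefix is always visible
            in_window = i - j < window  # recent tokens for local context
            row.append(causal and (in_prefix or in_window))
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Toy sizes: 2 prefix tokens, window of 3, 8 tokens total.
    for row in prefix_sliding_window_mask(seq_len=8, prefix_len=2, window=3):
        print("".join("x" if allowed else "." for allowed in row))
```

The appeal is that each generated token sees at most `prefix_len + window` keys, so the per-token attention cost stops growing with output length while the page image stays in view.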
I do wish they had updated their comparison table to include more recent work that scores marginally better on OmniDocBench, like dots.mocr.