0 points · jbarrow · 13h ago · 0 comments

I'm always glad to see more multi-page work in VLM-based OCR, especially single-pass. One of the few other recent multi-page papers, MinerU-Popo, treats fixing up multi-page outputs as a post-processing correction step (https://arxiv.org/abs/2605.24973). Interesting to see the drop-off in quality as the page count goes up, though.

I also think the attention approach (always attend to the image/prefix, with a sliding window for local context) is neat!
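
For anyone who wants to picture that attention pattern, here is a minimal sketch of the mask, not the paper's implementation. The function name, the parameters `prefix_len` and `window`, and the choice to keep everything causal (including inside the prefix) are all assumptions made for illustration.

```python
def prefix_sliding_window_mask(seq_len, prefix_len, window):
    """Build a boolean attention mask (hypothetical helper, not from the paper).

    mask[q][k] is True when query position q may attend to key position k.
    Assumptions: attention is causal everywhere, the first `prefix_len`
    positions hold the image/prefix tokens, and `window` counts the current
    token plus the tokens immediately before it.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q            # never look at future tokens
            in_prefix = k < prefix_len # image/prefix is always visible
            in_window = q - k < window # recent local context
            row.append(causal and (in_prefix or in_window))
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print the mask as a grid: '#' = may attend, '.' = masked out.
    for row in prefix_sliding_window_mask(seq_len=8, prefix_len=3, window=2):
        print("".join("#" if allowed else "." for allowed in row))
```

With this layout, the number of keys each late token attends to is bounded by `prefix_len + window` rather than growing with the full output length, which is presumably the appeal for long multi-page outputs.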

I do wish they had updated their comparison table to include more recent work that scores marginally better on OmniDocBench, like dots.mocr.

1 comment · 1 top-level

vrc · 12h ago

What are your thoughts on detector --> VLM pipelines? And do you think there's ever a world where a small LM or an LM-augmented detector can be efficient enough to play a role as a router? I ask because I recognize you from your handle and am very familiar with your work in the doc+detector space.
