TL;DR: A loop that carries any pure scalar state can be strip-mined across p
threads by having each thread privately replay ≤ p(p-1)/2 “warm-up” updates
before its first public iteration. It needs no closed-form skip-ahead and no
speculation, and costs only a few extra machine instructions in code-gen.
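
To make the replay idea concrete, here is a minimal Python sketch. It is an illustration under assumptions, not necessarily the scheme the bound above refers to. It assumes a cyclic distribution (worker t owns iterations t, t+p, t+2p, ...), and the names `step`, `worker`, and `parallel` are hypothetical. The sketch also replays the p-1 skipped updates between a worker's own iterations, which the summary does not mention. In this particular layout worker t replays t warm-up updates, so the warm-ups sum to p(p-1)/2 across all workers.

```python
from concurrent.futures import ThreadPoolExecutor


def step(s):
    # Hypothetical pure scalar update (a small LCG) standing in for the
    # loop-carried state; it depends only on the previous value of s.
    return (1103515245 * s + 12345) % (2**31)


def sequential(s0, n):
    # Reference loop: update the state, then do the "public" work.
    out, s = [], s0
    for i in range(n):
        s = step(s)
        out.append(s ^ i)
    return out


def worker(t, p, s0, n, out):
    s = s0
    for _ in range(t):            # private warm-up: replay t updates
        s = step(s)
    for i in range(t, n, p):      # this worker's own iterations
        s = step(s)
        out[i] = s ^ i            # each worker writes disjoint slots
        for _ in range(p - 1):    # replay the updates owned by other workers
            s = step(s)
    return t                      # number of warm-up updates performed


def parallel(s0, n, p):
    out = [None] * n
    with ThreadPoolExecutor(max_workers=p) as pool:
        warmups = list(pool.map(lambda t: worker(t, p, s0, n, out), range(p)))
    return out, warmups


if __name__ == "__main__":
    result, warmups = parallel(42, 37, 4)
    print(result == sequential(42, 37), sum(warmups))
```

The state updates are recomputed redundantly by every worker; only the per-iteration body is divided among them, so the approach pays off when the update is cheap relative to the body.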