Couldn't you organize the accumulators in 8-byte chunks and leave the upper byte unused? Then you map consecutive digits to those chunks and use 64-bit addition for the accumulation. Overflow between the bytes would still give the correct result if you do the shuffles correctly, and you have a full byte of overflow buffer.
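To make the carry argument concrete, here is a minimal sketch in plain Python, with a 64-bit mask standing in for a machine register. It only illustrates the addition step, not the shuffles, and the names (`pack`, `accumulate`, `weighted_total`) and the 7-lane layout are my own assumptions rather than anything from the original code.

```python
MASK64 = (1 << 64) - 1
LANES = 7  # assumption: low 7 bytes carry data, byte 7 (the top byte) is headroom


def pack(lanes):
    """Pack 7 byte-sized values into one 64-bit word, lane 0 in the low byte."""
    assert len(lanes) == LANES and all(0 <= v <= 0xFF for v in lanes)
    word = 0
    for i, v in enumerate(lanes):
        word |= v << (8 * i)
    return word


def accumulate(rows):
    """Add the packed words with ordinary 64-bit addition (wrapping like uint64)."""
    acc = 0
    for row in rows:
        acc = (acc + pack(row)) & MASK64
    return acc


def weighted_total(rows):
    """Reference: sum every lane on its own, then weight lane i by 256**i."""
    return sum(sum(row[i] for row in rows) << (8 * i) for i in range(LANES))


# Lane 0 sums to 300, which does not fit in a byte, so it carries into lane 1.
rows = [[200, 1, 0, 0, 0, 0, 0], [100, 2, 0, 0, 0, 0, 0]]
acc = accumulate(rows)
print(acc & 0xFF, (acc >> 8) & 0xFF, acc == weighted_total(rows))
```

The carry out of a lane lands in the next higher byte, so the 64-bit value still equals the lane sums weighted by powers of 256, even though the individual lane totals are no longer readable byte by byte. With the top byte empty, each packed word is below 2^56, so up to 256 of them can be added before the 64-bit sum itself can wrap. Whether that weighted value is the one you want depends on how the digits were shuffled into the lanes.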