I think I under-specified my requirement. :-) By "16 bytes at a time," I mean, "run a single CPU instruction on those 16 bytes."
But yeah, I get your drift. I can see how it might be theoretically possible. I suppose the key gains might be in how much confidence a programmer can have that their code compiles down to the right set of instructions.
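To make that distinction concrete, here is a rough sketch in C (my own illustration, not anything from the thread). The function names `add16_explicit` and `add16_scalar` are made up, and I'm assuming an x86 target with SSE2; on anything else the first function just falls back to a plain loop. The explicit version asks for a 16-byte add via the `_mm_add_epi8` intrinsic, which normally becomes a single PADDB (or VPADDB) instruction. The scalar version only gets that treatment if the optimizer decides to vectorize it.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics (assumes an x86 target) */
#endif

/* Hypothetical helper: add two 16-byte blocks lane by lane, wrapping on
 * overflow. With SSE2 available, the add itself is requested explicitly
 * as one 128-bit operation. */
static void add16_explicit(const uint8_t *a, const uint8_t *b, uint8_t *out) {
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a); /* unaligned 16-byte load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
#else
    /* Fallback for non-SSE2 targets: same result, no instruction guarantee. */
    for (size_t i = 0; i < 16; i++) {
        out[i] = (uint8_t)(a[i] + b[i]);
    }
#endif
}

/* Hypothetical helper: the same computation written as an ordinary loop.
 * Whether this becomes one SIMD instruction is up to the compiler and the
 * optimization flags in use. */
static void add16_scalar(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    for (size_t i = 0; i < 16; i++) {
        out[i] = (uint8_t)(a[i] + b[i]);
    }
}
```

Both functions return the same bytes; the difference is only in how much you have to trust the optimizer. Checking the generated assembly (for example with `-S`) is still the only way to be sure what either one compiled to.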