And the "how" behind Octavian.jl is basically LoopVectorization.jl [1], which helps make optimal use of your CPU's SIMD instructions.
Currently there can be some nontrivial compilation latency with this approach, but since LV ultimately emits custom LLVM, it's actually perfectly compatible with StaticCompiler.jl [2] following Mason's rewrite, so stay tuned on that front.
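For a rough sense of what the LoopVectorization side looks like, here's a minimal sketch of a plain triple-loop matmul annotated with `@turbo`. The function name `mygemm!` and the matrix sizes are just placeholders for illustration, and this is not Octavian's actual implementation, only the kind of loop LV is designed to vectorize. It assumes LoopVectorization.jl is installed.

```julia
using LoopVectorization  # assumes the package has been added to your environment

# Hypothetical example function: naive matrix multiply, C = A * B.
# @turbo hands the loop nest to LoopVectorization, which picks a loop order,
# unrolls, and emits SIMD code for it.
function mygemm!(C, A, B)
    @turbo for m in axes(A, 1), n in axes(B, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end

# Arbitrary sizes, chosen only for the demo.
A = rand(64, 32)
B = rand(32, 48)
C = zeros(64, 48)
mygemm!(C, A, B)
```

The first call to `mygemm!` is where you'd notice the compilation latency mentioned above; subsequent calls with the same argument types reuse the compiled code.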
[1] https://github.com/JuliaSIMD/LoopVectorization.jl
[2] https://github.com/tshort/StaticCompiler.jl