Yeah, I've considered picking it up. Glancing at Fortran code is a bit off-putting, though.
You aren't going to beat my code by much, let alone by a factor of two. I did want to see what was possible by doing everything on the CPU. Your point about portability stands, but of course assembler is perfectly portable (you'd just have to adjust the initial moves according to ABI differences). However, if Fortran does come close to my code's performance, it would save a huge amount of coding time.