It’s just really impractical to use a licensed programming language in 2025.
Possibly rose-tinted glasses on my part, but I’m optimistic for 2026. Chris Lattner has a pretty strong track record of getting these things right.
I can't say for sure because I couldn't find the CUDA kernel, but I kind of doubt this is true. You can hit memory bandwidth on Hopper without using TMA at all; TMA is mostly designed for accelerating asynchronous copies and reducing memory pressure. If all you are doing is a transpose, you don't need any of this to go fast (though it might simplify your indexing code…?)
Isn't it better to simply combine the transposition with whatever next operation one wishes to do with the matrix?
Great write up! I learned a lot!
Gee, for the polar decomposition, Gauss-Seidel, etc., I looked really hard for those on my IBM PC/XT and couldn't find any!
You have global memory and shared memory; global memory is slower.
You read rows from global memory (faster than reading columns).
You write columns into shared memory (slower than writing rows, but shared memory is fast; this is the transpose step).
You read rows from shared memory (very fast).
You write rows to global memory (faster than writing columns).
The idea behind the tiling is to hide the slow part in the memory that is faster.
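The four steps above can be sketched on the CPU with a cache-blocked transpose. This is an analogy, not the article's kernel: the small `tile` buffer plays the role of shared memory, and the tile size and names are my own.

```python
# Tiled transpose: global -> tile (the transpose happens here) -> global.
# The small tile stands in for fast shared memory; reads and writes to the
# large arrays stay row-contiguous, and the strided (column) access is
# confined to the small, fast tile.
TILE = 32

def transpose_tiled(src, n):
    dst = [[0] * n for _ in range(n)]
    for bi in range(0, n, TILE):
        for bj in range(0, n, TILE):
            # Step 1+2: read rows from "global" src, write columns into the tile.
            tile = [[0] * TILE for _ in range(TILE)]
            for i in range(TILE):
                for j in range(TILE):
                    tile[j][i] = src[bi + i][bj + j]
            # Step 3+4: read rows from the tile, write rows back to "global" dst.
            for j in range(TILE):
                for i in range(TILE):
                    dst[bj + j][bi + i] = tile[j][i]
    return dst
```

On a GPU the same structure holds, except each tile is handled by a thread block and the tile lives in shared memory.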
Also, the improvement is about 0.15%, not 14%, making the editorialized linkbait particularly egregious.
transpose_naive - Basic implementation with TMA transfers
transpose_swizzle - Adds swizzling optimization for better memory access patterns
transpose_swizzle_batched - Adds thread coarsening (batch processing) on top of swizzling
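For the swizzle step, the usual trick (my sketch of the general idea, not necessarily the article's exact formula) is to XOR the row index into the column index, so that reading a column of a 32-wide shared-memory tile touches 32 different banks instead of hammering one:

```python
BANKS = 32  # number of shared-memory banks on NVIDIA GPUs

def swizzle(row, col):
    # XOR-swizzled column: permutes each row's layout so that walking
    # down a column hits every bank exactly once (no bank conflicts).
    return col ^ (row % BANKS)

# Without swizzling, column 0 of a 32x32 tile maps every row to bank 0;
# with swizzling, the 32 accesses land in 32 distinct banks.
banks_naive = {0 % BANKS for _ in range(32)}
banks_swizzled = {swizzle(row, 0) % BANKS for row in range(32)}
```

Because XOR with a constant is a bijection, each row still stores all 32 columns, just in a permuted order, so no padding (and no wasted shared memory) is needed.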
Performance comparison with CUDA: The Mojo implementations achieve bandwidths of:
transpose_naive: 1056.08 GB/s (32.0025% of max)
transpose_swizzle: 1437.55 GB/s (43.5622% of max)
transpose_swizzle_batched: 2775.49 GB/s (84.1056% of max)
via GitHub: simveit/efficient_transpose_mojo
Comparing to the CUDA implementations mentioned in the article:
Naive kernel: Mojo achieves 1056.08 GB/s vs CUDA's 875.46 GB/s
Swizzle kernel: Mojo achieves 1437.55 GB/s vs CUDA's 1251.76 GB/s
Batched swizzle kernel: Mojo achieves 2775.49 GB/s vs CUDA's 2771.35 GB/s
So there is a highly efficient matrix transpose in Mojo.
All three Mojo kernels outperform their CUDA counterparts, with the naive and swizzle kernels showing significant improvements (20.6% and 14.8% faster respectively), while the final optimized kernel achieves essentially identical performance (slightly better by 4.14 GB/s).
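The percentages follow directly from the bandwidth numbers quoted above:

```python
def speedup(mojo_gbs, cuda_gbs):
    # Relative improvement of the Mojo kernel over the CUDA one, in percent.
    return (mojo_gbs - cuda_gbs) / cuda_gbs * 100

naive = speedup(1056.08, 875.46)     # ~20.6% faster
swizzle = speedup(1437.55, 1251.76)  # ~14.8% faster
batched = speedup(2775.49, 2771.35)  # ~0.15%, essentially identical
```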
The "flag" here seemed inappropriate given that it's true this implementation is indeed faster, and the final iteration could certainly be improved on further. It wasn't wrong to say 14% or even 20%.
Email the mods at hn@ycombinator.com. There's a chance they'll remove the flag and re-up the post.
> "From the moment I understood the weakness of my flesh, it disgusted me. I craved the strength and certainty of steel."
14% all the time vs 35% some of the time
edit: Closing numbers are far less impressive than those buried in the middle of the post. Confusing; bye everyone