All the naysayers here clearly have no idea. Your large matrix multiplication implementation is quite impressive! I set up a benchmark loop and let GPT-5.1-Codex-Max experiment for a bit (not 5.2/Opus/Gemini, because they are broken in Copilot), but it seems to be missing something crucial. With a bit of encouragement, it implemented:
- padding from 2000 to 2048 for easier power-of-two splitting
- two-level Winograd matrix multiplication with a tiled matmul for the last level
- unrolled AVX2 kernel for 64x64 submatrices
- 64-byte-aligned memory
- restrict keyword for pointers
- better compiler flags (clang -Ofast -march=native -funroll-loops -std=c++17)
But yours is still easily 25% faster. Would you be willing to write a bit about how you set up your evaluation and which tricks Claude used to solve it?
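
In case it helps to compare notes, the padding, alignment, and restrict items above amount to roughly the following. This is a minimal sketch, not the actual benchmark code: the names (`round_up`, `alloc_padded`, `copy_into_padded`), the `float` element type, and the block size of 256 are assumptions for illustration. Note that C++ has no standard `restrict`; the sketch uses the `__restrict` compiler extension that clang, GCC, and MSVC accept.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <new>

// Round n up to the next multiple of `block` (block must be a power of two).
constexpr std::size_t round_up(std::size_t n, std::size_t block) {
    return (n + block - 1) & ~(block - 1);
}

// Memory from std::aligned_alloc must be released with std::free.
struct FreeDeleter {
    void operator()(float* p) const { std::free(p); }
};
using AlignedBuf = std::unique_ptr<float[], FreeDeleter>;

// Allocate a zero-filled, 64-byte-aligned, padded x padded float matrix.
AlignedBuf alloc_padded(std::size_t padded) {
    // std::aligned_alloc expects the size to be a multiple of the alignment.
    const std::size_t bytes = round_up(padded * padded * sizeof(float), 64);
    void* p = std::aligned_alloc(64, bytes);
    if (p == nullptr) throw std::bad_alloc();
    assert(reinterpret_cast<std::uintptr_t>(p) % 64 == 0);
    std::memset(p, 0, bytes);
    return AlignedBuf(static_cast<float*>(p));
}

// Copy an n x n row-major matrix into the top-left corner of a padded x padded
// row-major buffer. __restrict promises the compiler that src and dst do not alias.
void copy_into_padded(const float* __restrict src, float* __restrict dst,
                      std::size_t n, std::size_t padded) {
    assert(n <= padded);
    for (std::size_t r = 0; r < n; ++r) {
        std::memcpy(dst + r * padded, src + r * n, n * sizeof(float));
    }
}
```

With a block size of 256 (assumed here so that two halvings still leave a multiple of the 64x64 kernel), `round_up(2000, 256)` gives 2048. The zero padding does not change the top-left 2000x2000 block of the product, so the result can be copied back out the same way.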