
0 points · johndough · 5mo ago · 0 comments

All the naysayers here clearly have no idea. Your large matrix multiplication implementation is quite impressive! I set up a benchmark loop and let GPT-5.1-Codex-Max experiment for a bit (not 5.2/Opus/Gemini, because they are broken in Copilot), but it seems to be missing something crucial. With a bit of encouragement, it has implemented:

    - padding from 2000 to 2048 for easier power-of-two splitting
    - two-level Winograd matrix multiplication with tiled matmul for last level
    - unrolled AVX2 kernel for 64x64 submatrices
    - 64-byte-aligned memory
    - restrict keyword for pointers
    - better compiler flags (clang -Ofast -march=native -funroll-loops -std=c++17)
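A few of the items above (padding to a power of two, 64-byte-aligned buffers, restrict-qualified pointers, and a tiled multiply for the last level) could look roughly like the sketch below. This is not the benchmarked code from either implementation. The function names and the tile size of 64 are assumptions made for illustration, `__restrict` is the GCC/Clang/MSVC extension (C++ has no standard `restrict` keyword), and `std::aligned_alloc` is C++17 and not available on every platform.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: round n up to the next power of two (2000 -> 2048).
std::size_t next_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Hypothetical helper: allocate a zero-filled p x p float matrix on a
// 64-byte boundary. std::aligned_alloc needs a size that is a multiple
// of the alignment, so the byte count is rounded up. Free with std::free.
float* alloc_matrix(std::size_t p) {
    std::size_t bytes = p * p * sizeof(float);
    bytes = (bytes + 63) / 64 * 64;
    void* mem = std::aligned_alloc(64, bytes);
    if (mem == nullptr) return nullptr;
    assert(reinterpret_cast<std::uintptr_t>(mem) % 64 == 0);
    std::memset(mem, 0, bytes);
    return static_cast<float*>(mem);
}

// Copy a row-major n x n matrix into the top-left corner of a zeroed
// row-major p x p matrix. The zero padding does not change the product.
void pad_copy(const float* src, std::size_t n, float* dst, std::size_t p) {
    for (std::size_t i = 0; i < n; ++i)
        std::memcpy(dst + i * p, src + i * n, n * sizeof(float));
}

// C += A * B for row-major p x p matrices, processed in 64 x 64 tiles.
// __restrict promises the compiler that the three buffers do not overlap.
void matmul_tiled(const float* __restrict a, const float* __restrict b,
                  float* __restrict c, std::size_t p) {
    constexpr std::size_t T = 64;  // assumed tile size
    for (std::size_t ii = 0; ii < p; ii += T)
        for (std::size_t kk = 0; kk < p; kk += T)
            for (std::size_t jj = 0; jj < p; jj += T)
                for (std::size_t i = ii; i < std::min(ii + T, p); ++i)
                    for (std::size_t k = kk; k < std::min(kk + T, p); ++k) {
                        const float aik = a[i * p + k];
                        // Innermost loop walks contiguous memory in b and c.
                        for (std::size_t j = jj; j < std::min(jj + T, p); ++j)
                            c[i * p + j] += aik * b[k * p + j];
                    }
}
```

The i-k-j loop order keeps the innermost loop contiguous, which is what lets the compiler vectorize it under `-march=native`. A hand-unrolled AVX2 kernel or a Winograd/Strassen-style recursion would replace the inner tile loops, which this sketch does not attempt.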

But yours is still easily 25% faster. Would you be willing to write a bit about how you set up your evaluation and which tricks Claude used to solve it?



josu · 5mo ago

Thank you. Yeah, I'm doing all those things, which do get you close to the top. The rest of the things I'm doing are mostly micro-optimizations, such as finding a way to avoid the AVX→SSE transition penalty (a 1-2% improvement).
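For readers unfamiliar with the AVX→SSE transition penalty: the comment does not say how it was avoided here, so the following is only one common approach, not necessarily the one used. On some x86 CPUs, running legacy-encoded SSE instructions while the upper halves of the YMM registers are dirty costs extra cycles, and clearing the upper halves with `_mm256_zeroupper()` after a block of 256-bit code avoids that. The function name `sum_floats` is hypothetical, and the scalar fallback exists only so the sketch compiles without AVX enabled.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical example: sum an array with 256-bit AVX adds, then clear the
// upper YMM halves before returning to code that may use legacy SSE.
float sum_floats(const float* data, std::size_t n) {
    float total = 0.0f;
    std::size_t i = 0;
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));
    // Spill the eight partial sums and add them up in scalar code.
    alignas(32) float lanes[8];
    _mm256_store_ps(lanes, acc);
    for (float v : lanes) total += v;
    // Zero the upper 128 bits of all YMM registers so that later
    // non-VEX SSE code does not hit the transition penalty.
    _mm256_zeroupper();
#endif
    // Scalar tail (or the whole array when AVX is not enabled).
    for (; i < n; ++i) total += data[i];
    return total;
}
```

Compilers often insert `vzeroupper` on their own when a translation unit is built with AVX enabled, so an explicit call mostly matters around hand-written assembly or calls into libraries built without VEX encoding.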

But I don't want to spoil the fun. The agents are really good at searching the web now, so posting the tricks here is basically breaking the challenge.

For example, ChatGPT was able to find Matt's blog post regarding Task 1, and that's what gave me the largest jump: https://blog.mattstuchlik.com/2024/07/12/summing-integers-fa...

Interestingly, it seems that Matt's post is not in the training data of any of the major LLMs.
