I have benches on i5-5257U (dual core from old MBP15), i9-9980XE (Skylake-X 18 cores), Dual Xeon Gold 6132, AMD 7840U.
See: https://github.com/mratsim/laser/blob/master/benchmarks%2Fge...
And using my own threadpool instead of OpenMP - https://github.com/mratsim/weave/issues/68#issuecomment-5692... - https://github.com/mratsim/weave/pull/94
Reproduction:
- Assuming x86 and preferably Linux.
- Install Nim
- Install a C compiler with OpenMP support (not the default macOS Clang)
- Install git
The repo includes MKL-DNN (now Intel oneDNN) as a git submodule, to bench against Intel's JIT compiler.
```
git clone https://github.com/mratsim/laser
cd laser
git submodule update --init --recursive
nim cpp -r --outdir:build -d:danger -d:openmp benchmarks/gemm/gemm_bench_float32.nim
```
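Before running the commands above, a quick sanity check of the prerequisites can save a failed build. This is only a sketch: the compiler names (`gcc`, `g++`, `clang`) and the list of x86 machine strings are my assumptions, and it only checks that the tools are on the PATH. It does not verify OpenMP support.

```python
import platform
import shutil


def missing_prerequisites(tools=("nim", "git"),
                          compilers=("gcc", "g++", "clang")):
    """Return a list of problems; an empty list means nothing obvious is missing."""
    problems = []
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    # Any one compiler is enough to proceed. OpenMP support is NOT verified here.
    if not any(shutil.which(c) for c in compilers):
        problems.append("no C/C++ compiler found on PATH")
    # ASSUMPTION: these are the machine strings reported by x86 systems.
    if platform.machine().lower() not in ("x86_64", "amd64", "i386", "i686", "x86"):
        problems.append(f"non-x86 machine: {platform.machine()}")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("warning:", problem)
```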
This should output something like this:
```
Laser production implementation
Collected 10 samples in 0.230 seconds
Average time: 22.684 ms
Stddev time: 0.596 ms
Min time: 21.769 ms
Max time: 23.603 ms
Perf: 624.037 GFLOP/s
OpenBLAS benchmark
Collected 10 samples in 0.216 seconds
Average time: 21.340 ms
Stddev time: 3.334 ms
Min time: 19.346 ms
Max time: 27.502 ms
Perf: 663.359 GFLOP/s
MKL-DNN JIT AVX512 benchmark
Collected 10 samples in 0.201 seconds
Average time: 19.775 ms
Stddev time: 8.262 ms
Min time: 15.625 ms
Max time: 43.237 ms
Perf: 715.855 GFLOP/s
```
Note: the theoretical peak limit is hardcoded for my previous machine, an i9-9980XE, so ignore it on other CPUs.
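For reference, here is a rough sketch of how a `Perf` figure and a theoretical peak are usually derived, so you can plug in your own CPU. The matrix size of 1920 is an assumption: it is inferred from the sample output (it reproduces the 624 GFLOP/s figure), not read from the benchmark source. The i9-9980XE numbers (18 cores, 3.0 GHz, 64 FP32 FLOPs per cycle per core) are also assumptions, and the value hardcoded in the repo may differ.

```python
def gemm_gflops(m, n, k, seconds):
    """Achieved GFLOP/s, counting one multiply and one add per inner-product term."""
    return 2.0 * m * n * k / seconds / 1e9


def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak: cores x clock (GHz) x FLOPs per cycle per core."""
    return cores * ghz * flops_per_cycle


# ASSUMPTION: square 1920 x 1920 FP32 matrices, inferred from the sample output.
SIZE = 1920
achieved = gemm_gflops(SIZE, SIZE, SIZE, 22.684e-3)  # Laser average time

# ASSUMPTIONS: 18 cores at 3.0 GHz; 2 AVX-512 FMA units x 16 FP32 lanes
# x 2 FLOPs per FMA = 64 FLOPs per cycle per core. Real AVX-512 clocks vary.
peak = peak_gflops(cores=18, ghz=3.0, flops_per_cycle=64)

print(f"achieved: {achieved:.1f} GFLOP/s")
print(f"assumed i9-9980XE peak: {peak:.1f} GFLOP/s")
```

Replace the three arguments to `peak_gflops` with your own CPU's values before comparing against your results.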
It may be that your BLAS library is not named libopenblas.so; you can change that here: https://github.com/mratsim/laser/blob/master/benchmarks/thir...
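To see what the library is called on your system, something like the following sketch can help. It relies on Python's standard `ctypes.util.find_library`. The candidate names `openblas` and `blas` are assumptions, since distributions package OpenBLAS under different names.

```python
import ctypes.util


def find_blas(candidates=("openblas", "blas")):
    """Return (candidate, resolved_name) for the first library found, else None."""
    for name in candidates:
        # find_library takes the name without the "lib" prefix or the suffix
        # and returns e.g. a versioned file name on Linux, or None if not found.
        resolved = ctypes.util.find_library(name)
        if resolved:
            return name, resolved
    return None


if __name__ == "__main__":
    print(find_blas())
```

If this prints a name other than `libopenblas.so`, that is the name to use in the file linked above.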
Implementation is in this folder: https://github.com/mratsim/laser/tree/master/laser/primitive...
In particular, the tiling, cache and register optimizations: https://github.com/mratsim/laser/blob/master/laser/primitive...
AVX512 code generator: https://github.com/mratsim/laser/blob/master/laser/primitive...
And the generic Scalar/SSE/AVX/AVX2/AVX512 microkernel generator (these are Nim macros that generate code at compile time): https://github.com/mratsim/laser/blob/master/laser/primitive...
I'll come back later with details on how to use my custom HPC threadpool Weave instead of OpenMP (https://github.com/mratsim/weave/tree/master/benchmarks/matm...). As a side bonus it also has parallel nqueens implemented.
/home/bjourne/p/laser/benchmarks/gemm/gemm_bench_float32.nim(77, 8) Warning: use `std/os` instead; ospaths is deprecated [Deprecated]
/home/bjourne/p/laser/benchmarks/gemm/gemm_bench_float32.nim(101, 8) template/generic instantiation of `bench` from here
/home/bjourne/p/laser/benchmarks/gemm/gemm_bench_float32.nim(106, 21) template/generic instantiation of `gemm_nn_fallback` from here
/home/bjourne/p/laser/benchmarks/gemm/arraymancer/blas_l3_gemm.nim(85, 34) template/generic instantiation of `newBlasBuffer` from here
/home/bjourne/p/laser/benchmarks/gemm/arraymancer/blas_l3_gemm_data_structure.nim(30, 6) Error: signature for '=destroy' must be proc[T: object](x: var T) or proc[T: object](x: T)
Anyway, the reason for your competitive performance is likely that you are benchmarking with very small matrices. OpenBLAS spends some time preprocessing the tiles, which doesn't really pay off until the matrices become really huge.
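The amortization argument can be illustrated with a deliberately crude model. A square n x n x n GEMM does about 2n^3 FLOPs over 3n^2 matrix elements, so any per-element preprocessing cost is spread over roughly 2n/3 FLOPs. This is a simplifying assumption that counts each matrix element once. It is not a measurement of OpenBLAS or a description of its actual packing scheme.

```python
def flops_per_element(n):
    """FLOPs per matrix element touched for an n x n x n GEMM: 2n^3 / 3n^2."""
    return (2 * n**3) / (3 * n**2)


if __name__ == "__main__":
    # The larger the ratio, the more arithmetic there is to hide any
    # per-element preprocessing behind.
    for n in (64, 256, 1920, 8192):
        print(f"n={n:5d}  flops per element = {flops_per_element(n):8.1f}")
```

In this model the ratio grows linearly with n, which is why a fixed per-element cost matters less as the matrices grow.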