
mratsim · 2y ago

The code is open-source

I have benches on i5-5257U (dual core from old MBP15), i9-9980XE (Skylake-X 18 cores), Dual Xeon Gold 6132, AMD 7840U.

See: https://github.com/mratsim/laser/blob/master/benchmarks%2Fge...

And using my own threadpool instead of OpenMP - https://github.com/mratsim/weave/issues/68#issuecomment-5692... - https://github.com/mratsim/weave/pull/94


bjourne · 2y ago

Can you explain how to build your project and how to run the benchmarks? I just spent a few hours disproving another poster's claim of getting OpenBLAS-like performance, and I don't want to waste more time (https://news.ycombinator.com/item?id=38867009). While I don't know Nim very well, I dare claim that you don't get anywhere near OpenBLAS performance.

mratsim (OP) · 2y ago

First, we can use Laser, which was my initial BLAS experiment in 2019. At the time in particular, OpenBLAS didn't properly use the AVX512 VPUs (see the BLIS thread: https://github.com/flame/blis/issues/352). It has made progress since then; still, on my current laptop, perf is in the same range.

Reproduction:

- Assuming x86 and preferably Linux.

- Install Nim

- Install a C compiler with OpenMP support (not the default macOS Clang)

- Install git

The repo includes MKL-DNN (now Intel oneDNN) as a git submodule, to bench against Intel's JIT compiler.

```
git clone https://github.com/mratsim/laser
cd laser
git submodule update --init --recursive
nim cpp -r --outdir:build -d:danger -d:openmp benchmarks/gemm/gemm_bench_float32.nim
```

This should output something like this:

```
Laser production implementation
Collected 10 samples in 0.230 seconds
Average time: 22.684 ms
Stddev  time: 0.596 ms
Min     time: 21.769 ms
Max     time: 23.603 ms
Perf:         624.037 GFLOP/s

OpenBLAS benchmark
Collected 10 samples in 0.216 seconds
Average time: 21.340 ms
Stddev  time: 3.334 ms
Min     time: 19.346 ms
Max     time: 27.502 ms
Perf:         663.359 GFLOP/s

MKL-DNN JIT AVX512 benchmark
Collected 10 samples in 0.201 seconds
Average time: 19.775 ms
Stddev  time: 8.262 ms
Min     time: 15.625 ms
Max     time: 43.237 ms
Perf:         715.855 GFLOP/s
```

Note: the theoretical peak limit is hardcoded for my previous machine, an i9-9980XE.
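For readers who want to sanity-check the "Perf" lines and the idea of a theoretical peak, here is a minimal Python sketch (not taken from the Laser repo). It uses the standard GEMM count of 2·M·N·K floating-point operations. The matrix size M = N = K = 1920 is an assumption: it is not stated in this thread, only inferred because it closely reproduces the printed GFLOP/s from the printed average times. The peak formula is the usual cores × clock × FMA units × SIMD lanes × 2; the 18 cores come from the post, while the 3.0 GHz clock, 2 FMA units per core and 16 float32 lanes are illustrative placeholders, not values read from the benchmark source.

```python
def gemm_gflops(m, n, k, seconds):
    """Achieved GFLOP/s for C = A * B with A (m x k) and B (k x n).

    Each of the m*n outputs needs k multiplies and k adds: 2*m*n*k flops.
    """
    return 2.0 * m * n * k / seconds / 1e9


def peak_gflops(cores, ghz, fma_units, simd_lanes):
    """Theoretical peak: one FMA counts as 2 flops per SIMD lane per cycle."""
    return cores * ghz * fma_units * simd_lanes * 2.0


# ASSUMPTION: square matrices of this size, inferred from the printed output.
SIZE = 1920

# Average times (ms) copied from the benchmark output in the post.
AVERAGE_MS = {
    "Laser": 22.684,
    "OpenBLAS": 21.340,
    "MKL-DNN JIT AVX512": 19.775,
}

for name, ms in AVERAGE_MS.items():
    print(f"{name}: {gemm_gflops(SIZE, SIZE, SIZE, ms / 1e3):.1f} GFLOP/s")

# ASSUMPTION: placeholder hardware parameters for an 18-core AVX512 CPU.
print(f"peak: {peak_gflops(cores=18, ghz=3.0, fma_units=2, simd_lanes=16):.0f} GFLOP/s")
```

Since the numbers above were measured on a different machine than the one the peak is hardcoded for, dividing one by the other would not give a meaningful utilization figure.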

It may be that your BLAS library is not named libopenblas.so; you can change that here: https://github.com/mratsim/laser/blob/master/benchmarks/thir...
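To find out which name your system actually uses before editing that file, something like the following Python sketch may help. It is not part of Laser; it only asks the standard library's `ctypes.util.find_library` what it can locate. The candidate names `openblas`, `blas` and `cblas` are guesses at common library names, so extend the tuple for your distribution.

```python
from ctypes.util import find_library


def locate_blas(candidates=("openblas", "blas", "cblas")):
    """Map each candidate name to what the system linker lookup finds.

    The value is a library name or path as a string, or None if not found.
    """
    return {name: find_library(name) for name in candidates}


for name, found in locate_blas().items():
    print(f"{name}: {found if found is not None else 'not found'}")
```

Whatever name is reported for your OpenBLAS install is the one to substitute for libopenblas.so.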

Implementation is in this folder: https://github.com/mratsim/laser/tree/master/laser/primitive...

In particular, tiling, cache and register optimization: https://github.com/mratsim/laser/blob/master/laser/primitive...

AVX512 code generator: https://github.com/mratsim/laser/blob/master/laser/primitive...

And the generic Scalar/SSE/AVX/AVX2/AVX512 microkernel generator (these are Nim macros that generate code at compile time): https://github.com/mratsim/laser/blob/master/laser/primitive...

I'll come back later with details on how to use my custom HPC threadpool Weave instead of OpenMP (https://github.com/mratsim/weave/tree/master/benchmarks/matm...). As a side bonus it also has parallel nqueens implemented.

bjourne · 2y ago

The compilation command errors out for me:

/home/bjourne/p/laser/benchmarks/gemm/gemm_bench_float32.nim(77, 8) Warning: use `std/os` instead; ospaths is deprecated [Deprecated]

/home/bjourne/p/laser/benchmarks/gemm/gemm_bench_float32.nim(101, 8) template/generic instantiation of `bench` from here

/home/bjourne/p/laser/benchmarks/gemm/gemm_bench_float32.nim(106, 21) template/generic instantiation of `gemm_nn_fallback` from here

/home/bjourne/p/laser/benchmarks/gemm/arraymancer/blas_l3_gemm.nim(85, 34) template/generic instantiation of `newBlasBuffer` from here

/home/bjourne/p/laser/benchmarks/gemm/arraymancer/blas_l3_gemm_data_structure.nim(30, 6) Error: signature for '=destroy' must be proc[T: object](x: var T) or proc[T: object](x: T)

Anyway, the reason for your competitive performance is likely that you are benchmarking with very small matrices. OpenBLAS spends some time preprocessing the tiles, which doesn't really pay off until the matrices become really huge.
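As a rough way to reason about this size argument, here is a simplified Python model (an editorial sketch, not a description of OpenBLAS internals). It assumes square N × N matrices and that each input operand is copied ("packed") exactly once, which is a lower bound for real blocked implementations; it then compares the elements copied with the 2·N³ flops of the multiplication. The size 1920 in the loop is an inference from the benchmark output quoted earlier in the thread (2·1920³ flops over 22.684 ms is about 624 GFLOP/s), not a figure stated by either poster.

```python
def packing_to_compute_ratio(n):
    """Elements copied during packing per floating-point operation.

    ASSUMPTION: A and B (both n x n) are each packed exactly once.
    """
    packed_elements = 2 * n * n  # one copy of A plus one copy of B
    flops = 2 * n ** 3           # n multiplies and n adds per output element
    return packed_elements / flops  # simplifies to 1 / n


for n in (64, 256, 1920, 8192):
    print(f"N={n:5d}: {packing_to_compute_ratio(n):.6f} packed elements per flop")
```

Under this model the relative cost of preprocessing shrinks as 1/N, so how much it matters at a given size depends on the constant factors of the actual implementation, which the model does not capture.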
