Benchmarking GEMM on Modern CPUs

Recently I became interested in understanding the actual performance of GEMM on modern CPUs. We all know that GEMM is the backbone of many critical applications, and it is hard to overstate how important it is to optimize it. That said, I had never had good first-hand experience benchmarking GEMM on modern CPUs, so here I am.

We know that Intel has the famous MKL library, which is (of course) mainly optimized for Intel’s CPUs, and there are open-source third-party libraries such as OpenBLAS (which comes bundled with NumPy if you install it using pip). But what are our choices on AMD architectures? It turns out AMD has its own optimized library: AOCL (AMD Optimizing CPU Libraries). Even better, AOCL is open source (unlike MKL). Let’s start our journey.

Set up AOCL

First we download the AOCL master installer from their website, and decompress it. Then we simply install it somewhere:

tar xvf aocl.xxx
cd aocl.xxx
./install.sh -t /opt/aocl

Benchmark GEMM

After installing AOCL, it is simple to write a GEMM kernel and measure its performance. Since I am interested in the peak performance, I run it many times and keep only the best result. Here is the source code for benchmarking GEMM:
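
What follows is a minimal sketch of such a benchmark, not necessarily the original gemm.c. It assumes square double-precision matrices and the standard CBLAS interface (cblas_dgemm), which BLIS provides when configured with --enable-cblas. The helper names (gemm, now_sec, best_gflops) and the USE_CBLAS and GEMM_BENCH_MAIN macros are hypothetical choices for this sketch. Without USE_CBLAS it falls back to a naive triple loop, so the harness also compiles on a machine with no BLAS installed.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#ifdef USE_CBLAS
#include <cblas.h>  /* installed by BLIS when configured with --enable-cblas */
#endif

/* C = A * B for square, row-major n x n matrices. */
static void gemm(int n, const double *A, const double *B, double *C) {
#ifdef USE_CBLAS
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
#else
    /* Naive fallback (assumption of this sketch) so the file builds without BLAS. */
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double acc = 0.0;
            for (int k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
    }
#endif
}

/* Monotonic wall-clock time in seconds. */
static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Runs GEMM `repeats` times and returns the best GFLOP/s observed.
 * Returns -1.0 if allocation fails or the result is wrong. */
static double best_gflops(int n, int repeats) {
    assert(n > 0 && repeats > 0);
    size_t elems = (size_t)n * (size_t)n;
    double *A = malloc(elems * sizeof *A);
    double *B = malloc(elems * sizeof *B);
    double *C = malloc(elems * sizeof *C);
    double best = -1.0;
    if (A && B && C) {
        for (size_t i = 0; i < elems; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }
        double flops = 2.0 * n * n * n;  /* one multiply and one add per inner step */
        for (int r = 0; r < repeats; r++) {
            double t0 = now_sec();
            gemm(n, A, B, C);
            double dt = now_sec() - t0;
            /* With A = 1 and B = 2, every entry of C must equal 2n. */
            if (C[elems - 1] != 2.0 * n) { best = -1.0; break; }
            if (dt > 0.0 && flops / dt * 1e-9 > best) best = flops / dt * 1e-9;
        }
    }
    free(A); free(B); free(C);
    return best;
}

#ifdef GEMM_BENCH_MAIN
int main(int argc, char **argv) {
    int n = argc > 1 ? atoi(argv[1]) : 1024;        /* matrix dimension */
    int repeats = argc > 2 ? atoi(argv[2]) : 10;    /* timed runs */
    if (n <= 0 || repeats <= 0) {
        fprintf(stderr, "usage: %s [n] [repeats]\n", argv[0]);
        return 1;
    }
    printf("n=%d best=%.2f GFLOP/s\n", n, best_gflops(n, repeats));
    return 0;
}
#endif
```

Compiled with -DGEMM_BENCH_MAIN alone, this times the naive loop. Adding -DUSE_CBLAS together with the include and link flags for your BLAS switches the timed call to the library’s dgemm. Reporting the best of several runs filters out warm-up effects and scheduling noise, which matches the goal of measuring peak rather than average performance.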

Build BLIS

Sometimes you may want to build BLIS manually, for example to get rid of dynamic dispatching so that you can simulate it in gem5. Here are the steps to build and link BLIS manually.

git clone git@github.com:amd/blis.git
cd blis
./configure --enable-cblas --enable-threading=openmp --prefix=/usr/local zen2
make install

Replace zen2 with another architecture, with auto to target the CPU of the build machine, or with a configuration family such as amdzen to enable dynamic dispatching at runtime. Remove --enable-threading to build a single-threaded version. The following command should build the benchmark and link it with BLIS.

gcc gemm.c -O3 -I/usr/local/include/blis -lblis-mt -lm -fopenmp -o gemm.exe