Benchmarking GEMM on Modern CPU
Recently I got interested in understanding the actual performance of GEMM on modern CPUs. GEMM is the backbone of many critical applications, and it is hard to overstate how important it is to optimize it. That said, I never had good first-hand experience benchmarking GEMM on modern CPUs, so here I am.
Intel has the well-known MKL library, which is mainly optimized for Intel's CPUs (of course), and there are open-source third-party libraries such as OpenBLAS (which comes with numpy if you install it using pip). But what are our choices for AMD architectures? It turns out AMD also has its own optimized library: AOCL (AMD Optimizing CPU Libraries). Even better, AOCL is open source (unlike MKL). Let's start our journey.
Set up AOCL
First, download the AOCL master installer from AMD's website and decompress it. Then simply install it somewhere:
tar xvf aocl.xxx
cd aocl.xxx
./install.sh -t /opt/aocl
Benchmark GEMM
After installing AOCL, it's very simple to write a GEMM kernel and measure its performance. Since I am interested in the maximal performance, I run it many times and only take the best result. The benchmark is a single C source file, gemm.c.
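A minimal sketch of such a best-of-N benchmark might look like the following. It is an illustration under stated assumptions, not necessarily the exact kernel used for the measurements: it assumes double precision (cblas_dgemm), row-major square-ish matrices, and the standard CBLAS header that BLIS installs when configured with --enable-cblas. The names now_seconds, gemm, bench_gemm and the USE_CBLAS macro are hypothetical. Without -DUSE_CBLAS it falls back to a naive triple loop, which is only there so the timing harness can be checked on a machine without a BLAS library.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#ifdef USE_CBLAS
#include <cblas.h>  /* CBLAS header, assumed to be on the include path */
#endif

/* Keeps the compiler from discarding the naive fallback's result. */
static volatile double sink;

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* C = A * B, row-major: A is m x k, B is k x n, C is m x n. */
static void gemm(int m, int n, int k,
                 const double *a, const double *b, double *c) {
#ifdef USE_CBLAS
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, a, k, b, n, 0.0, c, n);
#else
    /* Naive reference loop (assumption: used only without a BLAS). */
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p)
                acc += a[(size_t)i * k + p] * b[(size_t)p * n + j];
            c[(size_t)i * n + j] = acc;
        }
    }
#endif
}

/* Runs the GEMM `repeats` times and returns the best GFLOP/s observed.
 * Returns -1.0 if allocation fails, 0.0 if no positive time was measured. */
static double bench_gemm(int m, int n, int k, int repeats) {
    size_t na = (size_t)m * k, nb = (size_t)k * n, nc = (size_t)m * n;
    double *a = malloc(na * sizeof *a);
    double *b = malloc(nb * sizeof *b);
    double *c = malloc(nc * sizeof *c);
    if (!a || !b || !c) {
        free(a); free(b); free(c);
        return -1.0;
    }

    /* Deterministic, non-trivial input values. */
    for (size_t i = 0; i < na; ++i) a[i] = (double)(i % 7) * 0.5;
    for (size_t i = 0; i < nb; ++i) b[i] = (double)(i % 5) - 2.0;

    double best = -1.0;  /* best (smallest) time in seconds so far */
    for (int r = 0; r < repeats; ++r) {
        double t0 = now_seconds();
        gemm(m, n, k, a, b, c);
        double dt = now_seconds() - t0;
        sink = c[nc - 1];
        if (dt > 0.0 && (best < 0.0 || dt < best)) best = dt;
    }

    free(a); free(b); free(c);
    if (best <= 0.0) return 0.0;

    /* A GEMM performs one multiply and one add per (i, j, p) triple. */
    double flops = 2.0 * m * n * k;
    return flops / best / 1e9;
}
```

To turn this into a standalone gemm.c, add a main function that calls bench_gemm with the sizes of interest, for example bench_gemm(2048, 2048, 2048, 20), and prints the returned GFLOP/s with printf. Taking the minimum time over all repeats, rather than the mean, filters out warm-up effects and scheduling noise, which matches the goal of measuring peak performance.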
Build BLIS
Sometimes you may want to build BLIS manually, for example to get rid of dynamic dispatching so that the binary can be simulated in gem5. Here are the simple steps to build and link BLIS manually.
git clone git@github.com:amd/blis.git
cd blis
./configure --enable-cblas --enable-threading=openmp --prefix=/usr/local zen2
make install
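Replace zen2 with another architecture, or with auto to let configure detect the CPU of the build machine. To enable dynamic dispatching at runtime, use a configuration family such as amdzen or x86_64 instead of a single architecture. Remove --enable-threading to build a single-threaded version. The following command should build the kernel and link it with BLIS.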
gcc gemm.c -O3 -I/usr/local/include/blis -lblis-mt -lm -fopenmp -o gemm.exe