Bring AVX Support to Gem5
TL;DR
gem5-avx brings partial AVX2 and AVX-512 support to gem5. Feel free to use it for your research at your own risk. There is a short guide on how to add new instructions at the end of this post. Issues and pull requests are welcome!
Motivation
During my research this year, I had to implement AVX2 and AVX-512 in gem5, as I could not find a good implementation online. Eventually I had a partial implementation that works for the workloads I looked into. Luckily, the paper is being published at this year’s HPCA, and I have already released the code for that project: gem-forge.
There are official plans to add full AVX support to gem5, but that work is not done yet. While we wait for it, I think it would help others if I separated out my implementation and merged it onto the latest develop branch. That became this project: gem5-avx!
Features
Please note that this is only partial support, as IMO it would take an experienced Ph.D. student at least a year to fully implement everything. Specifically:
- Not all instructions are implemented. I implemented only the instructions I came across during my research. This amounts to roughly 150 unique instructions (not counting variants of different vector width, data type, reg/memory operands), including common arithmetic and data movement instructions (e.g. vaddps, vmovups, vinsertps, etc.). I will have a short guide on how to implement new instructions at the end of this post.
- Mask and broadcast are not implemented. AVX-512 introduces mask registers and a broadcast bit in the prefix. In my research I rarely encountered them, so I just dropped them.
- Not thoroughly tested. I tested it with a few microbenchmarks as well as the benchmarks for my research, but not every instruction and every variant of it. The repository includes one test case of tiled dense matrix-vector multiplication (see avx-test). However, I think there are certainly bugs in the code (issues and pull requests are welcome!).
- Instructions to microops. How instructions are broken into microops is proprietary. Here I take an idealized approach: most arithmetic and load/store instructions are broken into a single microop. This may not be very accurate. Please keep this in mind if you use this for research.
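For readers unsure whether the missing mask and broadcast support matters for their workload, here is a minimal Python sketch of what those two AVX-512 features mean architecturally. It is not gem5-avx code, the function names are made up for illustration, and it says nothing about how gem5-avx behaves when it meets such an encoding.

```python
def masked_add(dst, a, b, mask, zeroing=False):
    """Lane-wise add under an AVX-512 style write mask.

    Bit i of `mask` selects whether lane i is written. Unselected lanes
    keep the old destination value (merge-masking) or become zero
    (zero-masking).
    """
    out = []
    for i, (d, x, y) in enumerate(zip(dst, a, b)):
        if (mask >> i) & 1:
            out.append(x + y)   # lane enabled: normal result
        elif zeroing:
            out.append(0)       # lane disabled, zero-masking
        else:
            out.append(d)       # lane disabled, merge-masking
    return out


def broadcast_add(a, scalar):
    """Embedded broadcast: a single memory element is used for every lane."""
    return [x + scalar for x in a]


old = [9, 9, 9, 9]
print(masked_add(old, [1, 2, 3, 4], [10, 20, 30, 40], 0b0101))
print(masked_add(old, [1, 2, 3, 4], [10, 20, 30, 40], 0b0101, zeroing=True))
print(broadcast_add([1, 2, 3, 4], 5))
```

If your binaries were compiled with flags that make the compiler emit masked or broadcast forms, it is worth checking the disassembly before relying on the results.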
Adding New Instructions
There are many examples in src/arch/x86/isa/insts/simd512.
Assuming you are familiar with gem5’s decoder, adding new instructions generally involves these steps:
- Microops: If the new instruction cannot be implemented using existing microops, you can define new ones in src/arch/x86/isa/microops/avxop.isa. However, if there is a scalar version of the new instruction, it’s likely there is an existing scalar implementation. You can reuse those microops so that you don’t have to define new ones.
- Macroops: Now you can define how the new instruction (macroop) is broken into microops. Take a look at the examples in src/arch/x86/isa/insts/simd512.
- Encoding: Next, you can teach gem5 how to decode the new instruction. There are two types of prefixes, vex and evex, as well as 3 types of opcodes: two-byte, three-byte-0F38 and three-byte-0F3A. So for example, vaddps has two-byte opcode 0x58 and comes with both vex and evex prefixes. Therefore, you can find its encoding defined in src/arch/x86/isa/decoder/two_bytes_opcodes_{vex|evex}.isa.
- ModRM and Immediate: The x86 decoder in gem5 uses two tables to check whether the instruction requires a ModRM byte and an immediate byte. Make sure you define that in src/arch/x86/isa/decoder_tables.cc.
- Compressed Displacement: AVX-512 introduces a new addressing mode called “compressed displacement” (see “Compressed Disp8*N Encoding” in Intel’s manual). Make sure you define it correctly in src/arch/x86/isa/decompress_displacement.cc.
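To figure out which decoder file a new instruction belongs in and how its 8-bit displacement is scaled, it can help to pick apart a few encodings by hand first. The Python sketch below does that for the Encoding and Compressed Displacement steps. It is a standalone illustration, not code from gem5-avx: the function names are hypothetical, it assumes 64-bit mode, and it assumes the vex/evex prefix is the first byte (no legacy prefixes in front). The scale factor N depends on the instruction's tuple type, vector length and broadcast bit as listed in Intel's manual, so here the caller simply passes it in.

```python
# Opcode map selector -> the three opcode types named in the steps above.
OPCODE_MAPS = {1: "two-byte", 2: "three-byte-0F38", 3: "three-byte-0F3A"}


def classify(code: bytes):
    """Return (prefix kind, opcode type, opcode byte) for one instruction."""
    first = code[0]
    if first == 0xC5:
        # 2-byte VEX prefix: the opcode map is implicitly 0F.
        return "vex", OPCODE_MAPS[1], code[2]
    if first == 0xC4:
        # 3-byte VEX prefix: map selector is the low 5 bits of the next byte.
        return "vex", OPCODE_MAPS.get(code[1] & 0x1F, "other"), code[3]
    if first == 0x62:
        # 4-byte EVEX prefix: map selector is in the low bits of the next byte.
        return "evex", OPCODE_MAPS.get(code[1] & 0x07, "other"), code[4]
    raise ValueError("not a VEX/EVEX-prefixed instruction")


def compressed_disp(disp8: int, n: int) -> int:
    """Disp8*N: sign-extend the 8-bit displacement, then scale by N bytes."""
    signed = disp8 - 0x100 if disp8 & 0x80 else disp8
    return signed * n


# vaddps ymm0, ymm1, ymm2 (vex) and vaddps zmm0, zmm1, zmm2 (evex)
print(classify(bytes.fromhex("c5f458c2")))
print(classify(bytes.fromhex("62f1744858c2")))
# vinsertps xmm0, xmm1, xmm2, 0
print(classify(bytes.fromhex("c4e37121c200")))
# A full 512-bit memory operand without broadcast uses N = 64.
print(compressed_disp(0x01, 64), compressed_disp(0xFF, 64))
```

Both vaddps encodings report the two-byte opcode 0x58, matching the example in the Encoding step, while vinsertps lands in the three-byte-0F3A map and also carries an immediate byte, which is the kind of detail the ModRM and immediate tables need to know about.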
That’s it. I hope this post is helpful.