Bring AVX Support to gem5

TL;DR

gem5-avx brings partial AVX2 and AVX-512 support to gem5. Feel free to use it for your research at your own risk. There is a short guide on how to add new instructions at the end of this post. Issues and pull requests are welcome!

Motivation

During my research this year, I had to implement AVX2 and AVX-512 in gem5, as I could not find a good implementation online. Eventually I had a partial implementation that works for the workloads I looked into. Luckily, the paper is being published at this year’s HPCA, and I have already released the code for that project: gem-forge.

There are official plans to add full AVX support to gem5, but that work is not done yet. While we are waiting for it, I think it would be helpful to others if I separate out my implementation and merge it into the latest develop branch. That became this project: gem5-avx!

Features

Please note that this is only partial support, as IMO it would take an experienced Ph.D. student at least a year to fully implement everything. Specifically:

  • Not all instructions are implemented. I implemented only the instructions I came across during my research. This amounts to roughly 150 unique instructions (not counting variants of different vector width, data type, or reg/memory operands), including common arithmetic and data movement instructions (e.g. vaddps, vmovups, vinsertps). There is a short guide on how to implement new instructions at the end of this post.
  • Mask and broadcast are not implemented. AVX-512 introduces mask registers and a broadcast bit in the prefix. I rarely encountered them in my research, so I simply dropped them.
  • Not thoroughly tested. I tested it with a few microbenchmarks as well as the benchmarks for my research, but not with every instruction and every variant of it. The repository includes one test case of tiled dense matrix-vector multiplication (see avx-test). However, I think there are certainly bugs in the code (issues and pull requests are welcome!).
  • Instructions to microops. How instructions are broken into microops is proprietary. Here I take an idealized approach: most arithmetic and load/store instructions are broken into a single microop. This may not be very accurate. Please keep this in mind if you use this for research.
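
To make the second limitation concrete, here is a minimal Python sketch of what AVX-512 write masks and broadcast do at the architectural level, as I understand them from Intel's manual. It is not gem5-avx code: the function names (masked_add, broadcast) and the list-based register model are hypothetical and only for illustration.

```python
from typing import List


def masked_add(dst: List[float], a: List[float], b: List[float],
               mask: int, zeroing: bool = False) -> List[float]:
    """Lane-wise add under an AVX-512-style write mask (illustrative only).

    Bit i of `mask` enables lane i. A disabled lane keeps the old
    destination value (merge-masking) or becomes zero (zero-masking).
    """
    out = []
    for i, (x, y) in enumerate(zip(a, b)):
        if (mask >> i) & 1:
            out.append(x + y)
        elif zeroing:
            out.append(0.0)
        else:
            out.append(dst[i])
    return out


def broadcast(scalar: float, lanes: int) -> List[float]:
    """Replicate one memory element to every lane (embedded broadcast)."""
    return [scalar] * lanes


dst = [9.0, 9.0, 9.0, 9.0]
a = [1.0, 2.0, 3.0, 4.0]
b = broadcast(10.0, 4)

merged = masked_add(dst, a, b, mask=0b0101)                # lanes 0 and 2 enabled
zeroed = masked_add(dst, a, b, mask=0b0101, zeroing=True)  # disabled lanes become 0
unmasked = masked_add(dst, a, b, mask=0b1111)              # all lanes enabled
```

Since gem5-avx implements neither feature, a binary that relies on masked or broadcast operands may not be simulated faithfully, so it is worth checking the disassembly of your workload first.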

Adding New Instructions

There are many examples in src/arch/x86/isa/insts/simd512. Assuming you are familiar with gem5’s decoder, adding new instructions generally involves these steps:

  1. Microops: If the new instruction cannot be implemented using existing microops, you can define new ones in src/arch/x86/isa/microops/avxop.isa. However, if there is a scalar version of the new instruction, it’s likely that a scalar implementation already exists. You can reuse its microops so that you don’t have to define new ones.
  2. Macroops: Now you can define how the new instruction (macroop) is broken into microops. Take a look at examples in src/arch/x86/isa/insts/simd512.
  3. Encoding: Finally, you can teach gem5 how to decode the new instruction. There are two types of prefixes, VEX and EVEX, as well as three types of opcodes: two-byte, three-byte-0F38 and three-byte-0F3A. For example, vaddps has the two-byte opcode 0x58 and comes with both VEX and EVEX prefixes. Therefore, you can find its encoding defined in src/arch/x86/isa/decoder/two_bytes_opcodes_{vex|evex}.isa.
  4. ModRM and Immediate: The x86 decoder in gem5 uses two tables to check whether an instruction requires a ModRM byte and an immediate byte. Make sure you define both in src/arch/x86/isa/decoder_tables.cc.
  5. Compressed Displacement: AVX-512 introduces a new addressing mode called “compressed displacement” (see “Compressed Disp8*N Encoding” in Intel’s manual). Make sure you define it correctly in src/arch/x86/isa/decompress_displacement.cc.
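
To illustrate steps 3 and 5, here is a small Python sketch of how the prefix kind, the opcode map, and a compressed displacement can be worked out from raw bytes. It is based on my reading of Intel's manual and is not how gem5's decoder is written. The function names (classify_prefix, compressed_disp) are hypothetical, and the sketch assumes 64-bit mode with no legacy or REX prefix in front of the VEX/EVEX prefix.

```python
# Opcode map field values shared by VEX (mmmmm) and EVEX (mmm).
OPCODE_MAPS = {1: "two-byte", 2: "three-byte-0F38", 3: "three-byte-0F3A"}


def classify_prefix(code: bytes):
    """Return (prefix kind, opcode map, opcode byte) for one instruction."""
    first = code[0]
    if first == 0xC5:
        # 2-byte VEX prefix: the two-byte (0F) opcode map is implied.
        return "vex", "two-byte", code[2]
    if first == 0xC4:
        # 3-byte VEX prefix: opcode map is in the low 5 bits of the next byte.
        map_id, opcode = code[1] & 0x1F, code[3]
        kind = "vex"
    elif first == 0x62:
        # EVEX prefix (4 bytes): opcode map is in the low 3 bits of the next byte.
        map_id, opcode = code[1] & 0x07, code[4]
        kind = "evex"
    else:
        raise ValueError("not a VEX/EVEX-encoded instruction")
    if map_id not in OPCODE_MAPS:
        raise ValueError(f"opcode map {map_id} is not covered by this sketch")
    return kind, OPCODE_MAPS[map_id], opcode


def compressed_disp(disp8: int, n: int) -> int:
    """Scale a sign-extended 8-bit displacement by N (the disp8*N rule)."""
    signed = disp8 - 0x100 if disp8 & 0x80 else disp8
    return signed * n


# vaddps ymm0, ymm1, ymm2 (VEX) and vaddps zmm0, zmm1, zmm2 (EVEX).
vex_vaddps = classify_prefix(bytes.fromhex("c5f458c2"))
evex_vaddps = classify_prefix(bytes.fromhex("62f1744858c2"))
# vpshufb xmm0, xmm1, xmm2 uses the three-byte-0F38 map.
vex_vpshufb = classify_prefix(bytes.fromhex("c4e27100c2"))

# For a full 512-bit memory operand without broadcast, N is 64 bytes,
# so an encoded disp8 of 0x01 means an actual displacement of 64.
disp = compressed_disp(0x01, 64)
```

Both vaddps encodings resolve to opcode 0x58 in the two-byte map, which matches the two files mentioned in step 3. The value of N depends on the instruction's tuple type, the vector length and the broadcast bit, so the 64 used here only covers the full-vector case.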

That’s it. I hope this post is helpful.