Extend RISC-V ISA with Matrix Multiply

Many computer architecture students will eventually add some new (useless) instructions to an ISA, and there are many great tutorials on how to do this. This is yet another set of notes, on my struggle to extend RISC-V with a new matrix multiply instruction. The overall goal is to add a new functional unit to a general-purpose processor that performs a 4x4 matrix multiply-accumulate on single-precision floating-point values.

The Manual explains in detail which ISA encodings are reserved for such non-standard extensions (see ch.21). Basically, any instruction whose low seven bits (the opcode field, written here from bit 6 down to bit 0) are 00 010 11 or 01 010 11 is reserved.
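As a rough illustration of that rule, the check below tests whether a 32-bit instruction word falls in one of the two reserved opcode spaces. The helper names is_custom_opcode and custom_funct3 are made up for this note and are not part of any RISC-V tool; the values 0x0b and 0x2b are just the two bit patterns above written in hex.

```c
#include <assert.h>
#include <stdint.h>

/* The two reserved opcode values from the text, in hex. */
enum {
    OPCODE_CUSTOM_A = 0x0b, /* 00 010 11 */
    OPCODE_CUSTOM_B = 0x2b  /* 01 010 11 */
};

/* Hypothetical helper: returns 1 if bits 6..0 of insn select one of the
 * two reserved opcode spaces, 0 otherwise. */
int is_custom_opcode(uint32_t insn)
{
    uint32_t opcode = insn & 0x7f; /* bits 6..0 */
    return opcode == OPCODE_CUSTOM_A || opcode == OPCODE_CUSTOM_B;
}

/* Hypothetical helper: extracts bits 14..12 (funct3) of a custom
 * instruction, which is free for us to use as a sub-opcode. */
unsigned custom_funct3(uint32_t insn)
{
    assert(is_custom_opcode(insn));
    return (insn >> 12) & 0x7;
}
```

An ordinary integer instruction such as an R-type add (opcode 0x33) is rejected by this check, while any word ending in 0x0b or 0x2b passes.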

The matrix is 64 bytes (16 single-precision floating-point elements) and cannot be held in a normal register, so we also need to introduce new registers as well as special load and store instructions.

Define New Instructions.

First, let’s set up riscv-tools and riscv-gnu-toolchain. We then define our new instructions in riscv-tools/riscv-opcodes/opcodes:

matrix_multiply rd rs1 rs2     31..25=0 14..12=0  6..5=0 4..2=2 1..0=3 # R-type
matrix_ld       rd rs1 imm12            14..12=1  6..5=0 4..2=2 1..0=3 # I-type
matrix_st       imm12hi rs1 rs2 imm12lo 14..12=2  6..5=0 4..2=2 1..0=3 # S-type

Notice that bits 6..0 are set to 00 010 11, which marks these as non-standard extension instructions. Rebuild the repo.

> ./build.sh

You can find something new in riscv-tools/riscv-isa-sim/riscv/encoding.h:

#define MATCH_MATRIX_MULTIPLY 0xb
#define MASK_MATRIX_MULTIPLY  0xfe00707f
#define MATCH_MATRIX_LD 0x100b
#define MASK_MATRIX_LD  0x707f
#define MATCH_MATRIX_ST 0x200b
#define MASK_MATRIX_ST  0x707f
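These constants follow mechanically from the field constraints in the opcodes file: the mask has a 1 in every bit that a line pins down, and the match holds the values those bits must take. The sketch below rebuilds them by hand and shows the (insn & MASK) == MATCH test a decoder applies. The type field_t, the two tables and the helper functions are hypothetical names used only for illustration; they are not what riscv-opcodes generates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One "hi..lo=value" constraint from a line in the opcodes file. */
typedef struct {
    unsigned hi;
    unsigned lo;
    uint32_t value;
} field_t;

/* Hypothetical tables mirroring the constraints listed earlier. */
const field_t MATRIX_MULTIPLY_FIELDS[] = {
    {31, 25, 0}, {14, 12, 0}, {6, 5, 0}, {4, 2, 2}, {1, 0, 3}
};
const field_t MATRIX_LD_FIELDS[] = {
    {14, 12, 1}, {6, 5, 0}, {4, 2, 2}, {1, 0, 3}
};

/* Ones in bits hi..lo, inclusive. */
uint32_t field_mask(unsigned hi, unsigned lo)
{
    assert(lo <= hi && hi < 32);
    unsigned width = hi - lo + 1;
    uint32_t ones = (width == 32) ? UINT32_MAX
                                  : ((UINT32_C(1) << width) - 1);
    return ones << lo;
}

/* OR together the masks of all constrained fields. */
uint32_t build_mask(const field_t *f, size_t n)
{
    uint32_t mask = 0;
    for (size_t i = 0; i < n; ++i)
        mask |= field_mask(f[i].hi, f[i].lo);
    return mask;
}

/* OR together the required values, each shifted into its field. */
uint32_t build_match(const field_t *f, size_t n)
{
    uint32_t match = 0;
    for (size_t i = 0; i < n; ++i) {
        assert(f[i].value <= (field_mask(f[i].hi, f[i].lo) >> f[i].lo));
        match |= f[i].value << f[i].lo;
    }
    return match;
}

/* The test a decoder applies to recognize an instruction. */
int insn_matches(uint32_t insn, uint32_t match, uint32_t mask)
{
    return (insn & mask) == match;
}
```

The R-type mask is wider than the other two because matrix_multiply also fixes bits 31..25, whereas the load and store leave those bits free for the immediate.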

Add New Instructions to GNU Toolchain.

Now we add our new instructions to the GNU toolchain so that we can write inline assembly. This is fairly simple. Add the above macros to riscv-gnu-toolchain/riscv-gdb/include/opcode/riscv-opc.h and riscv-gnu-toolchain/riscv-binutils/include/opcode/riscv-opc.h. Then add the following lines to riscv-opc.c:

{"matrix.multiply", 0, {"I", 0},   "d,s,t",  MATCH_MATRIX_MULTIPLY, MASK_MATRIX_MULTIPLY, match_opcode, 0 },
{"matrix.ld",       0, {"I", 0},   "d,o(s)",  MATCH_MATRIX_LD, MASK_MATRIX_LD, match_opcode, 0 },
{"matrix.st",       0, {"I", 0},   "t,q(s)",  MATCH_MATRIX_ST, MASK_MATRIX_ST, match_opcode, 0 },
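To sanity-check what the assembler emits for these entries (for example against a disassembly), the instruction words can be rebuilt by hand. The sketch below packs the operands using the standard R-, I- and S-type layouts. The encode_* function names are hypothetical; the MATCH values are the ones from encoding.h above.

```c
#include <assert.h>
#include <stdint.h>

#define MATCH_MATRIX_MULTIPLY 0xb
#define MATCH_MATRIX_LD 0x100b
#define MATCH_MATRIX_ST 0x200b

/* R-type: funct7 | rs2 | rs1 | funct3 | rd | opcode. */
uint32_t encode_matrix_multiply(unsigned rd, unsigned rs1, unsigned rs2)
{
    assert(rd < 32 && rs1 < 32 && rs2 < 32);
    return MATCH_MATRIX_MULTIPLY | ((uint32_t)rd << 7) |
           ((uint32_t)rs1 << 15) | ((uint32_t)rs2 << 20);
}

/* I-type: imm[11:0] | rs1 | funct3 | rd | opcode. */
uint32_t encode_matrix_ld(unsigned rd, unsigned rs1, int32_t offset)
{
    assert(rd < 32 && rs1 < 32);
    assert(offset >= -2048 && offset <= 2047);
    uint32_t imm = (uint32_t)offset & 0xfff;
    return MATCH_MATRIX_LD | ((uint32_t)rd << 7) |
           ((uint32_t)rs1 << 15) | (imm << 20);
}

/* S-type: imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode.
 * The 12-bit offset is split across two fields. */
uint32_t encode_matrix_st(unsigned rs2, unsigned rs1, int32_t offset)
{
    assert(rs1 < 32 && rs2 < 32);
    assert(offset >= -2048 && offset <= 2047);
    uint32_t imm = (uint32_t)offset & 0xfff;
    return MATCH_MATRIX_ST | ((imm & 0x1f) << 7) |
           ((uint32_t)rs1 << 15) | ((uint32_t)rs2 << 20) |
           ((imm >> 5) << 25);
}
```

Register arguments are plain register numbers, so encode_matrix_multiply(10, 11, 12) corresponds to operands x10, x11, x12.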

Recompile the GNU toolchain, and now let’s write a test program. We assume the matrices are already laid out block-wise.

// Use our pseudo-instruction.
register VTYPE tmp_a;
register VTYPE tmp_b;
#pragma GCC unroll 0
for (i = 0; i < N_BLOCKS; ++i) {
  #pragma GCC unroll 0
  for (j = 0; j < N_BLOCKS; ++j) {
    // Point at the output block and clear the accumulator.
    VTYPE *pc = c2[i][j];
    register VTYPE tmp_c = 0;
    #pragma GCC unroll 0
    for (k = 0; k < N_BLOCKS; ++k) {
      VTYPE *pa = a[i][k];
      VTYPE *pb = b[k][j];
      // The "d,o(s)" operand format expects offset(base), so spell out the 0 offset.
      __asm__ volatile("matrix.ld %[tmp_a], 0(%[a])\n\t"
                       "matrix.ld %[tmp_b], 0(%[b])\n\t"
                       "matrix.multiply %[tmp_c], %[tmp_a], %[tmp_b]\n\t"
                       : [tmp_c] "=&r"(tmp_c), [tmp_a] "=r"(tmp_a), [tmp_b] "=r"(tmp_b)
                       : [a] "r"(pa), [b] "r"(pb));
    }
    __asm__ volatile("matrix.st %[tmp_c], 0(%[c])\n\t"
                     :
                     : [c] "r"(pc), [tmp_c] "r"(tmp_c));
  }
}

There are some interesting points here:

  • We disable loop unrolling here via a pragma. This pragma is supported since GCC 8 (they really should have supported this before).
  • We use temporary register variables to hold the loaded blocks and the partial result. Notice the "=&r" for tmp_c: the & (earlyclobber) modifier makes sure that tmp_c does not share a register with any input operand.
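The test program assumes the operands are already stored block by block, so that each matrix.ld reads one contiguous 64-byte tile. A minimal sketch of that repacking from an ordinary row-major matrix follows. The names blockwise_index, to_blockwise and BLOCK are made up here, n is assumed to be a multiple of 4, and the choice of row-major order both between tiles and inside each tile is an assumption rather than something the hardware dictates.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 4 /* tile edge: 4x4 floats = 64 bytes */

/* Hypothetical helper: position of element (row, col) of an n x n
 * matrix in the block-wise layout. Tiles are stored row-major, and
 * the 16 elements inside each tile are row-major too. */
size_t blockwise_index(size_t row, size_t col, size_t n)
{
    assert(n % BLOCK == 0 && row < n && col < n);
    size_t n_blocks = n / BLOCK;
    size_t tile = (row / BLOCK) * n_blocks + (col / BLOCK);
    size_t within = (row % BLOCK) * BLOCK + (col % BLOCK);
    return tile * BLOCK * BLOCK + within;
}

/* Copy an n x n row-major matrix into block-wise order.
 * src and dst must not overlap. */
void to_blockwise(const float *src, float *dst, size_t n)
{
    for (size_t row = 0; row < n; ++row)
        for (size_t col = 0; col < n; ++col)
            dst[blockwise_index(row, col, n)] = src[row * n + col];
}
```

With this layout, the tile at block coordinates (i, k) starts at offset (i * n_blocks + k) * 16 floats, which is the kind of address the a[i][k] pointers in the test program are expected to hold.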

Simulate in Spike.

We first want to implement these new instructions in Spike, an ISA simulator for RISC-V. Although we use

Add this to riscv_insn_list in riscv-tools/riscv-isa-sim/riscv/riscv.mk.in:

	matrix_multiply \
	matrix_ld \
	matrix_st \

Declare the disassembly in riscv-tools/riscv-isa-sim/spike_main/disasm.cc:

DEFINE_RTYPE(matrix_multiply);
DEFINE_XLOAD(matrix_ld);
DEFINE_XSTORE(matrix_st);

Also add this to riscv-tools/riscv-isa-sim/fesvr/encoding.h:

DECLARE_INSN(matrix_multiply, MATCH_MATRIX_MULTIPLY, MASK_MATRIX_MULTIPLY)
DECLARE_INSN(matrix_ld, MATCH_MATRIX_LD, MASK_MATRIX_LD)
DECLARE_INSN(matrix_st, MATCH_MATRIX_ST, MASK_MATRIX_ST)

Now it is time to actually implement them in Spike. This involves changing decode.h and adding the implementation of each instruction in riscv/insns/*.h. The architectural state is defined in riscv/processor.h. Take a look at the existing instructions and you will soon figure it out.
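Before touching Spike, it can help to pin down what the three instructions are supposed to do. The plain-C model below is one possible reading of the semantics: a small file of 4x4 single-precision registers, a load and a store that each move 64 bytes, and a multiply that accumulates into the destination. It does not use Spike's macros or types. The names matrix_reg_t, matrix_unit_t and the model_* functions are hypothetical, and the register count of 32 is only an assumption.

```c
#include <assert.h>
#include <string.h>

#define DIM 4
#define N_MATRIX_REGS 32 /* assumed; the notes do not fix a count */

/* One matrix register: 16 floats, 64 bytes. */
typedef struct {
    float e[DIM][DIM];
} matrix_reg_t;

/* Architectural state added by the extension (hypothetical model). */
typedef struct {
    matrix_reg_t regs[N_MATRIX_REGS];
} matrix_unit_t;

/* matrix.ld: copy 64 bytes from memory into register rd. */
void model_matrix_ld(matrix_unit_t *mu, unsigned rd, const void *mem)
{
    assert(rd < N_MATRIX_REGS);
    memcpy(&mu->regs[rd], mem, sizeof(matrix_reg_t));
}

/* matrix.st: copy register rs out to 64 bytes of memory. */
void model_matrix_st(const matrix_unit_t *mu, unsigned rs, void *mem)
{
    assert(rs < N_MATRIX_REGS);
    memcpy(mem, &mu->regs[rs], sizeof(matrix_reg_t));
}

/* matrix.multiply: regs[rd] += regs[rs1] * regs[rs2]. */
void model_matrix_multiply(matrix_unit_t *mu, unsigned rd,
                           unsigned rs1, unsigned rs2)
{
    assert(rd < N_MATRIX_REGS && rs1 < N_MATRIX_REGS && rs2 < N_MATRIX_REGS);
    /* Accumulate into a copy so rd may alias rs1 or rs2. */
    matrix_reg_t acc = mu->regs[rd];
    const matrix_reg_t *a = &mu->regs[rs1];
    const matrix_reg_t *b = &mu->regs[rs2];
    for (int i = 0; i < DIM; ++i)
        for (int j = 0; j < DIM; ++j)
            for (int k = 0; k < DIM; ++k)
                acc.e[i][j] += a->e[i][k] * b->e[k][j];
    mu->regs[rd] = acc;
}
```

A model like this also gives a reference result to compare the simulator's output against once the real implementation is in place.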

Add the instruction in gem5.

Finally, we want to simulate the binary in a cycle-level simulator like gem5. Fortunately, the RISC-V implementation in gem5 has a really nice and extensible instruction decoder (compared to x86). The decoder is defined in src/arch/riscv/isa/decoder.isa. Find the correct place to insert your new instructions based on their binary format. ROp, Load and Store are predefined instruction templates. Notice that gem5’s instruction decoder is a Python-based template generator: the generator detects the input and output registers by analyzing the code that describes what the instruction does. For example, for our matrix multiply-accumulate instruction, since Rd appears on both sides of the assignment, it is marked as both an input and an output register. In out-of-order execution, this creates a dependence chain through successive instances of this instruction.

0x02: decode FUNCT3 {
  format ROp {
      0x0: matrix_multiply({{
          Rd = Rd + Rs1 + Rs2;
          xc->tcBase()->getCpuPtr()->matrixUnit.multiply();
      }}, MatrixOp);
  }
  format Load {
      0x1: matrix_ld({{
          Rd = Mem;
          uint64_t reg = xc->readIntRegOperand(this, 0);
          Addr addr = reg + offset;
          xc->tcBase()->getCpuPtr()->matrixUnit.load(xc, addr);
      }});
  }
  format Store {
      0x2: matrix_st({{
          Mem_ud = Rs2_ud;
          uint64_t reg = xc->readIntRegOperand(this, 0);
          Addr addr = reg + offset;
          Mem = xc->tcBase()->getCpuPtr()->matrixUnit.store(xc, addr);
      }});
  }
}

I will not show the details of implementing this within gem5. It is just grunt work.