## Infinity Stream Portable and Programmer-Friendly In-/Near-Memory Fusion

Zhengrong Wang<sup>1</sup>, <u>Christopher Liu<sup>1</sup></u>, Aman Arora<sup>2</sup>, Lizy John<sup>2</sup>, Tony Nowatzki<sup>1</sup> <sup>1</sup>UCLA, <sup>2</sup>UT Austin









## **Von Neumann Bottleneck**

Performance limited by memory bottleneck





### Memory

## **Von Neumann Bottleneck**

Performance limited by memory bottleneck





### Memory



## **Compute Paradigms**

More Expressive Less Parallelism



**de Supports complex control flow** 

**Von Neumann bottleneck** 

**Programs similar to in-core** without control flow



Less Expressive More Parallelism

### **Near-Memory**

Spatially Local Simple Cores

### **In-Memory**

Massive Vector Processing





Limited parallelism

- Massive amounts of parallelism
- **Difficult to program due to many** restrictions





## **Example: PointNet++**





## There is an Adoption Problem





## Programming is Difficult

Mat matmul(Mat core\_A, Mat core\_B, Mat core\_C): InMat in\_A = inMemcpy(core\_A, M \* N, fromCore) InMat in\_B = inMemcpy(core\_B, M \* N, fromCore) InMat in\_C = inMalloc(M \* N) for m in [0, M): for n in [0, N): InVec in\_V = in\_A[m][:] \* in\_B[:][c] in\_V = inPartialReduce(+, in\_V, rounds=3) NearVec near\_V = nearMalloc(K) near\_V = nearMemcpy(in\_V, K / 2 / 2 / 2, fromIn) NearScalar near\_dotpdt = nearReduce(near\_V) core\_C[m][n] = coreMemcpy(near\_dotpdt, fromNear)





## Hide Hardware Details



```
Mat matmul(Mat A, Mat B, Mat C):
  for m in [0, M):
    for n in [0, N):
      for k in [0, K):
        C[m][n] += A[m][k] * B[k][n]
```

#### **Easier to program**



## Outline



## **Unified Abstraction** Agnostically represents 1. In-core compute 2. In-memory compute 3. Near-memory compute Plain C Code Partitioning Static vs Dynamic

Portable binary without exposing µArch details to compiler





Specialize binary to take advantage of µArch details

µArch Extensions

Hardware support for compute paradigms



## **Near-Memory Compute**

Offload computation closer to the data

Supports complex memory access patterns



with **reduction** capabilities

Limitation: Lower compute width





### Augment memory technology with compute

C[i] = A[i] & B[i]





### Augment memory technology with compute



**SRAM Array** 





### Augment memory technology with compute

C[i] = A[i] & B[i]

#### Standard Data Layout

Bitlines (256)









### Augment memory technology with compute

### C[i] = A[i] & B[i]

#### **Bit-Serial Data Layout**

Bitlines (256)

## Supported Operations \_\_\_\_ .... +-\*/%&| ^ << >> sin cos exp log sqrt Offers massive vector parallelism (256) Wordlines PEs (A op B) **Bitwise Operations**

**SRAM Array** 









### Augment memory technology with compute

#### **Bit-Serial Data Layout**







## **Compute Paradigms**

## **In-Memory Compute**

Offers **massive** vector parallelism

**Challenge:** Requires data alignment & transpose

## **Near-Memory Compute**

Supports complex memory access patterns with **reduction** capabilities

**Limitation:** Lower compute width

## Programming Considerations

## **1. Orchestration**

Meet the requirements for each compute paradigm

- Data alignment  $\bullet$
- Data layout & tiling
- Bit-serial transpose
- Managing on-chip space

## 2. Fused In-/Near-Memory

Statically & dynamically take advantage of each compute paradigm

- **In-Memory:** Large input size & element-wise compute
- **Near-Memory:** Small input size or irregular memory patterns •
- **Fusion** of in-/near-memory

### **3. Portability**

Target a large variety of microarchitectures with a single binary



## Programming Considerations

## **1. Orchestration**

Meet the requirements for each compute paradigm

- Data alignment ullet
- Data layout & tiling ullet
- Bit-serial transpose
  - Managing on-chip space

## 2. Fused In-/Near-Memory

Statically & dynamically take advantage of each compute paradigm

- **In-Memory:** Large input size & element-wise compute •
- **Near-Memory:** Small input size or irregular memory patterns  $\bullet$
- **Fusion** of in-/near-memory

### **3. Portability**

Target a large variety of microarchitectures with a single binary

## Transparency minimizes programmer burden

### **Plain C Code**

#### **Unified Abstraction**

Agnostically represents 1. In-core compute 2. In-memory compute 3. Near-memory compute

> Partitioning Static vs Dynamic

Portable binary without exposing µArch details to compiler





Specialize binary to take advantage of µArch details

µArch Extensions

Hardware support for compute paradigms



## **Dataflow Representation**



## **Near-Memory Compute**

- Memory access pattern
- Computation

## **In-Memory Compute**

- N-D Tensors
- Data alignment
- Spatial reuse

### **Fused In-/Near-Memory**

- Reduction operations
- Fusion

## Stream Dataflow Graph

• Memory access pattern

# 1D Filter for i in [0, N-1): B[i] = F[0] × A[i] + F[1] × A[i+1]

Computation
 N-D Tensors
 Data alignment
 Reduction operations
 Spatial reuse
 Fusion



## Stream Dataflow Graph

• Memory access pattern

# 1D Filter for i in [0, N-1): B[i] = F[0] × A[i] + F[1] × A[i+1]

N-D Tensors
 Data alignment
 Reduction operations
 Spatial reuse
 Fusion

Computation



 Memory access pattern Computation 0

N-D Tensors

0

### **1D** Filter for i in [0, N-1): $B[i] = F[0] \times A[i]$

+  $F[1] \times A[i+1]$ 

Data alignment 0 **Reduction operations** Ο Spatial reuse 0 Fusion 0

**Observation:** contiguous memory access may be tensorized



But we need to map data to hardware





Memory access pattern

- Computation
- **N-D** Tensors 0

**1D** Filter

for i in [0, N-1):  $B[i] = F[0] \times A[i]$ 

+ F[1] × A[i+1]

#### **Global Lattice Space**

Data alignment

0

Each cell represents a virtual vector lane

**Describes N-dimensional alignment** 

.......... ..........

**Reduction operations** 0 Spatial reuse 0 Fusion 0

**Observation:** contiguous memory access may be tensorized







Memory access pattern

- Computation 0
- N-D Tensors 0

**1D Filter** 

for i in [0, N-1):  $B[i] = F[0] \times A[i]$ 

Data alignment

0

Ο

0

Ο



Memory access is represented as a hyperrectangle



#### Memory access pattern

- Computation 0
- N-D Tensors Ο

#### **1D Filter**

for i in [0, N-1):





Memory access is represented as a hyperrectangle

# . . . . . . . . . . . .

- Memory access pattern
- Computation 0 N-D Tensors 0
- Data alignment 0

for m in [0, M): for k in [0, K): C[m] += A[m][k] \* B[k]

#### Example







 $\bullet$ 

Α

В

Spatial reuse Fusion

0

0



- Memory access pattern
- ComputationN-D Tensors
- N-D lensors
- o Data alignment

## for m in [0, M): C[m] = sum(A[m][:] \* B[k])

Vectorize

#### Example

• Reduction operations





Α

В

Spatial reuse Fusion

0



Ξ

- Memory access patternComputation
- N-D Tensors
- o Data alignment

for m in [0, M): Spatial
 C[m] = sum(A[m][:] \* B[k])

#### Example

• Reduction operations





Α

В

Spatial reuse Fusion

0





=

- Memory access pattern Computation 0
- N-D Tensors 0
- Data alignment 0

## for m in [0, M): Spatial C[m] = sum(A[m][:] \* B[k])

#### Example

**Reduction operations** 0





В

Spatial reuse Fusion

0





=

- Memory access pattern Computation 0
- N-D Tensors 0
- Data alignment 0

## for m in [0, M): Spatial C[m] = sum(A[m][:] \* B[k])

#### Example

**Reduction operations** 0





В

Spatial reuse Fusion

0





=

- Memory access pattern
- Computation 0 N-D Tensors 0
- Data alignment 0

for m in [0, M): Spatial
 C[m] = sum(A[m][:] \* B[k])

**Reduction Tree** 

 $A_m \times B$ ╋ +

**Reduction operations** 0

> Spatial reuse Fusion



- Memory access patternComputation
- N-D Tensors
- o Data alignment
- Reduction operations

Spatial

for m in [0, M):
 C[m] = sum(A[m][:] \* B[k])

Idea: Expose more parallelism through spatial reuse

#### Example



Spatial reuse







- Memory access pattern 0
- Ο
- 0
- **Reduction operations** Ο



- Memory access pattern 0
- 0
- Data alignment 0
- **Reduction operations** Ο



- Memory access pattern
- 0
- 0
- Ο





- Memory access pattern 0
- 0
- 0
- Ο











Ο

## **Unified Abstraction** Agnostically represents 1. In-core compute 2. In-memory compute 3. Near-memory compute **Plain C Code** Partitioning Static vs Dynamic

Portable binary without exposing µArch details to compiler







## Mapping to Hardware

#### **Global Lattice**



#### **SRAM Banks**



## In-Memory Organization

#### Question

What are the effects of tiling on on inter-bank traffic?



#### Each last-level cache bank consists of many in-memory SRAM arrays







## In-Memory Organization

# array array

#### Question

What are the effects of tiling on on inter-bank traffic?



Abstract many SRAM arrays into a single large SRAM array







Lattice cells are a sub-matrix of some larger matrix























































#### **Inter-Bank Data Movement**

<

Row

Square



**Q**4

















#### **Inter-Bank Data Movement**

Row

Square









## Inter-bank Data Movement depends on Tiling Size + Move Direction

## Inter-bank Data Movement depends on **Tiling Size + Move Direction**

Static

Compiler

#### **1. Inter-Bank Traffic:**

2. Data Alignment: Best tiling size that minimizes inter-bank data movement Tensors A & B must have matching tiling size Tensors (A) is moved along dimensions  $d_1$ ,  $d_2$ ,  $d_3$ , ... are used together Tensor

Dynamic

**Runtime & JIT Compiler** 

#### Layout Hints

## **Microarchitecture Extensions**



## **Tensor Transpose Unit**

Transposes array elements to and from bit-serial format

## Layout Override Table

Tracks which arrays are currently transposed

More details in paper





**Unified Abstraction** Agnostically represents 1. In-core compute 2. In-memory compute 3. Near-memory compute **Plain C Code** 

4

#### Partitioning Static vs Dynamic

Portable binary without exposing µArch details to compiler





Specialize binary to take advantage of µArch details

µArch Extensions

Hardware support for compute paradigms





## **Static** Plain C Code **Unified Abstraction** Agnostically represents 1. In-core compute 2. In-memory compute 3. Near-memory compute









**Static** 

#### Plain C Code

#### **Unified Abstraction**

Agnostically represents 1. In-core compute 2. In-memory compute 3. Near-memory compute

## Key Challenges



![](_page_57_Figure_0.jpeg)

![](_page_57_Picture_1.jpeg)

![](_page_57_Picture_2.jpeg)

![](_page_57_Figure_3.jpeg)

![](_page_58_Figure_0.jpeg)

![](_page_58_Figure_1.jpeg)

Data alignment & layout
Portable binary
Decide between in-core, in-/near-memory, and fusion
Managing on-chip space
Tiling
Bit-serial transpose

![](_page_59_Figure_0.jpeg)

![](_page_59_Picture_2.jpeg)

## Simulator

gem5 with partial AVX-512 support

## Hardware Configuration

- 64 Cores (8x8)
- 18 ways (16 SRAM arrays/way)
- 8KB SRAM arrays (256x256)

## Simulation Methodology

![](_page_60_Picture_9.jpeg)

## **Fused Execution Speedup**

![](_page_61_Figure_1.jpeg)

• 8KB SRAM arrays (256x256)

![](_page_62_Figure_0.jpeg)

![](_page_63_Figure_0.jpeg)

![](_page_63_Figure_2.jpeg)

![](_page_63_Figure_3.jpeg)

![](_page_63_Picture_4.jpeg)

![](_page_63_Picture_5.jpeg)

![](_page_63_Picture_6.jpeg)