As a computer architect, I treat bandwidth and compute throughput as the key metrics for understanding hardware capability and finding system bottlenecks. While implementing a fused GNN kernel, I started wondering: how well can GPUs actually saturate the available memory bandwidth?
More specifically: Given a fixed access granularity (determined by the workload), can we saturate the available data bandwidth? From computer organization fundamentals, the number of in-flight requests N needed to saturate bandwidth B with latency L and request size S is:
\[ N = \frac{B \times L}{S} \]

This N translates directly into core resource requirements (registers, threads, scratchpad buffers). Understanding GPU bandwidth saturation helps answer:
- What is the expected bandwidth for this workload on this GPU?
- How can I write programs to better saturate bandwidth?
- Can algorithm co-optimization improve bandwidth utilization?
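As a quick sanity check on the formula above, plug in the peak L2 bandwidth of the GPU tested below (5.6TB/s), an assumed load latency of 300 cycles at 2GHz (150ns, the same figure used in the theoretical analysis later), and 512B requests:

\[ N = \frac{5.6\,\text{TB/s} \times 150\,\text{ns}}{512\,\text{B}} \approx \frac{840\,\text{kB}}{512\,\text{B}} \approx 1640 \]

That is roughly 1,600 outstanding 512B requests (about 840kB of data in flight) across the whole GPU - which is why the required registers, threads, and scratchpad space add up quickly.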
Experimental Setup
To answer these questions, I ran simple tests using different memory loading patterns (code here). Each SM uses producer warps to load data into the L1 scratchpad, with one consumer warp performing minimal computation to ensure we're memory-bound. The L1 scratchpad is divided into multiple stages, each holding a chunk, using mbarrier for synchronization.
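As a rough sketch of that structure (not the benchmark code itself: it uses the libcu++ cuda::barrier / cuda::memcpy_async API instead of raw mbarrier PTX, every thread acts as both producer and consumer, and the stage count and chunk size are made up for illustration):

```cuda
#include <cooperative_groups.h>
#include <cuda/barrier>

namespace cg = cooperative_groups;

constexpr int kStages = 4;          // number of pipeline stages (illustrative)
constexpr int kChunkBytes = 4096;   // bytes per stage (illustrative)

__global__ void staged_copy(const char* __restrict__ src, size_t total_bytes,
                            unsigned long long* sink) {
    __shared__ alignas(128) char stage[kStages][kChunkBytes];
    __shared__ cuda::barrier<cuda::thread_scope_block> filled[kStages];

    auto block = cg::this_thread_block();
    if (block.thread_rank() == 0)
        for (int s = 0; s < kStages; ++s)
            init(&filled[s], block.size());
    block.sync();

    unsigned long long acc = 0;
    size_t n_chunks = total_bytes / kChunkBytes;
    size_t issued = 0, consumed = 0;

    // Prefill: start async copies for the first kStages chunks.
    for (; issued < n_chunks && issued < (size_t)kStages; ++issued)
        cuda::memcpy_async(block, stage[issued], src + issued * kChunkBytes,
                           kChunkBytes, filled[issued]);

    for (; consumed < n_chunks; ++consumed) {
        int s = consumed % kStages;
        filled[s].arrive_and_wait();   // wait until stage s holds valid data
        acc += reinterpret_cast<const unsigned long long*>(stage[s])[0];
        block.sync();                  // everyone is done reading stage s
        if (issued < n_chunks) {       // refill the freed stage
            cuda::memcpy_async(block, stage[s], src + issued * kChunkBytes,
                               kChunkBytes, filled[s]);
            ++issued;
        }
    }
    if (block.thread_rank() == 0) *sink = acc;   // prevent dead-code elimination
}
```

The idea is the same as in the benchmark: each stage has its own barrier, producers keep refilling freed stages, and the consumer only touches a stage once its barrier says the data has landed.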
Test Configuration:
- GPU: RTX 5090
- Core frequency: 2GHz (fixed)
- Max L2 bandwidth: 5.6TB/s
- Data cached in L2 (testing L2 bandwidth saturation)
The results at the end show both raw bandwidth and utilization percentage for each of the implementations described below.
Results
Normal Load Instructions
The most traditional approach uses ordinary load instructions that bring data into the register file; the thread stalls waiting for the data to arrive before it can consume the value and make further progress.
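A minimal sketch of that pattern (names and sizes are illustrative, not the benchmark code): each producer thread issues vectorized loads whose results land in registers before being written into the shared-memory stage.

```cuda
// Blocking-load producer: data travels global memory -> registers -> shared
// memory, so in-flight bytes are bounded by threads x registers per thread.
__global__ void normal_load_producer(const float4* __restrict__ src,
                                     size_t n_vec) {
    extern __shared__ float4 stage[];                 // one chunk of the scratchpad
    const size_t stage_vecs = 4096 / sizeof(float4);  // e.g. a 4kB chunk
    for (size_t i = threadIdx.x; i < n_vec; i += blockDim.x) {
        float4 v = src[i];          // ordinary load: the value must land in a
                                    // register before the thread can use it
        stage[i % stage_vecs] = v;  // buffer into shared memory for the consumer
    }
    __syncthreads();                // consumer warp would read `stage` after this
}
```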
Key findings:
- Parallelism scales with threads: The number of in-flight requests largely depends on the number of threads, with each using registers to hold loaded values. Doubling producer warps roughly doubles bandwidth - for 512B chunks, one producer achieves ~206 GB/s while two reach ~400 GB/s.
- Dramatic chunk size effect: There’s a significant jump from 2kB to 4kB chunks. Even with a single producer, 4kB nearly triples bandwidth compared to 2kB (1728 GB/s vs. 598 GB/s). My wild guess is some aggressive hardware coalescing once chunk size exceeds a threshold.
- High resource requirements: To achieve 80%+ utilization, you need 16 producers each loading 4kB chunks - a substantial thread count.
Asynchronous Load (cp.async)
A clear disadvantage of normal loads is the massive thread count needed for parallelism. When data consumption isn't fine-grained, we can leverage asynchronous loads. NVIDIA GPUs provide cp.async instructions that don't block threads on load completion. Instead, threads issue cp.async.commit_group to group previously issued copies and cp.async.wait_group to wait for their data. Importantly, cp.async loads data directly into shared memory, eliminating register buffering and reducing register pressure. The maximum cp.async request size is 16B, so a single warp of 32 threads can fetch 512B per issue.
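A minimal sketch of the issue/commit/wait pattern via inline PTX (the benchmark may wrap these differently; the helpers below are just for illustration and assume sm_80 or newer):

```cuda
#include <cstdint>

// Copy 16B straight from global to shared memory. No register staging, and the
// issuing thread does not block here.
__device__ void async_load_16B(void* smem_dst, const void* gmem_src) {
    uint32_t dst = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
                 :: "r"(dst), "l"(gmem_src) : "memory");
}

__device__ void commit_and_wait_all() {
    asm volatile("cp.async.commit_group;\n" ::);             // close the current group
    asm volatile("cp.async.wait_group 0;\n" : : : "memory"); // wait for all groups
}
```

With each of a warp's 32 threads issuing one such 16B copy per group, a single producer warp fetches the 512B mentioned above.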
Key findings:
- Superior efficiency: Easily outperforms normal loads, even with a single producer, as expected given the reduced register pressure.
- Stage insensitivity: Performance isn’t sensitive to stage count beyond 2, which makes sense since we’re memory-bound and additional stages don’t help hide latency.
- Scaling with threshold: More producer warps help due to increased parallel in-flight requests, but there’s a chunk size threshold. Once chunks reach 4kB, even a single producer can saturate bandwidth.
Tensor Memory Accelerator (TMA)
Since Hopper, GPUs extend cp.async with bulk movements (cp.async.bulk). The Tensor Memory Accelerator (TMA) is a DMA engine that can load a tensor of up to five dimensions with a single instruction, eliminating the need for multiple producer warps.
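To give a flavor of what a single bulk request looks like, here is a sketch of the simpler 1D variant (cp.async.bulk into shared memory, with completion tracked by an mbarrier) written with inline PTX. It is illustrative only: the multi-dimensional tensor variant additionally needs a host-side TMA descriptor (cuTensorMapEncodeTiled), which is omitted here, the names and sizes are made up, and it assumes Hopper or newer.

```cuda
#include <cstdint>

constexpr uint32_t kChunk = 4096;   // bytes per bulk request (illustrative)

__global__ void bulk_load_one_chunk(const char* __restrict__ src,
                                    unsigned long long* sink) {
    __shared__ alignas(128) char stage[kChunk];
    __shared__ alignas(8) uint64_t mbar;   // mbarrier tracking the transfer

    uint32_t stage_ptr = static_cast<uint32_t>(__cvta_generic_to_shared(stage));
    uint32_t mbar_ptr  = static_cast<uint32_t>(__cvta_generic_to_shared(&mbar));

    if (threadIdx.x == 0)   // expect one arrival (the issuing thread)
        asm volatile("mbarrier.init.shared::cta.b64 [%0], 1;" :: "r"(mbar_ptr));
    __syncthreads();        // make the initialized barrier visible to the block

    if (threadIdx.x == 0) {
        // Tell the barrier how many bytes to expect, then issue one bulk copy;
        // the DMA engine moves the whole chunk with no register staging.
        asm volatile("mbarrier.arrive.expect_tx.shared::cta.b64 _, [%0], %1;"
                     :: "r"(mbar_ptr), "r"(kChunk));
        asm volatile(
            "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes"
            " [%0], [%1], %2, [%3];\n"
            :: "r"(stage_ptr), "l"(src), "r"(kChunk), "r"(mbar_ptr) : "memory");
    }

    // All threads spin on the barrier until the transfer completes (phase 0).
    asm volatile(
        "{\n\t"
        ".reg .pred P;\n\t"
        "WAIT_LOOP:\n\t"
        "mbarrier.try_wait.parity.shared::cta.b64 P, [%0], %1;\n\t"
        "@!P bra.uni WAIT_LOOP;\n\t"
        "}\n"
        :: "r"(mbar_ptr), "r"(0u) : "memory");

    if (threadIdx.x == 0)   // touch the data so the copy isn't optimized away
        *sink = reinterpret_cast<unsigned long long*>(stage)[0];
}
```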
Key findings:
- Similar patterns: Generally follows cp.async behavior but with lower absolute performance.
- Limited small-chunk performance: For 512B chunks, peak utilization is only 9.1%.
- Stage limitation mystery: Having more than 2 stages doesn’t help much, even when clearly under-utilizing bandwidth. This suggests the TMA unit has an implicit internal window for processing requests, so issuing more than 2 TMA requests doesn’t improve bandwidth.
Theoretical Analysis
Let’s do some back-of-envelope calculations to check our results. With 2GHz frequency, 5.6TB/s L2 bandwidth, and 170 cores:
- Bandwidth per core per cycle: 5.6TB/s ÷ 2GHz ÷ 170 = 16.47B
- Assuming a typical 300-cycle latency: each core needs ~4.8kB of data in flight to saturate bandwidth (worked out below)
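Tying this back to the formula from the introduction (N × S = B × L), per core that is:

\[ N \times S = B_{\text{core}} \times L \approx 16.47\,\text{B/cycle} \times 300\,\text{cycles} \approx 4.8\,\text{kB} \]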
There’s still some gap between theory and measurement, but this provides a reasonable first-order approximation.
Key Takeaway: Coarse-Grained Access is More Efficient
Beyond using these numbers as a reference for understanding program performance or estimating achievable bandwidth before writing kernels, the biggest takeaway is clear: while GPUs provide fine-grained abstractions (threads, byte loads), it’s far more efficient to orchestrate data movement in coarse-grained chunks (2-4kB).
We already see this principle in action across modern GPU kernels:
- Tile-based GEMM implementations
- Flash-attention algorithms
- Many large language model kernels
Keep this in mind when adapting new algorithms for GPUs - prioritize larger access granularities over fine-grained operations.
Full code available on GitHub. Feel free to test your own GPU and submit results via pull request!
