How well do GPUs saturate bandwidth?

2025/10/04

As a computer architect, I treat bandwidth and compute throughput as the key metrics for understanding hardware capability and finding system bottlenecks. While implementing a fused GNN kernel, I started wondering: how well can GPUs saturate the available memory bandwidth?

More specifically: given a fixed access granularity (determined by the workload), can we saturate the available data bandwidth? From computer organization fundamentals (this is Little's law), the number of in-flight requests N needed to saturate bandwidth B with latency L and request size S is:

\[ N=\frac{B\times L}{S} \]

This N translates directly to core resource requirements (registers, threads, scratchpad buffers), so understanding GPU bandwidth saturation helps us budget those resources before writing a kernel.
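To make the formula concrete, here is a worked instance. The 5.6 TB/s figure is the L2 bandwidth used later in this post; the ~500 ns load latency and 128 B request size are illustrative assumptions, not measured values:

\[ N = \frac{5.6\,\text{TB/s} \times 500\,\text{ns}}{128\,\text{B}} = \frac{2.8\,\text{MB in flight}}{128\,\text{B}} \approx 22{,}000 \text{ requests} \]

Tens of thousands of outstanding requests is far more than a few warps of blocking loads can sustain, which is exactly what the experiments below probe.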

Experimental Setup

To answer this question, I ran simple tests using different memory loading patterns (code here). Each SM uses producer warps to load data into the L1 scratchpad (shared memory), with one consumer warp performing minimal computation to keep the kernel memory-bound. The scratchpad is divided into multiple pipeline stages, each holding one chunk, with mbarrier objects synchronizing producers and consumer.
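The harness itself isn't reproduced here, but the staged structure looks roughly like the sketch below. The post uses dedicated producer warps and raw mbarriers; as a simpler stand-in this sketch uses the unified cuda::pipeline API from libcu++, and the stage count, chunk size, and kernel name are my assumptions, not the post's exact configuration.

```cuda
#include <cooperative_groups.h>
#include <cuda/pipeline>
namespace cg = cooperative_groups;

constexpr int STAGES = 4;     // pipeline depth (assumed)
constexpr int CHUNK  = 1024;  // floats per stage, i.e. 4 KiB (assumed)

// All threads act as both producer and consumer: each stage is filled with an
// async copy, then drained with a trivial reduction to stay memory-bound.
// For brevity this sketch assumes a single-block launch.
__global__ void staged_stream_sum(const float* __restrict__ in, float* out, size_t n) {
  auto block = cg::this_thread_block();
  __shared__ float buf[STAGES][CHUNK];
  __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, STAGES> state;
  auto pipe = cuda::make_pipeline(block, &state);

  size_t nchunks = n / CHUNK;
  // Prologue: put STAGES chunks in flight before consuming anything.
  for (size_t s = 0; s < (size_t)STAGES && s < nchunks; ++s) {
    pipe.producer_acquire();
    cuda::memcpy_async(block, buf[s], in + s * CHUNK, sizeof(buf[s]), pipe);
    pipe.producer_commit();
  }

  float acc = 0.f;
  for (size_t c = 0; c < nchunks; ++c) {
    pipe.consumer_wait();                      // block until stage c's data landed
    for (int i = block.thread_rank(); i < CHUNK; i += block.size())
      acc += buf[c % STAGES][i];               // minimal "compute"
    pipe.consumer_release();                   // recycle the stage
    if (c + STAGES < nchunks) {                // refill the freed stage
      pipe.producer_acquire();
      cuda::memcpy_async(block, buf[(c + STAGES) % STAGES],
                         in + (c + STAGES) * CHUNK, sizeof(buf[0]), pipe);
      pipe.producer_commit();
    }
  }
  out[block.thread_rank()] = acc;              // keep the loads observable
}
```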

Test Configuration:

The results at the end show both raw bandwidth and utilization percentage for each of the implementations described below.

Results

Normal Load Instructions

The most traditional approach uses ordinary load instructions that return data into the register file. A thread stalls once a dependent instruction needs data that hasn't arrived yet, so the only way to keep many loads in flight is to run many threads.
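A minimal sketch of this baseline (the grid-stride shape and names are mine, not the post's exact kernel):

```cuda
// Each thread issues ordinary 16 B vector loads; the scoreboard stalls the
// thread as soon as a dependent instruction needs the data, so memory-level
// parallelism comes almost entirely from having many resident threads.
__global__ void sum_normal_loads(const float4* __restrict__ in, float* out, size_t n4) {
  float acc = 0.f;
  size_t stride = (size_t)gridDim.x * blockDim.x;
  for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n4; i += stride) {
    float4 v = in[i];                 // blocking load into registers
    acc += v.x + v.y + v.z + v.w;     // minimal consumption: stay memory-bound
  }
  out[(size_t)blockIdx.x * blockDim.x + threadIdx.x] = acc;  // keep loads live
}
```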

Key findings:

Asynchronous Load (cp.async)

A clear disadvantage of normal loads is the massive thread count needed for parallelism. When data consumption isn't fine-grained, we can leverage asynchronous loads instead. NVIDIA GPUs (since Ampere) provide cp.async instructions that don't block threads on load completion: threads issue cp.async.commit_group to commit previously issued copies and cp.async.wait_group to wait for the data. Importantly, cp.async writes directly into shared memory, eliminating register buffering and reducing register pressure. The maximum cp.async request size is 16B, so a single 32-thread warp fetches at most 512B per instruction.
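Here is a sketch of the idea using the <cuda_pipeline.h> intrinsics, which compile down to the cp.async instructions named above; the stage size, copies-per-thread count, and kernel name are illustrative assumptions:

```cuda
#include <cuda_pipeline.h>

constexpr int COPIES_PER_THREAD = 8;  // assumed: 8 x 16 B = 128 B per thread

// A single warp fills a shared-memory stage without staging data in registers:
// each thread issues independent 16 B cp.async copies, commits the group, and
// only then waits, so all the copies overlap in flight.
__global__ void warp_cp_async(const float4* __restrict__ in, float* out) {
  __shared__ alignas(16) float4 stage[32 * COPIES_PER_THREAD];  // 4 KiB stage
  for (int j = 0; j < COPIES_PER_THREAD; ++j) {
    int idx = threadIdx.x + 32 * j;
    // 16 B is the maximum cp.async size: one warp moves 512 B per instruction
    __pipeline_memcpy_async(&stage[idx], &in[idx], sizeof(float4));
  }
  __pipeline_commit();        // cp.async.commit_group
  __pipeline_wait_prior(0);   // cp.async.wait_group 0: wait for all groups
  __syncwarp();
  float4 v = stage[threadIdx.x];
  out[threadIdx.x] = v.x + v.y + v.z + v.w;   // minimal consumption
}
```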

Key findings:

Tensor Memory Accelerator (TMA)

Since Hopper, GPUs extend cp.async with bulk movement (cp.async.bulk). The Tensor Memory Accelerator (TMA) is a DMA engine that can load up to a 5D tensor tile with a single instruction, eliminating the need for multiple producer warps.
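Below is a 1D sketch following the bulk-copy pattern from the CUDA programming guide; it assumes sm_90 and a recent CUDA 12.x toolkit, and the tile size and kernel name are illustrative. A single thread issues the whole transfer, and an mbarrier tracks completion by transaction byte count:

```cuda
#include <cuda/barrier>
#include <cuda/std/utility>  // cuda::std::move
using barrier = cuda::barrier<cuda::thread_scope_block>;
namespace cde = cuda::device::experimental;

constexpr size_t CHUNK = 1024;  // floats per tile, 4 KiB (assumed)

__global__ void tma_bulk_sum(const float* in, float* out) {
  __shared__ alignas(128) float smem[CHUNK];
  #pragma nv_diag_suppress static_var_with_dynamic_init
  __shared__ barrier bar;
  if (threadIdx.x == 0) {
    init(&bar, blockDim.x);
    cde::fence_proxy_async_shared_cta();  // make smem visible to the async proxy
  }
  __syncthreads();

  barrier::arrival_token tok;
  if (threadIdx.x == 0) {
    // One instruction moves the whole tile: no producer warps required.
    cde::cp_async_bulk_global_to_shared(smem, in + blockIdx.x * CHUNK,
                                        sizeof(smem), bar);
    tok = cuda::device::barrier_arrive_tx(bar, 1, sizeof(smem));  // expect 4 KiB
  } else {
    tok = bar.arrive();
  }
  bar.wait(cuda::std::move(tok));  // completes once all bytes have landed

  float acc = 0.f;
  for (int i = threadIdx.x; i < CHUNK; i += blockDim.x) acc += smem[i];
  out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```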

Key findings:

Theoretical Analysis

Let's do a back-of-the-envelope calculation to check our results, assuming a 2 GHz clock, 5.6 TB/s of L2 bandwidth, and 170 cores:
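The per-core step of this arithmetic follows directly from those numbers: dividing the L2 bandwidth evenly across cores and normalizing by the clock gives each SM's bandwidth budget.

\[ \frac{5.6\,\text{TB/s}}{170\ \text{SMs}} \approx 33\,\text{GB/s per SM} \quad\Rightarrow\quad \frac{33\,\text{GB/s}}{2\,\text{GHz}} \approx 16\,\text{B/cycle per SM} \]

Feeding this per-SM bandwidth back into \( N = B \times L / S \) then gives the number of requests each SM must keep in flight at a given access granularity.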

There’s still some gap between theory and measurement, but this provides a reasonable first-order approximation.

Key Takeaway: Coarse-Grained Access is More Efficient

Beyond using these numbers as a reference for understanding program performance or estimating achievable bandwidth before writing kernels, the biggest takeaway is clear: while GPUs expose fine-grained abstractions (individual threads, byte-granularity loads), it's far more efficient to orchestrate data movement in coarse-grained chunks (2–4 KiB).

We already see this principle in action across modern GPU kernels, which stage large tiles through shared memory (via cp.async or TMA) rather than streaming individual elements.

Keep this in mind when adapting new algorithms for GPUs: prioritize larger access granularities over fine-grained operations.


Full code available on GitHub. Feel free to test your own GPU and submit results via pull request!
