Principle:FMInference FlexLLMGen CUDA Memory Access Patterns
| Knowledge Sources | |
|---|---|
| Domains | CUDA Programming, Memory Systems, GPU Optimization |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Techniques for maximizing GPU memory bandwidth utilization through vectorized access widths, explicit cache policy control, predicated loads, and hardware-accelerated asynchronous data movement between memory tiers.
Description
GPU kernel performance is predominantly determined by how efficiently data is moved through the memory hierarchy. The key principle is that wider, aligned memory accesses with appropriate cache hints achieve significantly higher bandwidth than naive scalar loads. Additionally, on modern architectures, asynchronous copy instructions can move data directly from global memory to shared memory without consuming register file bandwidth, enabling software pipelining patterns that overlap memory access with computation.
Three orthogonal dimensions of optimization are addressed:
Access width: Issuing 16-byte vectorized loads/stores (e.g., ld.global.v4.u32) achieves higher throughput than four separate 4-byte loads because it reduces instruction count and improves memory coalescing.
Cache policy: Different access patterns benefit from different cache behaviors. Data that will be reused should use CacheAll (the default .ca qualifier). Data accessed once in a streaming fashion should use CacheStreaming (.cs) to avoid polluting the L1 cache. Data shared across thread blocks but not within a single block should use CacheGlobal (.cg) to cache only at L2.
Asynchronous copy: The cp.async instruction (available on Ampere and later architectures) copies data directly from global memory to shared memory without staging through registers. This frees the register file for computation and enables a producer-consumer pipeline where the memory subsystem loads data for a future computation stage while the current stage processes previously loaded data.
Usage
Apply these patterns in any memory-bandwidth-bound GPU kernel. Transformer inference kernels, which are dominated by large matrix operations and element-wise operations on high-dimensional tensors, benefit substantially from vectorized access and cache policy tuning.
Theoretical Basis
Vectorized Memory Access
NVIDIA GPUs execute memory operations at the warp level (32 threads). When each thread issues a 16-byte load, the hardware coalesces these into 512-byte transactions that match the cache line granularity. This achieves near-peak memory bandwidth. Using smaller access widths (4 or 8 bytes) may result in partial cache line utilization, wasting bandwidth.
The trade-off is register pressure: a 16-byte load consumes 4 registers per thread, while a 4-byte load consumes 1. Kernels with high register pressure may need to use narrower accesses.
Cache Policy Selection
The GPU L1 cache (typically 128KB per SM, configurable) and L2 cache (several MB) have limited capacity. Cache policy hints help the hardware make eviction decisions:
- CacheAll (.ca): Data is cached in both L1 and L2. Best for data with temporal locality (reused within the kernel).
- CacheGlobal (.cg): Data is cached only in L2, bypassing L1. Best for data shared across thread blocks but not reused within a single thread block's execution.
- CacheStreaming (.cs): Data is loaded with an evict-first policy, meaning it will be the first candidate for eviction. Best for streaming data that will not be reused.
Incorrect cache policy selection can reduce effective bandwidth by 20-40% due to cache thrashing.
Predicated Loads and Boundary Handling
When processing tensors whose dimensions are not multiples of the vectorized access width, boundary threads must be prevented from accessing out-of-bounds memory. Predicated loads (using PTX predicate registers) conditionally execute the memory access and zero-initialize the destination when the predicate is false. This avoids branch divergence within a warp, which would serialize the two execution paths.
Asynchronous Copy Pipeline
The cp.async instruction enables a multi-stage software pipeline:
- Stage N: Issue async copies for data needed in stage N+k.
- Fence: Commit the copy group with
cp.async.commit_group. - Wait: Before using data from stage N, wait with
cp.async.wait_group<k-1>to ensure stage N's data has arrived.
This pattern is managed by commit groups (fences) and parameterized wait operations. A BufferTracker utility manages the circular buffer index for multi-stage pipelines, typically using 2-6 stages depending on the ratio of compute to memory latency.
The tail_complete_wait functions handle the pipeline drain at the end of processing, where fewer than k stages remain in flight and the wait parameter must be adjusted dynamically.