Implementation:FMInference FlexLLMGen DeepSpeed Memory Access Utils
| Knowledge Sources | |
|---|---|
| Domains | CUDA Programming, Memory Systems, GPU Optimization |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Header-only CUDA utility library providing PTX-level optimized functions for global memory loads and stores, shared memory access, and asynchronous memory copy operations with configurable cache policies.
Description
This file defines the mem_access namespace containing a comprehensive set of template-specialized memory access functions for CUDA kernels. It provides four categories of operations:
1. Global Memory Loads (load_global): Template-specialized for access sizes of 4, 8, and 16 bytes, with three cache policies: CacheAll (cache at all levels, ld.global.ca), CacheGlobal (L2 only, ld.global.cg), and CacheStreaming (evict-first, ld.global.cs). Each variant has a predicated overload that conditionally executes the load based on a boolean flag (zero-initializing the destination when the predicate is false).
2. Shared Memory Loads (load_shared): Access shared memory with explicit PTX shared memory addressing via cvta.to.shared address conversion. Supports 4, 8, and 16 byte widths with optional predication.
3. Global and Shared Memory Stores (store_global, store_shared): Corresponding store operations with Writeback, CacheGlobal, and CacheStreaming store policies for global memory.
4. Asynchronous Memory Copy (memcpy_async): When ASYNC_COPY_AVAILABLE (Ampere+), provides cp.async instruction wrappers for direct global-to-shared memory transfers that bypass registers. Includes variants with no-op predication, zero-fill predication, combined zero+nop predication, and cache-global (cp.async.cg) variants. Pipeline management functions (memcpy_async_fence, memcpy_async_wait) and a BufferTracker utility class for managing multi-stage pipeline buffers are also provided.
All functions fall back to portable C++ implementations when PTX_AVAILABLE is not defined, ensuring compatibility with non-NVIDIA platforms.
Usage
Include this header in CUDA kernels that require fine-grained control over memory access patterns, cache behavior, and software pipelining. It is used extensively throughout DeepSpeed's custom CUDA kernels for transformers.
Code Reference
Source Location
- Repository: FMInference_FlexLLMGen
- File: benchmark/third_party/DeepSpeed/csrc/includes/memory_access_utils.h
- Lines: 1-1001
Signature
namespace mem_access {
enum class LoadPolicy { CacheAll, CacheGlobal, CacheStreaming };
enum class StorePolicy { Writeback, CacheGlobal, CacheStreaming };
// Global memory loads (4, 8, 16 bytes; 3 cache policies; with/without predicate)
template <int AccessSize, LoadPolicy policy = LoadPolicy::CacheAll>
__device__ __forceinline__ void load_global(void* dst, const void* src);
template <int AccessSize, LoadPolicy policy = LoadPolicy::CacheAll>
__device__ __forceinline__ void load_global(void* dst, const void* src, bool do_access);
// Shared memory loads (4, 8, 16 bytes; with/without predicate)
template <int AccessSize>
__device__ __forceinline__ void load_shared(void* dst, const void* src);
// Global memory stores (4, 8, 16 bytes; 3 store policies)
template <int AccessSize, StorePolicy policy = StorePolicy::Writeback>
__device__ __forceinline__ void store_global(void* dst, const void* src);
// Shared memory stores (4, 8, 16 bytes)
template <int AccessSize>
__device__ __forceinline__ void store_shared(void* dst, const void* src);
// Async copy (Ampere+): global -> shared bypassing registers
template <int AccessSize>
__device__ __forceinline__ void memcpy_async(void* shr, const void* gbl);
__device__ __forceinline__ void memcpy_async_fence();
template <int stages>
__device__ __forceinline__ void memcpy_async_wait();
// Pipeline buffer tracker
template <int max>
class BufferTracker { ... };
// Warp lane ID utility
__device__ __forceinline__ uint32_t lane_id();
} // namespace mem_access
Import
#include "memory_access_utils.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| AccessSize | int (template) | Yes | Number of bytes per access: 4, 8, or 16 |
| policy | LoadPolicy or StorePolicy (template) | No | Cache behavior hint for hardware. Default: CacheAll / Writeback |
| dst | void* | Yes | Destination buffer pointer (register-local for loads, memory for stores) |
| src | const void* | Yes | Source memory pointer (global/shared for loads, register-local for stores) |
| do_access | bool | No | Predicate flag; when false, destination is zero-initialized instead of loaded |
Outputs
| Name | Type | Description |
|---|---|---|
| dst contents | 4/8/16 bytes | Loaded data (for load functions) or zero (when predicate is false) |
| memory at dst | 4/8/16 bytes | Written data (for store functions) |
Usage Examples
#include "memory_access_utils.h"
__global__ void example_kernel(const float* input, float* output, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Load 16 bytes (4 floats) with L2-only caching
float4 data;
bool valid = (idx * 4 + 3) < n;
mem_access::load_global<16, mem_access::LoadPolicy::CacheGlobal>(
&data, input + idx * 4, valid);
// Process data...
data.x *= 2.0f;
// Store 16 bytes with streaming (evict-first) policy
if (valid) {
mem_access::store_global<16, mem_access::StorePolicy::CacheStreaming>(
output + idx * 4, &data);
}
}
// Async copy pipeline example (Ampere+)
__global__ void pipelined_kernel(const float* gbl_input,
float* output, int n) {
__shared__ float smem_buf[2][256]; // Double buffer
mem_access::BufferTracker<2> tracker;
int buf = tracker.get();
mem_access::memcpy_async<16>(&smem_buf[buf][threadIdx.x * 4],
gbl_input + threadIdx.x * 4);
mem_access::memcpy_async_fence();
// Wait for first stage to complete
mem_access::memcpy_async_wait<0>();
__syncthreads();
// Process from shared memory
output[threadIdx.x] = smem_buf[buf][threadIdx.x * 4];
}