Implementation:FMInference FlexLLMGen DeepSpeed Memory Access Utils

Knowledge Sources	FMInference_FlexLLMGen
Domains	CUDA Programming, Memory Systems, GPU Optimization
Last Updated	2026-02-09 12:00 GMT

Overview

Header-only CUDA utility library providing PTX-level optimized functions for global memory loads and stores, shared memory access, and asynchronous memory copy operations with configurable cache policies.

Description

This file defines the mem_access namespace containing a comprehensive set of template-specialized memory access functions for CUDA kernels. It provides four categories of operations:

1. Global Memory Loads (load_global): Template-specialized for access sizes of 4, 8, and 16 bytes, with three cache policies: CacheAll (cache at all levels, ld.global.ca), CacheGlobal (L2 only, ld.global.cg), and CacheStreaming (evict-first, ld.global.cs). Each variant has a predicated overload that conditionally executes the load based on a boolean flag (zero-initializing the destination when the predicate is false).

2. Shared Memory Loads (load_shared): Access shared memory with explicit PTX shared memory addressing via cvta.to.shared address conversion. Supports 4, 8, and 16 byte widths with optional predication.

3. Global and Shared Memory Stores (store_global, store_shared): Corresponding store operations with Writeback, CacheGlobal, and CacheStreaming store policies for global memory.

4. Asynchronous Memory Copy (memcpy_async): When ASYNC_COPY_AVAILABLE (Ampere+), provides cp.async instruction wrappers for direct global-to-shared memory transfers that bypass registers. Includes variants with no-op predication, zero-fill predication, combined zero+nop predication, and cache-global (cp.async.cg) variants. Pipeline management functions (memcpy_async_fence, memcpy_async_wait) and a BufferTracker utility class for managing multi-stage pipeline buffers are also provided.

All functions fall back to portable C++ implementations when PTX_AVAILABLE is not defined, ensuring compatibility with non-NVIDIA platforms.

Usage

Include this header in CUDA kernels that require fine-grained control over memory access patterns, cache behavior, and software pipelining. It is used extensively throughout DeepSpeed's custom CUDA kernels for transformers.

Code Reference

Source Location

Repository: FMInference_FlexLLMGen
File: benchmark/third_party/DeepSpeed/csrc/includes/memory_access_utils.h
Lines: 1-1001

Signature

namespace mem_access {

enum class LoadPolicy { CacheAll, CacheGlobal, CacheStreaming };
enum class StorePolicy { Writeback, CacheGlobal, CacheStreaming };

// Global memory loads (4, 8, 16 bytes; 3 cache policies; with/without predicate)
template <int AccessSize, LoadPolicy policy = LoadPolicy::CacheAll>
__device__ __forceinline__ void load_global(void* dst, const void* src);

template <int AccessSize, LoadPolicy policy = LoadPolicy::CacheAll>
__device__ __forceinline__ void load_global(void* dst, const void* src, bool do_access);

// Shared memory loads (4, 8, 16 bytes; with/without predicate)
template <int AccessSize>
__device__ __forceinline__ void load_shared(void* dst, const void* src);

// Global memory stores (4, 8, 16 bytes; 3 store policies)
template <int AccessSize, StorePolicy policy = StorePolicy::Writeback>
__device__ __forceinline__ void store_global(void* dst, const void* src);

// Shared memory stores (4, 8, 16 bytes)
template <int AccessSize>
__device__ __forceinline__ void store_shared(void* dst, const void* src);

// Async copy (Ampere+): global -> shared bypassing registers
template <int AccessSize>
__device__ __forceinline__ void memcpy_async(void* shr, const void* gbl);

__device__ __forceinline__ void memcpy_async_fence();

template <int stages>
__device__ __forceinline__ void memcpy_async_wait();

// Pipeline buffer tracker
template <int max>
class BufferTracker { ... };

// Warp lane ID utility
__device__ __forceinline__ uint32_t lane_id();

}  // namespace mem_access

Import

#include "memory_access_utils.h"

I/O Contract

Inputs

Name	Type	Required	Description
AccessSize	int (template)	Yes	Number of bytes per access: 4, 8, or 16
policy	LoadPolicy or StorePolicy (template)	No	Cache behavior hint for hardware. Default: CacheAll / Writeback
dst	void*	Yes	Destination buffer pointer (register-local for loads, memory for stores)
src	const void*	Yes	Source memory pointer (global/shared for loads, register-local for stores)
do_access	bool	No	Predicate flag; when false, destination is zero-initialized instead of loaded

Outputs

Name	Type	Description
dst contents	4/8/16 bytes	Loaded data (for load functions) or zero (when predicate is false)
memory at dst	4/8/16 bytes	Written data (for store functions)

Usage Examples

#include "memory_access_utils.h"

__global__ void example_kernel(const float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Load 16 bytes (4 floats) with L2-only caching
    float4 data;
    bool valid = (idx * 4 + 3) < n;
    mem_access::load_global<16, mem_access::LoadPolicy::CacheGlobal>(
        &data, input + idx * 4, valid);

    // Process data...
    data.x *= 2.0f;

    // Store 16 bytes with streaming (evict-first) policy
    if (valid) {
        mem_access::store_global<16, mem_access::StorePolicy::CacheStreaming>(
            output + idx * 4, &data);
    }
}

// Async copy pipeline example (Ampere+)
__global__ void pipelined_kernel(const float* gbl_input,
                                  float* output, int n) {
    __shared__ float smem_buf[2][256];  // Double buffer
    mem_access::BufferTracker<2> tracker;

    int buf = tracker.get();
    mem_access::memcpy_async<16>(&smem_buf[buf][threadIdx.x * 4],
                                  gbl_input + threadIdx.x * 4);
    mem_access::memcpy_async_fence();

    // Wait for first stage to complete
    mem_access::memcpy_async_wait<0>();
    __syncthreads();

    // Process from shared memory
    output[threadIdx.x] = smem_buf[buf][threadIdx.x * 4];
}

Related Pages

Principle:FMInference_FlexLLMGen_CUDA_Memory_Access_Patterns

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment