Implementation:Vllm project Vllm CUMem Allocator

Knowledge Sources	vllm
Domains	GPU Memory Management, CUDA Virtual Memory
Last Updated	2026-02-08 00:00 GMT

Overview

Implements a custom PyTorch CUDAPluggableAllocator using CUDA virtual memory management APIs (cuMemCreate, cuMemMap) for fine-grained GPU memory control.

Description

This file provides a C extension module, callable from Python, that bypasses PyTorch's default caching allocator and gives vLLM direct control over GPU memory allocation. It uses the CUDA Driver API (cuMemCreate, cuMemMap, cuMemSetAccess) to create pinned physical allocations and map them into a reserved virtual address range, with optional GPUDirect RDMA and NVLink fabric handle support. On ROCm, allocations are split into chunks of configurable size (256 MB by default, overridable through the VLLM_ROCM_SLEEP_MEM_CHUNK_SIZE environment variable), and both allocation and cleanup iterate over the chunks. The module exposes python_create_and_map and python_unmap_and_release as Python-callable functions, and optional Python callbacks (g_python_malloc_callback, g_python_free_callback) can be registered for allocation tracking.
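
The ROCm chunking mentioned above can be illustrated with a small, CUDA-free sketch. It is not the code from csrc/cumem_allocator.cpp: the helper names (parse_chunk_bytes, chunk_bytes_from_env, split_into_chunks) are hypothetical, the unit of VLLM_ROCM_SLEEP_MEM_CHUNK_SIZE is assumed to be MiB, and the choice to let the last chunk be smaller than the others is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Default chunk size stated in the description: 256 MB.
constexpr std::size_t kDefaultChunkBytes = 256ull << 20;

// Hypothetical helper: parse a chunk size given in MiB (assumed unit).
// Falls back to the default on a missing, malformed, or non-positive value.
std::size_t parse_chunk_bytes(const char* text) {
  if (text == nullptr || *text == '\0') return kDefaultChunkBytes;
  char* end = nullptr;
  const long long mib = std::strtoll(text, &end, 10);
  if (*end != '\0' || mib <= 0) return kDefaultChunkBytes;
  return static_cast<std::size_t>(mib) << 20;
}

// Hypothetical helper: read the override from the environment.
std::size_t chunk_bytes_from_env() {
  return parse_chunk_bytes(std::getenv("VLLM_ROCM_SLEEP_MEM_CHUNK_SIZE"));
}

// Split `total` bytes into pieces of at most `chunk` bytes.
// The final piece may be smaller when `total` is not a multiple of `chunk`.
std::vector<std::size_t> split_into_chunks(std::size_t total,
                                           std::size_t chunk) {
  assert(chunk > 0);
  std::vector<std::size_t> sizes;
  for (std::size_t offset = 0; offset < total; offset += chunk) {
    const std::size_t remaining = total - offset;
    sizes.push_back(remaining < chunk ? remaining : chunk);
  }
  return sizes;
}
```

Under these assumptions, a 600 MiB request with the default chunk size would be split into three pieces, and each piece would get its own physical allocation handle that cleanup later walks over in the same order.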

Usage

This file is compiled as a standalone Python C extension module. vLLM's memory management layer loads it when it needs finer control over GPU memory than the default allocator offers, for example to reduce fragmentation or to support sleep mode, where GPU memory is released while the model is idle.

Code Reference

Source Location

Repository: vllm
File: csrc/cumem_allocator.cpp
Lines: 1-751

Signature

// Helper functions
void ensure_context(unsigned long long device);

void create_and_map(unsigned long long device, ssize_t size,
                    CUdeviceptr d_mem,
                    CUmemGenericAllocationHandle* p_memHandle);

void unmap_and_release(unsigned long long device, ssize_t size,
                       CUdeviceptr d_mem,
                       CUmemGenericAllocationHandle* p_memHandle);

// Python-exposed functions
static PyObject* py_init_module(PyObject* self, PyObject* args);
static PyObject* python_create_and_map(PyObject* self, PyObject* args);
static PyObject* python_unmap_and_release(PyObject* self, PyObject* args);

// Utility
PyObject* create_tuple_from_c_integers(unsigned long long a,
    unsigned long long b, unsigned long long c, unsigned long long d);

Import

#include "cumem_allocator_compat.h"
#include <Python.h>
#include <iostream>

I/O Contract

Inputs

Name	Type	Required	Description
device	unsigned long long	Yes	CUDA device ordinal to allocate on
size	ssize_t	Yes	Allocation size in bytes (must be a multiple of the allocation granularity reported by cuMemGetAllocationGranularity)
d_mem	CUdeviceptr	Yes	Reserved virtual address to map the allocation to
p_memHandle	CUmemGenericAllocationHandle*	Yes	Handle storage: written by create_and_map, read by unmap_and_release
g_python_malloc_callback	PyObject*	No	Optional Python callback invoked on allocation (registered through py_init_module)
g_python_free_callback	PyObject*	No	Optional Python callback invoked on free (registered through py_init_module)
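
The alignment requirement on size can be met by rounding the requested byte count up before calling create_and_map. The following is a minimal sketch, not taken from the source file: align_up and is_aligned are hypothetical helper names, and the granularity argument is assumed to be the value that cuMemGetAllocationGranularity reports for the target device.

```cpp
#include <cassert>
#include <cstddef>

// Round `size` up to the next multiple of `granularity`.
// `granularity` is assumed to come from cuMemGetAllocationGranularity.
std::size_t align_up(std::size_t size, std::size_t granularity) {
  assert(granularity > 0);
  return (size + granularity - 1) / granularity * granularity;
}

// True when `size` is already a whole number of granules.
bool is_aligned(std::size_t size, std::size_t granularity) {
  assert(granularity > 0);
  return size % granularity == 0;
}
```

With a granularity of 2 MiB, for instance, a request for 2 MiB plus one byte would be rounded up to 4 MiB, while a 256 MiB request would pass through unchanged.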

Outputs

Name	Type	Description
(return)	PyObject*	Python tuple (device, size, d_mem, memHandle) describing a create_and_map allocation, as built by create_tuple_from_c_integers
error_msg	char[10240]	Human-readable error message buffer on CUDA failure
error_code	CUresult	CUDA error code (0 for success)
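
The error_code and error_msg outputs suggest a pattern where driver calls record their status in module-level state that the Python-facing wrappers can inspect afterwards. The sketch below illustrates that pattern without the CUDA headers. It is an assumption-laden stand-in, not the macro used in the source: FakeResult replaces CUresult, and record_result, CHECK_DRIVER_CALL, and take_error are hypothetical names. Keeping only the first failure is a design choice of this sketch.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Stand-in for CUresult so the sketch builds without the CUDA headers.
using FakeResult = int;
constexpr FakeResult kNoError = 0;

// Mirrors the documented outputs: a status code plus a message buffer.
static FakeResult error_code = kNoError;
static char error_msg[10240];

// Hypothetical helper: record a failed call. Only the first failure is
// kept, so later errors do not overwrite the original cause.
void record_result(FakeResult result, const char* call, const char* file,
                   int line) {
  if (result == kNoError || error_code != kNoError) return;
  error_code = result;
  std::snprintf(error_msg, sizeof(error_msg),
                "%s failed with code %d at %s:%d", call, result, file, line);
}

// Hypothetical macro: wrap a driver call and capture its location.
#define CHECK_DRIVER_CALL(expr) \
  record_result((expr), #expr, __FILE__, __LINE__)

// Hypothetical wrapper-side helper: return the pending error and reset
// the state so the next operation starts clean.
FakeResult take_error() {
  const FakeResult pending = error_code;
  error_code = kNoError;
  return pending;
}
```

A Python-facing wrapper could call take_error after the helper returns and, on a non-zero result, raise an exception carrying error_msg.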

Usage Examples

// From Python via the C extension module:
// 1. Initialize the module with optional malloc/free callbacks
//    py_init_module(malloc_callback, free_callback)
//
// 2. Create and map GPU memory
//    device, size, d_mem, handle = python_create_and_map(device, size, d_mem, handle)
//
// 3. Release GPU memory
//    python_unmap_and_release(device, size, d_mem, handle)

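// C-level usage:
unsigned long long device = 0;
ssize_t size = 1024 * 1024 * 256;  // 256 MB; must be a multiple of the allocation granularity
ensure_context(device);  // make sure a CUDA context is current on this device
CUdeviceptr d_mem = 0;
// Reserve a virtual address range first; create_and_map backs it with physical memory.
if (cuMemAddressReserve(&d_mem, size, 0, 0, 0) != CUDA_SUCCESS) { /* handle error */ }
CUmemGenericAllocationHandle memHandle;
create_and_map(device, size, d_mem, &memHandle);
// On failure, the status is reported through error_code and error_msg.
// ... use memory ...
unmap_and_release(device, size, d_mem, &memHandle);
cuMemAddressFree(d_mem, size);  // release the reserved virtual address range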

Related Pages

Environment:Vllm_project_Vllm_CUDA_GPU_Runtime
