Implementation:Vllm project Vllm CUMem Allocator
| Knowledge Sources | |
|---|---|
| Domains | GPU Memory Management, CUDA Virtual Memory |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Implements a custom PyTorch CUDAPluggableAllocator using CUDA virtual memory management APIs (cuMemCreate, cuMemMap) for fine-grained GPU memory control.
Description
This file provides a Python-accessible C extension module that bypasses PyTorch's default caching allocator and gives vLLM direct control over GPU memory allocation. It uses the CUDA Driver API (cuMemCreate, cuMemMap, cuMemSetAccess) to create pinned physical device memory and map it into a reserved virtual address range, with optional GPUDirect RDMA and NVLink fabric handle support. On ROCm, allocations are split into chunks of a configurable size (default 256MB, overridable via the VLLM_ROCM_SLEEP_MEM_CHUNK_SIZE environment variable), with allocation and cleanup handled across multiple chunks. The module exposes python_create_and_map and python_unmap_and_release as Python-callable functions, and holds optional Python callbacks (g_python_malloc_callback, g_python_free_callback) for allocation tracking.
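The chunking behaviour described for ROCm can be illustrated with a small, CUDA-free sketch. The helper names chunk_size_bytes_from_env and split_into_chunks are hypothetical and do not appear in the source file. The sketch assumes the environment variable holds a positive integer number of MiB and that the last chunk may be smaller than the others; neither detail is confirmed by this page.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical helper: turn the raw environment string into a chunk size in
// bytes. Assumption: the value is a positive integer number of MiB.
// Anything unparsable or out of range falls back to the 256 MiB default.
inline std::size_t chunk_size_bytes_from_env(const char* value) {
  const unsigned long long kDefaultMiB = 256;
  const unsigned long long kMaxMiB = 1ull << 20;  // 1 TiB cap, avoids overflow
  unsigned long long mib = kDefaultMiB;
  if (value != nullptr && std::isdigit(static_cast<unsigned char>(*value))) {
    char* end = nullptr;
    unsigned long long parsed = std::strtoull(value, &end, 10);
    if (*end == '\0' && parsed > 0 && parsed <= kMaxMiB) mib = parsed;
  }
  return static_cast<std::size_t>(mib * 1024 * 1024);
}

// Hypothetical helper: split a total allocation size into per-chunk sizes.
// Every chunk has `chunk` bytes except possibly the last one.
inline std::vector<std::size_t> split_into_chunks(std::size_t total,
                                                  std::size_t chunk) {
  assert(chunk > 0);
  std::vector<std::size_t> sizes;
  for (std::size_t offset = 0; offset < total; offset += chunk) {
    sizes.push_back(std::min(chunk, total - offset));
  }
  return sizes;
}
```

A caller would typically pass std::getenv("VLLM_ROCM_SLEEP_MEM_CHUNK_SIZE") to the first helper and then create, map, and later release one handle per entry returned by the second.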
Usage
This file is compiled as a standalone Python C extension module. The vLLM memory management layer loads it to provide custom GPU memory allocation when finer control is needed, for example to reduce fragmentation or to release and later restore GPU memory (sleep mode) during large model inference.
Code Reference
Source Location
- Repository: vllm
- File: csrc/cumem_allocator.cpp
- Lines: 1-751
Signature
// Helper functions
void ensure_context(unsigned long long device);
void create_and_map(unsigned long long device, ssize_t size,
                    CUdeviceptr d_mem,
                    CUmemGenericAllocationHandle* p_memHandle);
void unmap_and_release(unsigned long long device, ssize_t size,
                       CUdeviceptr d_mem,
                       CUmemGenericAllocationHandle* p_memHandle);

// Python-exposed functions
static PyObject* py_init_module(PyObject* self, PyObject* args);
static PyObject* python_create_and_map(PyObject* self, PyObject* args);
static PyObject* python_unmap_and_release(PyObject* self, PyObject* args);

// Utility
PyObject* create_tuple_from_c_integers(unsigned long long a,
                                       unsigned long long b,
                                       unsigned long long c,
                                       unsigned long long d);
Import
#include "cumem_allocator_compat.h"
#include <Python.h>
#include <iostream>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| device | unsigned long long | Yes | CUDA device ordinal to allocate on |
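| size | ssize_t | Yes | Allocation size in bytes (must be a multiple of the allocation granularity reported by cuMemGetAllocationGranularity) |
| d_mem | CUdeviceptr | Yes | Reserved virtual address (for example from cuMemAddressReserve) to map the allocation to |
| p_memHandle | CUmemGenericAllocationHandle* | Yes | Pointer to the allocation handle: written by create_and_map, read by unmap_and_release |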
| g_python_malloc_callback | PyObject* | No | Optional Python callback invoked on allocation |
| g_python_free_callback | PyObject* | No | Optional Python callback invoked on free |
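The alignment requirement on size can be met by rounding the requested byte count up to the next multiple of the granularity. The sketch below shows only that arithmetic; align_up and is_aligned are hypothetical helper names. The granularity is passed in as a plain number here, whereas real code would obtain it from cuMemGetAllocationGranularity.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: round `size` up to the next multiple of `granularity`.
inline std::size_t align_up(std::size_t size, std::size_t granularity) {
  assert(granularity > 0);
  return ((size + granularity - 1) / granularity) * granularity;
}

// Hypothetical helper: check whether `size` already satisfies the requirement.
inline bool is_aligned(std::size_t size, std::size_t granularity) {
  assert(granularity > 0);
  return size % granularity == 0;
}
```

For instance, with an assumed granularity of 2 MiB, a request for 3 MiB would be rounded up to 4 MiB before being passed to create_and_map.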
Outputs
| Name | Type | Description |
|---|---|---|
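| (return) | PyObject* | Python tuple of (device, size, d_mem, memHandle), as built by create_tuple_from_c_integers for the allocation callback; python_create_and_map and python_unmap_and_release themselves return None on success |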
| error_msg | char[10240] | Human-readable error message buffer on CUDA failure |
| error_code | CUresult | CUDA error code (0 for success) |
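The error_msg and error_code pair suggests a pattern in which a failing driver call records its status in global state and the Python-facing wrapper later converts it into an exception. The following is a minimal sketch of that pattern under stated assumptions: it uses a plain int in place of CUresult, the names record_error and consume_error are hypothetical, and the message format is illustrative only.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

namespace sketch {

constexpr int kNoError = 0;  // stand-in for the CUDA success code (0)
char error_msg[10240];       // same buffer size as in the I/O contract
int error_code = kNoError;

// Hypothetical helper: remember a failure and format a readable message.
// Successful codes are ignored so the call can wrap every driver call.
inline void record_error(int code, const char* what, const char* file,
                         int line) {
  assert(what != nullptr && file != nullptr);
  if (code == kNoError) return;
  error_code = code;
  std::snprintf(error_msg, sizeof(error_msg), "CUDA Error: %s at %s:%d", what,
                file, line);
}

// Hypothetical helper: if an error is pending, copy its message out, reset
// the state, and return true so the caller can raise an exception.
inline bool consume_error(std::string* message) {
  assert(message != nullptr);
  if (error_code == kNoError) return false;
  *message = error_msg;
  error_code = kNoError;
  return true;
}

}  // namespace sketch
```

Resetting the code when the error is consumed keeps one failed allocation from being reported again by the next, unrelated call.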
Usage Examples
// From Python via the C extension module:
// 1. Initialize the module with optional malloc/free callbacks
//    (py_init_module is exposed to Python as init_module)
// init_module(malloc_callback, free_callback)
//
// 2. Create and map GPU memory (handle is the p_memHandle pointer passed as an integer)
// python_create_and_map(device, size, d_mem, handle)
//
// 3. Release GPU memory
// python_unmap_and_release(device, size, d_mem, handle)
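// C-level usage (assumes the driver is initialized; error checks omitted):
unsigned long long device = 0;
ssize_t size = 1024 * 1024 * 256; // 256 MiB; must be a multiple of the allocation granularity
CUdeviceptr d_mem;
cuMemAddressReserve(&d_mem, size, 0, 0, 0); // reserve a virtual address range
CUmemGenericAllocationHandle memHandle;
create_and_map(device, size, d_mem, &memHandle); // back the range with physical memory
// ... use memory ...
unmap_and_release(device, size, d_mem, &memHandle); // unmap and free the physical memory
cuMemAddressFree(d_mem, size); // return the reserved virtual address range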