Environment:Huggingface Datatrove Inference GPU Environment

Knowledge Sources	Huggingface Datatrove, pyproject.toml, Inference README
Domains	Infrastructure, Deep_Learning, Inference
Last Updated	2026-02-14 17:00 GMT

Overview

GPU-accelerated inference environment with vLLM, SGLang, CUDA support, and async I/O for running LLM inference pipelines at scale.

Description

This environment provides everything needed to run Datatrove's synthetic data generation and LLM inference pipelines. It includes the vLLM and SGLang inference engines, quantization support (bitsandbytes), async HTTP and database clients (httpx, aiosqlite), HuggingFace Transformers, and CLI tools (typer). It requires at least one CUDA-compatible GPU for local execution. When launching vLLM, Datatrove also defaults `USE_TF=0` and `TRANSFORMERS_NO_TF=1` so that transformers skips its TensorFlow imports, which shortens startup.

Usage

Use this environment for the Synthetic Data Generation workflow and any pipeline step that involves LLM inference: InferenceRunner, InferenceServer (VLLMServer, SGLangServer), CheckpointManager, InferenceDatasetCardGenerator, and InferenceProgressMonitor. Local execution requires CUDA GPUs; remote execution via endpoint servers does not.

System Requirements

Category	Requirement	Notes
OS	Linux	vLLM and SGLang require Linux
Hardware	NVIDIA GPU with CUDA support	Minimum 1 GPU for local execution; multi-GPU supported via tensor/pipeline parallelism
VRAM	Varies by model	7B models: ~16 GB; 70B models: 80 GB+ across multiple GPUs (16-bit weights alone are roughly 140 GB); quantized models require less
Disk	10GB+	Model weights downloaded on first use
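
The VRAM figures above can be sanity-checked with simple arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below is only a back-of-the-envelope estimate. It ignores KV cache, activations, and framework overhead, and the helper names and the `usable_fraction` default are illustrative assumptions rather than anything Datatrove or vLLM exposes.

```python
import math


def estimate_weight_memory_gb(num_params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Rough size of the model weights alone, in GB (1 GB = 1e9 bytes).

    bytes_per_param: 2.0 for 16-bit weights, 1.0 for 8-bit, 0.5 for 4-bit.
    """
    return num_params_billions * bytes_per_param


def min_gpus_for_weights(weight_gb: float, gpu_vram_gb: float, usable_fraction: float = 0.9) -> int:
    """Smallest GPU count whose combined usable VRAM holds the weights.

    usable_fraction is an assumed share of each GPU that may be used for weights.
    """
    usable_per_gpu = gpu_vram_gb * usable_fraction
    return max(1, math.ceil(weight_gb / usable_per_gpu))


if __name__ == "__main__":
    print(estimate_weight_memory_gb(7))       # 14.0 (16-bit weights of a 7B model)
    print(estimate_weight_memory_gb(70))      # 140.0 (16-bit weights of a 70B model)
    print(min_gpus_for_weights(140.0, 80.0))  # 2 (weights only, no KV cache headroom)
```

In practice more headroom is needed than this lower bound suggests, since the KV cache grows with batch size and sequence length.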

Dependencies

Python Packages (Inference Group)

`datatrove[io]` — All IO dependencies (transitive)
`aiofiles` — Async file I/O
`httpx` — Async HTTP client for API calls
`aiosqlite` — Async SQLite for request caching
`vllm` — vLLM LLM inference engine
`sglang` — SGLang inference engine
`bitsandbytes` — Quantization support
`numpy` >= 2.0.0, < 2.3 — Upper bound due to numba compatibility
`typer` — CLI framework
`pyyaml` — YAML configuration parsing
`pandas` — Data manipulation
`transformers` >= 4.57 — HuggingFace Transformers (vLLM compatibility constraint)

Credentials

The following environment variables may be needed:

`HF_TOKEN`: HuggingFace API token (for accessing gated models like Llama)
`VLLM_COMPILE_LOCK_DIR`: Directory for vLLM torch.compile cache locks (default: `~/.cache/vllm_compile_locks`). Must be on a shared filesystem for multi-job scenarios.
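
As a minimal sketch of how a launch script might inspect these two variables before starting a job: the helper names below are hypothetical, and the fallback path is simply the default quoted above. How Datatrove itself resolves these values is not shown here.

```python
import os
from pathlib import Path

# Default quoted in the variable description above.
DEFAULT_LOCK_DIR = "~/.cache/vllm_compile_locks"


def resolve_compile_lock_dir(env=None) -> Path:
    """Return the lock directory from VLLM_COMPILE_LOCK_DIR, or the documented default."""
    env = os.environ if env is None else env
    raw = env.get("VLLM_COMPILE_LOCK_DIR") or DEFAULT_LOCK_DIR
    return Path(raw).expanduser()


def has_hf_token(env=None) -> bool:
    """True if HF_TOKEN is present and non-empty (needed for gated models)."""
    env = os.environ if env is None else env
    return bool(env.get("HF_TOKEN", "").strip())


if __name__ == "__main__":
    print("lock dir:", resolve_compile_lock_dir())
    if not has_hf_token():
        print("HF_TOKEN is not set; gated models will not be downloadable.")
```

For multi-job runs, the resolved directory should point at storage that every job can see, as noted above.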

Automatically set by Datatrove (do not set manually):

`USE_TF`: Defaults to "0" to disable TensorFlow loading (an existing value is left unchanged)
`TRANSFORMERS_NO_TF`: Defaults to "1" to prevent transformers from importing TensorFlow (an existing value is left unchanged)
`DATATROVE_NODE_RANK`: Node rank in distributed execution
`DATATROVE_EXECUTOR`: Executor type (LOCAL, SLURM, RAY)
`DATATROVE_NODE_IPS`: Comma-separated node IP addresses
`DATATROVE_CPUS_PER_TASK`: CPUs allocated per task
`DATATROVE_MEM_PER_CPU`: Memory per CPU in GB
`DATATROVE_GPUS_ON_NODE`: Number of GPUs on node
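
Pipeline code that needs this information can read the variables rather than set them. The following is a sketch under stated assumptions: `NodeInfo` and `read_node_info` are hypothetical names, the fallback values are illustrative, and the exact value formats Datatrove writes (beyond the comma-separated IP list described above) are not confirmed here.

```python
import os
from dataclasses import dataclass


@dataclass
class NodeInfo:
    executor: str
    node_rank: int
    node_ips: list[str]
    gpus_on_node: int


def read_node_info(env=None) -> NodeInfo:
    """Collect the DATATROVE_* variables listed above into one object."""
    env = os.environ if env is None else env
    # DATATROVE_NODE_IPS is documented as comma-separated; drop empty pieces.
    ips = [ip.strip() for ip in env.get("DATATROVE_NODE_IPS", "").split(",") if ip.strip()]
    return NodeInfo(
        executor=env.get("DATATROVE_EXECUTOR", "LOCAL"),      # assumed fallback
        node_rank=int(env.get("DATATROVE_NODE_RANK", "0")),   # assumed fallback
        node_ips=ips,
        gpus_on_node=int(env.get("DATATROVE_GPUS_ON_NODE", "0")),
    )


if __name__ == "__main__":
    info = read_node_info()
    print(f"{info.executor} node {info.node_rank}: {info.gpus_on_node} GPU(s)")
```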

Quick Install

# Install datatrove with inference dependencies
pip install "datatrove[inference]"

# Or install packages individually
pip install "datatrove[io]" aiofiles httpx aiosqlite vllm sglang bitsandbytes "numpy>=2.0.0,<2.3" typer pyyaml pandas "transformers>=4.57"
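
After installing, it can be useful to confirm that the resolved versions fall inside the ranges listed under Dependencies. The check below is a simplified sketch using only the standard library. The version parser is deliberately naive (it compares numeric release components and ignores pre-release suffixes), and the helper names are illustrative.

```python
from importlib import metadata

# Ranges taken from the dependency list: (minimum inclusive, upper bound exclusive or None).
CONSTRAINTS = {
    "numpy": ("2.0.0", "2.3"),
    "transformers": ("4.57", None),
}


def parse_version(text: str) -> tuple:
    """Turn '2.2.6' into (2, 2, 6); stops at the first non-numeric part, pads to 3 items."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if digits != piece:  # e.g. '0rc1': keep the 0, ignore the suffix
            break
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def in_range(version: str, minimum: str, upper=None) -> bool:
    """True if minimum <= version, and version < upper when an upper bound is given."""
    v = parse_version(version)
    if v < parse_version(minimum):
        return False
    return upper is None or v < parse_version(upper)


def check_installed(constraints=CONSTRAINTS) -> dict:
    """Report each package as 'not installed' or its version with a range verdict."""
    report = {}
    for name, (minimum, upper) in constraints.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        verdict = "ok" if in_range(installed, minimum, upper) else "outside expected range"
        report[name] = f"{installed} ({verdict})"
    return report


if __name__ == "__main__":
    for package, status in check_installed().items():
        print(f"{package}: {status}")
```

For anything beyond a quick check, `pip check` or a proper version parser is the safer choice.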

Code Evidence

GPU requirement check from `examples/inference/generate_data.py:169-177`:

if local_execution:
    import torch
    available_gpus = torch.cuda.device_count()
    if available_gpus == 0:
        raise ValueError("Local execution requires at least one CUDA GPU.")

TensorFlow suppression for faster vLLM startup from `src/datatrove/pipeline/inference/servers/vllm_server.py:93-97`:

# transformers pulls in TensorFlow by default, which adds tens of seconds of startup time
# (we measured ~70-80s at tp=2 on H100). These env vars keep it in PyTorch-only mode so
# vLLM initializes much faster without affecting throughput.
env.setdefault("USE_TF", "0")
env.setdefault("TRANSFORMERS_NO_TF", "1")
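
Because the snippet uses `setdefault`, a value already present in the environment takes precedence over Datatrove's default. The following sketch illustrates that behaviour on a copied environment mapping; `build_server_env` is a hypothetical helper, and how the real server passes `env` to its subprocess is not shown in the excerpt above.

```python
import os


def build_server_env(base_env=None) -> dict:
    """Copy the environment and apply the TensorFlow-suppression defaults."""
    env = dict(os.environ if base_env is None else base_env)  # copy, never mutate the caller's mapping
    env.setdefault("USE_TF", "0")              # only applied if USE_TF is absent
    env.setdefault("TRANSFORMERS_NO_TF", "1")  # only applied if TRANSFORMERS_NO_TF is absent
    return env


if __name__ == "__main__":
    print(build_server_env({})["USE_TF"])               # 0
    print(build_server_env({"USE_TF": "1"})["USE_TF"])  # 1 (existing value wins)
```

This is why an explicit `USE_TF=1` exported in the shell would bring back the slow startup path.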

vLLM dependency check from `src/datatrove/pipeline/inference/servers/vllm_server.py:40`:

check_required_dependencies("VLLM server", ["vllm"])
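
The body of `check_required_dependencies` is not reproduced on this page. As a rough illustration of what such a guard can look like, the sketch below uses `importlib.util.find_spec` to detect missing top-level modules without importing them. The function names and the exact message wording are assumptions, not Datatrove's implementation.

```python
from importlib.util import find_spec


def missing_modules(modules):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in modules if find_spec(name) is None]


def require(feature: str, modules) -> None:
    """Raise ImportError naming every missing module needed by `feature`."""
    missing = missing_modules(modules)
    if missing:
        raise ImportError(f"Please install {', '.join(missing)} to use {feature}")


if __name__ == "__main__":
    try:
        require("VLLM server", ["vllm"])
        print("vllm is importable")
    except ImportError as err:
        print(err)
```

Checking up front like this lets the failure surface with an actionable message before any GPU resources are allocated.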

numpy upper bound constraint from `pyproject.toml:91`:

"numpy>=2.0.0,<2.3", # numba requires numpy<=2.2

Common Errors

Error Message	Cause	Solution
`ValueError: Local execution requires at least one CUDA GPU.`	No CUDA GPU detected	Ensure an NVIDIA GPU with CUDA drivers is available, or use endpoint server mode
`ImportError: Please install vllm to use VLLM server`	vLLM not installed	`pip install vllm`
`CUDA out of memory`	Insufficient VRAM for model	Reduce `gpu_memory_utilization`, use quantization (bitsandbytes), or use multi-GPU parallelism
Slow vLLM startup (~70-80s measured at tp=2 on H100)	TensorFlow being imported by transformers	Keep `USE_TF=0` and `TRANSFORMERS_NO_TF=1` (Datatrove applies both by default)

Compatibility Notes

numpy < 2.3: The inference group pins numpy below 2.3 because numba (used by some inference backends) requires numpy <= 2.2.
transformers >= 4.57: Constrained for vLLM compatibility. The accompanying comment reads "as long as vllm does not support v5".
lighteval excluded from testing: The decont group with lighteval is excluded from CI because lighteval has restrictive vllm version requirements: `vllm>=0.10.0,<0.10.2`.
Flask pin: Testing uses `flask>=3.1.0` due to incorrect werkzeug dependency resolution in older Flask versions.
