Environment:Huggingface Datatrove Inference GPU Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning, Inference |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
GPU-accelerated inference environment with vLLM, SGLang, CUDA support, and async I/O for running LLM inference pipelines at scale.
Description
This environment provides everything needed to run Datatrove's synthetic data generation and LLM inference pipelines. It includes the vLLM and SGLang inference engines, quantization support (bitsandbytes), async HTTP and database clients (httpx, aiosqlite), HuggingFace Transformers, and CLI tools (typer). Local execution requires at least one CUDA-compatible GPU. Datatrove also sets environment variables that keep transformers from importing TensorFlow, which shortens vLLM startup.
Usage
Use this environment for the Synthetic Data Generation workflow and any pipeline step that involves LLM inference: InferenceRunner, InferenceServer (VLLMServer, SGLangServer), CheckpointManager, InferenceDatasetCardGenerator, and InferenceProgressMonitor. Local execution requires CUDA GPUs; remote execution via endpoint servers does not.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | vLLM and SGLang require Linux |
| Hardware | NVIDIA GPU with CUDA support | Minimum 1 GPU for local execution; multi-GPU supported via tensor/pipeline parallelism |
| VRAM | Varies by model | 7B models: ~16GB in 16-bit precision; 70B models: 80GB+ spread across multiple GPUs (16-bit weights alone are ~140GB); quantized models require less |
| Disk | 10GB+ | Model weights downloaded on first use |
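The VRAM figures above follow from simple arithmetic on parameter count and precision. The sketch below is a rough estimate of weight memory only; the function name `weight_memory_gb` is hypothetical, and the KV cache, activations, and framework overhead come on top of the number it returns.

```python
def weight_memory_gb(num_params_billion: float, bits_per_param: int = 16) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes).

    This is a back-of-the-envelope estimate, not a measurement: it ignores
    the KV cache, activations, and any runtime overhead.
    """
    bytes_per_param = bits_per_param / 8
    # (billions of parameters) * (bytes per parameter) = gigabytes
    return num_params_billion * bytes_per_param


if __name__ == "__main__":
    for name, size_b in [("7B", 7), ("70B", 70)]:
        for bits in (16, 8, 4):
            gb = weight_memory_gb(size_b, bits)
            print(f"{name} at {bits}-bit: about {gb:g} GB of weights")
```

A 7B model at 16 bits needs about 14GB for weights, which is why roughly 16GB is a practical floor once the cache is included. 4-bit quantization cuts the weight footprint to about a quarter of the 16-bit value.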
Dependencies
Python Packages (Inference Group)
- `datatrove[io]` — All IO dependencies (transitive)
- `aiofiles` — Async file I/O
- `httpx` — Async HTTP client for API calls
- `aiosqlite` — Async SQLite for request caching
- `vllm` — vLLM LLM inference engine
- `sglang` — SGLang inference engine
- `bitsandbytes` — Quantization support
- `numpy` >= 2.0.0, < 2.3 — Upper bound due to numba compatibility
- `typer` — CLI framework
- `pyyaml` — YAML configuration parsing
- `pandas` — Data manipulation
- `transformers` >= 4.57 — HuggingFace Transformers (vLLM compatibility constraint)
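To confirm that an existing environment meets the two version constraints listed above, a small check along these lines can help. It is a minimal sketch: `parse_version`, `satisfies`, and `check_installed` are hypothetical helpers, and the parser compares only the leading numeric components, so pre-release suffixes are ignored.

```python
from importlib import metadata

# Bounds copied from the dependency list: (inclusive lower, exclusive upper or None).
CONSTRAINTS = {
    "numpy": ((2, 0), (2, 3)),
    "transformers": ((4, 57), None),
}


def parse_version(text):
    """Turn '2.2.6' or '4.57.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def satisfies(version, lower, upper=None):
    """Return True if lower <= version < upper (upper is optional)."""
    parsed = parse_version(version)
    if parsed < lower:
        return False
    if upper is not None and parsed >= upper:
        return False
    return True


def check_installed():
    """Print one status line per constrained package."""
    for name, (lower, upper) in CONSTRAINTS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            print(f"{name}: not installed")
            continue
        status = "ok" if satisfies(installed, lower, upper) else "outside the expected range"
        print(f"{name} {installed}: {status}")


if __name__ == "__main__":
    check_installed()
```

For anything beyond a quick sanity check, `pip check` or the `packaging` library handles full version semantics more rigorously than this tuple comparison.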
Credentials
The following environment variables may be needed:
- `HF_TOKEN`: HuggingFace API token (for accessing gated models like Llama)
- `VLLM_COMPILE_LOCK_DIR`: Directory for vLLM torch.compile cache locks (default: `~/.cache/vllm_compile_locks`). Must be on a shared filesystem when multiple jobs run concurrently.
Automatically set by Datatrove (do not set manually):
- `USE_TF`: Set to "0" to disable TensorFlow loading
- `TRANSFORMERS_NO_TF`: Set to "1" to prevent transformers from importing TensorFlow
- `DATATROVE_NODE_RANK`: Node rank in distributed execution
- `DATATROVE_EXECUTOR`: Executor type (LOCAL, SLURM, RAY)
- `DATATROVE_NODE_IPS`: Comma-separated node IP addresses
- `DATATROVE_CPUS_PER_TASK`: CPUs allocated per task
- `DATATROVE_MEM_PER_CPU`: Memory per CPU in GB
- `DATATROVE_GPUS_ON_NODE`: Number of GPUs on node
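Code running inside a pipeline step can read these variables to adapt to the node it lands on. The following is a sketch under assumptions: the `NodeInfo` dataclass, the `read_node_info` helper, the value types, and the fallback defaults are all illustrative and not part of Datatrove's API; only the variable names come from the list above.

```python
import os
from dataclasses import dataclass, field


@dataclass
class NodeInfo:
    """Hypothetical container for the DATATROVE_* variables."""
    node_rank: int
    executor: str
    node_ips: list = field(default_factory=list)
    cpus_per_task: int = 1
    mem_per_cpu_gb: float = 0.0
    gpus_on_node: int = 0


def read_node_info(env=None):
    """Parse the variables from `env` (defaults to os.environ).

    The fallback values below are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    raw_ips = env.get("DATATROVE_NODE_IPS", "")
    return NodeInfo(
        node_rank=int(env.get("DATATROVE_NODE_RANK", "0")),
        executor=env.get("DATATROVE_EXECUTOR", "LOCAL"),
        # Comma-separated list; drop empty entries and stray whitespace.
        node_ips=[ip.strip() for ip in raw_ips.split(",") if ip.strip()],
        cpus_per_task=int(env.get("DATATROVE_CPUS_PER_TASK", "1")),
        mem_per_cpu_gb=float(env.get("DATATROVE_MEM_PER_CPU", "0")),
        gpus_on_node=int(env.get("DATATROVE_GPUS_ON_NODE", "0")),
    )


if __name__ == "__main__":
    print(read_node_info())
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to exercise without touching the real process environment.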
Quick Install
# Install datatrove with inference dependencies
pip install "datatrove[inference]"
# Or install packages individually
pip install "datatrove[io]" aiofiles httpx aiosqlite vllm sglang bitsandbytes "numpy>=2.0.0,<2.3" typer pyyaml pandas "transformers>=4.57"
Code Evidence
GPU requirement check from `examples/inference/generate_data.py:169-177`:
if local_execution:
    import torch

    available_gpus = torch.cuda.device_count()
    if available_gpus == 0:
        raise ValueError("Local execution requires at least one CUDA GPU.")
TensorFlow suppression for faster vLLM startup from `src/datatrove/pipeline/inference/servers/vllm_server.py:93-97`:
# transformers pulls in TensorFlow by default, which adds tens of seconds of startup time
# (we measured ~70-80s at tp=2 on H100). These env vars keep it in PyTorch-only mode so
# vLLM initializes much faster without affecting throughput.
env.setdefault("USE_TF", "0")
env.setdefault("TRANSFORMERS_NO_TF", "1")
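Because the snippet uses `setdefault`, a value that is already present in the environment is left untouched. The sketch below illustrates that behavior in isolation; `build_server_env` is a hypothetical helper, and how the real server assembles and passes its environment is not shown in the excerpt above.

```python
import os


def build_server_env(base=None):
    """Copy an environment mapping and apply the TensorFlow-suppression defaults.

    Hypothetical helper: it mirrors the two setdefault calls from the excerpt
    but is not Datatrove's implementation.
    """
    env = dict(os.environ if base is None else base)
    # setdefault only writes the key when it is missing, so an explicit
    # value supplied by the caller takes precedence.
    env.setdefault("USE_TF", "0")
    env.setdefault("TRANSFORMERS_NO_TF", "1")
    return env


if __name__ == "__main__":
    print(build_server_env({})["USE_TF"])               # default applied
    print(build_server_env({"USE_TF": "1"})["USE_TF"])  # existing value kept
```

A mapping built this way can be handed to a child process through the `env` argument of `subprocess.Popen`. It also shows why a stray `USE_TF=1` exported in the shell would silently bring back the slow startup.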
vLLM dependency check from `src/datatrove/pipeline/inference/servers/vllm_server.py:40`:
check_required_dependencies("VLLM server", ["vllm"])
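A check of this kind can be approximated with the standard library. The sketch below is an illustrative stand-in, not Datatrove's `check_required_dependencies`: the helper names are hypothetical, and the error text is modeled on the message listed under Common Errors.

```python
import importlib.util


def missing_dependencies(packages):
    """Return the top-level module names that cannot be located.

    find_spec looks a module up without importing it, so the check stays
    cheap even for heavy packages. Only top-level names are supported here.
    """
    return [name for name in packages if importlib.util.find_spec(name) is None]


def require_dependencies(feature, packages):
    """Raise ImportError naming every missing package for `feature`."""
    missing = missing_dependencies(packages)
    if missing:
        raise ImportError(f"Please install {', '.join(missing)} to use {feature}")


if __name__ == "__main__":
    try:
        require_dependencies("VLLM server", ["vllm"])
        print("vllm is available")
    except ImportError as exc:
        print(exc)
```

Failing at construction time, before any model is loaded, surfaces a missing package immediately rather than partway through a long job.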
numpy upper bound constraint from `pyproject.toml:91`:
"numpy>=2.0.0,<2.3", # numba requires numpy<=2.2
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ValueError: Local execution requires at least one CUDA GPU.` | No CUDA GPU detected | Ensure an NVIDIA GPU with working CUDA drivers is visible to the process, or use endpoint server mode |
| `ImportError: Please install vllm to use VLLM server` | vLLM not installed | `pip install vllm` |
| `CUDA out of memory` | Insufficient VRAM for model | Reduce `gpu_memory_utilization`, use quantization (bitsandbytes), or use multi-GPU parallelism |
| Slow vLLM startup (70-80s) | TensorFlow being loaded by transformers | Ensure `USE_TF=0` and `TRANSFORMERS_NO_TF=1` are set (automatic in Datatrove) |
Compatibility Notes
- numpy < 2.3: The inference group pins numpy below 2.3 because numba (used by some inference backends) requires numpy <= 2.2.
- transformers >= 4.57: Minimum version set for vLLM compatibility. The accompanying comment reads "as long as vllm does not support v5".
- lighteval excluded from testing: The decont group with lighteval is excluded from CI because lighteval has restrictive vllm version requirements: `vllm>=0.10.0,<0.10.2`.
- Flask minimum version: Testing requires `flask>=3.1.0` because older Flask versions resolve their werkzeug dependency incorrectly.