Environment:Huggingface Datatrove Slurm Cluster Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, HPC |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
SLURM-based HPC cluster environment for distributed execution of Datatrove pipelines as job arrays.
Description
This environment provides the infrastructure for running Datatrove pipelines on SLURM-managed HPC clusters via the `SlurmPipelineExecutor`. Pipelines are submitted as SLURM job arrays in which each array task processes one shard of the data independently. The executor handles job submission, log collection, and task completion tracking, and it supports job requeueing, staggered start delays, and multi-node distributed inference.
Usage
Use this environment when running large-scale data processing that requires distributed execution across many nodes. All six documented workflows (Common Crawl Processing, MinHash Deduplication, FineWeb Dataset Creation, Dataset Tokenization, Synthetic Data Generation, Summary Statistics) use the `SlurmPipelineExecutor` in their production examples.
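A minimal sketch of such a submission, assuming a simple JSONL-to-JSONL pipeline. Every path, the partition name, the task count, and the resource values are placeholders rather than values from the documented workflows. The `datatrove` imports are deferred into the function so the settings can be inspected on a machine without the package installed.

```python
# Hypothetical settings: the paths, partition name, and sizes below are
# placeholders to adapt to the target cluster.
EXECUTOR_KWARGS = {
    "tasks": 100,                        # number of array tasks (data shards)
    "time": "02:00:00",                  # SLURM time limit
    "partition": "cpu",                  # placeholder partition name
    "logging_dir": "/shared/logs/example_job",  # must live on the shared filesystem
    "job_name": "example_job",
    "cpus_per_task": 1,
    "mem_per_cpu_gb": 2,
    "randomize_start_duration": 60,      # stagger task starts by up to 60 s
}


def submit_pipeline():
    """Build the executor and submit the pipeline as a SLURM job array."""
    # Imported here so this module loads even where datatrove is absent.
    from datatrove.executor.slurm import SlurmPipelineExecutor
    from datatrove.pipeline.readers import JsonlReader
    from datatrove.pipeline.writers import JsonlWriter

    executor = SlurmPipelineExecutor(
        pipeline=[
            JsonlReader("/shared/data/input"),   # placeholder input folder
            JsonlWriter("/shared/data/output"),  # placeholder output folder
        ],
        **EXECUTOR_KWARGS,
    )
    executor.run()
```

Calling `submit_pipeline()` from a login node submits the job array; the call returns once submission is done, and the tasks run under SLURM.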
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | SLURM only runs on Linux |
| Cluster | SLURM job scheduler | Must have `sbatch`, `squeue`, `scontrol`, `srun` commands available |
| Storage | Shared filesystem (NFS, Lustre, GPFS) | All nodes must access the same data and logging directories |
| Network | High-bandwidth interconnect | Required for multi-node inference (Ray-based distributed serving) |
Dependencies
System Packages
- SLURM workload manager (`sbatch`, `squeue`, `scontrol`, `srun`)
- Shared filesystem accessible from all compute nodes
- Python environment accessible from all nodes (e.g., via shared filesystem or container)
Python Packages
No additional Python packages beyond the base `datatrove` installation. The `SlurmPipelineExecutor` is part of the core package.
Credentials
The following environment variables are read (not set) by Datatrove; all but `RUN_OFFSET` are standard SLURM variables:
- `SLURM_JOB_ID`: Current job ID (used for job requeueing)
- `SLURM_ARRAY_TASK_ID`: Array task index within job
- `RUN_OFFSET`: Optional offset for task array indexing
- `SLURM_NODEID`: Node rank within job allocation
- `SLURM_NODELIST`: Node list of the job allocation (expanded into individual hostnames for multi-node jobs)
Set by Datatrove for pipeline steps:
- `DATATROVE_NODE_RANK`: Node rank (0 = master, -1 = single-node)
- `DATATROVE_EXECUTOR`: Set to "SLURM"
- `DATATROVE_NODE_IPS`: Comma-separated node IPs/hostnames
- `DATATROVE_CPUS_PER_TASK`: CPUs allocated per task
- `DATATROVE_MEM_PER_CPU`: Memory per CPU in GB
- `DATATROVE_GPUS_ON_NODE`: Number of GPUs on node
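A pipeline step can read the `DATATROVE_*` variables back from its environment. The sketch below is one way to do so and is not part of Datatrove: `NodeContext` and `read_node_context` are hypothetical names, the fallback defaults are assumptions, and the values are assumed to be plain numeric strings.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass
class NodeContext:
    """Hypothetical container for the DATATROVE_* variables listed above."""
    node_rank: int
    executor: str
    node_ips: List[str]
    cpus_per_task: int
    mem_per_cpu_gb: float
    gpus_on_node: int

    @property
    def is_master(self) -> bool:
        # Rank 0 is the master in multi-node jobs.
        return self.node_rank == 0

    @property
    def is_single_node(self) -> bool:
        # Rank -1 marks a single-node run.
        return self.node_rank == -1


def read_node_context(env: Mapping[str, str] = os.environ) -> NodeContext:
    """Parse the variables; the defaults used when unset are assumptions."""
    ips = env.get("DATATROVE_NODE_IPS", "")
    return NodeContext(
        node_rank=int(env.get("DATATROVE_NODE_RANK", "-1")),
        executor=env.get("DATATROVE_EXECUTOR", ""),
        node_ips=[ip for ip in ips.split(",") if ip],
        cpus_per_task=int(env.get("DATATROVE_CPUS_PER_TASK", "1")),
        mem_per_cpu_gb=float(env.get("DATATROVE_MEM_PER_CPU", "0")),
        gpus_on_node=int(env.get("DATATROVE_GPUS_ON_NODE", "0")),
    )
```

Passing the environment as a mapping keeps the parser easy to exercise outside a SLURM job, for example with a plain dictionary.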
Quick Install
# No additional installation needed beyond base datatrove
pip install datatrove
# Verify SLURM is available
which sbatch && echo "SLURM available"
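The shell check above covers only `sbatch`. A small Python sketch that checks all four required commands; `missing_slurm_commands` is a hypothetical helper, not a Datatrove function.

```python
import shutil

# The four commands listed under System Requirements.
REQUIRED_COMMANDS = ("sbatch", "squeue", "scontrol", "srun")


def missing_slurm_commands(which=shutil.which):
    """Return the required SLURM commands that cannot be found on PATH."""
    return [cmd for cmd in REQUIRED_COMMANDS if which(cmd) is None]
```

An empty list from `missing_slurm_commands()` means every command was found; otherwise the list names the commands that still need to be installed or added to `PATH`.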
Code Evidence
SLURM environment variable reading from `src/datatrove/executor/slurm.py:199-202`:
rank = int(os.environ["SLURM_ARRAY_TASK_ID"])
if "RUN_OFFSET" in os.environ:
    rank -= int(os.environ["RUN_OFFSET"])
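The rank arithmetic in this excerpt can be wrapped in a standalone function for experimentation. This sketch mirrors the excerpt and adds an explicit error for a missing variable; `resolve_rank` is a hypothetical name, and the error message is illustrative rather than Datatrove's own.

```python
import os
from typing import Mapping


def resolve_rank(env: Mapping[str, str] = os.environ) -> int:
    """Compute the rank from the array task ID and the optional offset."""
    if "SLURM_ARRAY_TASK_ID" not in env:
        # Illustrative message; the excerpt itself would raise a KeyError.
        raise RuntimeError("SLURM_ARRAY_TASK_ID not set: run inside a SLURM job array")
    rank = int(env["SLURM_ARRAY_TASK_ID"])
    if "RUN_OFFSET" in env:
        rank -= int(env["RUN_OFFSET"])
    return rank
```

For example, `resolve_rank({"SLURM_ARRAY_TASK_ID": "7", "RUN_OFFSET": "5"})` evaluates to `2`, and without `RUN_OFFSET` the rank equals the array task ID.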
Distributed environment setup from `src/datatrove/executor/base.py:93-100`:
def _set_distributed_environment(self, node_rank: int):
    env_vars = self.get_distributed_env(node_rank)
    os.environ["DATATROVE_NODE_RANK"] = str(node_rank)
    os.environ["DATATROVE_EXECUTOR"] = env_vars["datatrove_executor"]
    os.environ["DATATROVE_NODE_IPS"] = env_vars["datatrove_node_ips"]
    os.environ["DATATROVE_CPUS_PER_TASK"] = env_vars["datatrove_cpus_per_task"]
    os.environ["DATATROVE_MEM_PER_CPU"] = env_vars["datatrove_mem_per_cpu"]
    os.environ["DATATROVE_GPUS_ON_NODE"] = env_vars["datatrove_gpus_on_node"]
Job requeueing from `src/datatrove/executor/slurm.py:44`:
SLURM_JOB_ID = os.environ.get("SLURM_JOB_ID")
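The excerpt shows only that the job ID is read; how the requeue itself is issued is not shown here. One standard way to requeue a job is `scontrol requeue <job_id>`. The sketch below assumes that approach, and both function names are hypothetical.

```python
import os
import subprocess
from typing import List, Mapping, Optional


def requeue_command(env: Mapping[str, str] = os.environ) -> Optional[List[str]]:
    """Build the scontrol call, or return None when not inside a SLURM job."""
    job_id = env.get("SLURM_JOB_ID")
    if not job_id:
        return None
    return ["scontrol", "requeue", job_id]


def requeue_current_job(env: Mapping[str, str] = os.environ, run=subprocess.run) -> bool:
    """Requeue the current job; return False if there is no job to requeue."""
    cmd = requeue_command(env)
    if cmd is None:
        return False
    run(cmd, check=True)  # raises CalledProcessError if scontrol fails
    return True
```

Outside a SLURM job, `SLURM_JOB_ID` is unset and the function returns `False` without invoking `scontrol`.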
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `sbatch: command not found` | SLURM not installed or not in PATH | Ensure SLURM is installed and `sbatch` is accessible |
| `KeyError: 'SLURM_ARRAY_TASK_ID'` | Script run outside a SLURM job array, so the variable is not set | Submit via `SlurmPipelineExecutor.run()`, which handles `sbatch` submission |
| Tasks stuck or incomplete | Shared filesystem latency when many tasks start at once | Increase `randomize_start_duration` to stagger task starts |
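The idea behind staggering task starts can be sketched as a random delay bounded by `randomize_start_duration`. The uniform distribution and the `start_delay` helper are assumptions for illustration, not Datatrove's exact implementation.

```python
import random


def start_delay(randomize_start_duration: int, rng=random) -> float:
    """Pick a delay in seconds between 0 and the configured maximum."""
    if randomize_start_duration <= 0:
        return 0.0  # staggering disabled
    return rng.uniform(0, randomize_start_duration)
```

Each task would sleep for its own delay, for example `time.sleep(start_delay(60))`, so that the array tasks do not all open files on the shared filesystem at the same moment.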
Compatibility Notes
- Completion Tracking: Datatrove uses empty marker files in the `completions/` subdirectory of the logging folder. Ensure the shared filesystem supports many small files efficiently.
- Master Node: In multi-node jobs, `node_rank == 0` is the master. Only the master node writes completion markers and statistics.
- Job Arrays: Each SLURM array task maps to a Datatrove rank. The number of tasks equals the `world_size` (number of data shards).
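The marker-file scheme makes it straightforward to find unfinished ranks. The sketch below assumes a local or mounted logging folder and assumes that each marker is named after its zero-padded rank (for example `00003`); verify that naming against the actual `completions/` folder. `missing_ranks` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def missing_ranks(logging_dir: str, world_size: int) -> List[int]:
    """List ranks that have no marker file under <logging_dir>/completions."""
    completions = Path(logging_dir) / "completions"
    # Assumption: one empty file per finished rank, named with five digits.
    return [
        rank
        for rank in range(world_size)
        if not (completions / f"{rank:05d}").is_file()
    ]
```

A non-empty result after the job array has finished points to tasks that failed or were never scheduled.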