Environment:Huggingface Datatrove Slurm Cluster Environment

From Leeroopedia
Knowledge Sources
Domains Infrastructure, HPC
Last Updated 2026-02-14 17:00 GMT

Overview

SLURM-based HPC cluster environment for distributed execution of Datatrove pipelines as job arrays.

Description

This environment provides the infrastructure for running Datatrove pipelines on SLURM-managed HPC clusters via the `SlurmPipelineExecutor`. Pipelines are submitted as SLURM job arrays in which each array task independently processes one shard of the data. The executor handles job submission, log collection, and task completion tracking, and it supports features such as job requeueing, staggered start delays, and multi-node distributed inference.

Usage

Use this environment when running large-scale data processing that requires distributed execution across many nodes. All six documented workflows (Common Crawl Processing, MinHash Deduplication, FineWeb Dataset Creation, Dataset Tokenization, Synthetic Data Generation, Summary Statistics) use the `SlurmPipelineExecutor` in their production examples.
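As a rough illustration of the usage described above, the following sketch shows how a pipeline might be handed to the `SlurmPipelineExecutor`. The partition name, time limit, task count, and all paths are placeholder assumptions, and `slurm_executor_kwargs` and `submit` are hypothetical helper names introduced only for this example. The `datatrove` imports are done lazily inside `submit` so the module can be loaded on a machine without the package or without SLURM.

```python
def slurm_executor_kwargs(logging_dir: str, tasks: int = 100) -> dict:
    """Collect scheduler settings for the executor (all values are placeholders)."""
    return {
        "tasks": tasks,                   # number of array tasks / data shards
        "time": "12:00:00",               # wall-clock limit per task (assumed value)
        "partition": "cpu",               # hypothetical partition name
        "cpus_per_task": 1,
        "mem_per_cpu_gb": 2,
        "job_name": "example_processing",
        "logging_dir": logging_dir,       # must live on the shared filesystem
        "randomize_start_duration": 120,  # stagger task starts by up to 120 s
    }


def submit(input_dir: str, output_dir: str, logging_dir: str) -> None:
    """Build a trivial read/write pipeline and submit it as a SLURM job array."""
    # Imported here so that importing this module does not require datatrove.
    from datatrove.executor import SlurmPipelineExecutor
    from datatrove.pipeline.readers import JsonlReader
    from datatrove.pipeline.writers import JsonlWriter

    executor = SlurmPipelineExecutor(
        pipeline=[JsonlReader(input_dir), JsonlWriter(output_dir)],
        **slurm_executor_kwargs(logging_dir),
    )
    executor.run()  # submits via sbatch; must be run where SLURM is available


if __name__ == "__main__":
    # Hypothetical paths on a shared filesystem.
    submit("/shared/data/in", "/shared/data/out", "/shared/logs/example_run")
```

Keeping the scheduler settings in a separate function makes it easy to review or reuse them across pipeline stages before anything is submitted.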

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | SLURM only runs on Linux |
| Cluster | SLURM job scheduler | Must have `sbatch`, `squeue`, `scontrol`, `srun` commands available |
| Storage | Shared filesystem (NFS, Lustre, GPFS) | All nodes must access the same data and logging directories |
| Network | High-bandwidth interconnect | Required for multi-node inference (Ray-based distributed serving) |

Dependencies

System Requirements

  • SLURM workload manager (sbatch, squeue, scontrol, srun)
  • Shared filesystem accessible from all compute nodes
  • Python environment accessible from all nodes (e.g., via shared filesystem or container)
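The command requirement listed above can be checked programmatically before submitting anything. The sketch below uses only the standard library; `missing_slurm_commands` is a hypothetical helper name, and the `which` parameter exists only so the lookup can be replaced in tests.

```python
import shutil
from typing import Callable, Iterable, Optional

# Commands named in the requirements above.
REQUIRED_COMMANDS = ("sbatch", "squeue", "scontrol", "srun")


def missing_slurm_commands(
    commands: Iterable[str] = REQUIRED_COMMANDS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list:
    """Return the required commands that cannot be found on PATH."""
    return [cmd for cmd in commands if which(cmd) is None]


if __name__ == "__main__":
    missing = missing_slurm_commands()
    if missing:
        print("Missing SLURM commands:", ", ".join(missing))
    else:
        print("All required SLURM commands are available")
```

This only confirms that the binaries are on the `PATH` of the current shell; it does not verify that the scheduler accepts jobs or that the compute nodes share the same Python environment.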

Python Packages

No additional Python packages are required beyond the base `datatrove` installation; the `SlurmPipelineExecutor` is part of the core package.

Credentials

The following environment variables are read (not set) by Datatrove inside a SLURM job:

  • `SLURM_JOB_ID`: Current job ID (used for job requeueing)
  • `SLURM_ARRAY_TASK_ID`: Array task index within the job array
  • `RUN_OFFSET`: Optional offset subtracted from the array task index to obtain the rank (not a standard SLURM variable)
  • `SLURM_NODEID`: Node rank within the job allocation
  • `SLURM_NODELIST`: Node list for multi-node jobs (SLURM reports it in compact hostlist form, e.g. `node[01-04]`)

Set by Datatrove for pipeline steps:

  • `DATATROVE_NODE_RANK`: Node rank (0 = master, -1 = single-node)
  • `DATATROVE_EXECUTOR`: Set to "SLURM"
  • `DATATROVE_NODE_IPS`: Comma-separated node IPs/hostnames
  • `DATATROVE_CPUS_PER_TASK`: CPUs allocated per task
  • `DATATROVE_MEM_PER_CPU`: Memory per CPU in GB
  • `DATATROVE_GPUS_ON_NODE`: Number of GPUs on node
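A pipeline step can read these variables to decide, for example, whether it is running on the master node. The sketch below is one possible way to do that and is not part of Datatrove itself: `DatatroveRuntime` and `read_datatrove_runtime` are hypothetical names, and the fallback values used when a variable is absent are assumptions chosen for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DatatroveRuntime:
    node_rank: int       # 0 = master, -1 = single-node (per the list above)
    executor: str        # e.g. "SLURM"
    node_ips: tuple      # hostnames/IPs split from the comma-separated value
    cpus_per_task: str   # kept as raw strings; exact formats are not assumed
    mem_per_cpu: str
    gpus_on_node: str

    @property
    def is_single_node(self) -> bool:
        return self.node_rank == -1

    @property
    def is_master(self) -> bool:
        return self.node_rank == 0


def read_datatrove_runtime(env: Mapping[str, str] = os.environ) -> DatatroveRuntime:
    """Parse the DATATROVE_* variables; missing values fall back to assumed defaults."""
    ips = env.get("DATATROVE_NODE_IPS", "")
    return DatatroveRuntime(
        node_rank=int(env.get("DATATROVE_NODE_RANK", "-1")),
        executor=env.get("DATATROVE_EXECUTOR", ""),
        node_ips=tuple(ip for ip in ips.split(",") if ip),
        cpus_per_task=env.get("DATATROVE_CPUS_PER_TASK", ""),
        mem_per_cpu=env.get("DATATROVE_MEM_PER_CPU", ""),
        gpus_on_node=env.get("DATATROVE_GPUS_ON_NODE", ""),
    )
```

Passing the environment as a mapping keeps the parser testable outside a cluster, since a plain dictionary can stand in for `os.environ`.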

Quick Install

# No additional installation needed beyond base datatrove
pip install datatrove

# Verify SLURM is available
which sbatch && echo "SLURM available"

Code Evidence

SLURM environment variable reading from `src/datatrove/executor/slurm.py:199-202`:

rank = int(os.environ["SLURM_ARRAY_TASK_ID"])
if "RUN_OFFSET" in os.environ:
    rank -= int(os.environ["RUN_OFFSET"])
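Because the excerpt above indexes `os.environ` directly, running it outside a job array fails with a bare `KeyError`. A minimal sketch of the same rank computation with a more descriptive failure could look like the following; `resolve_rank` is a hypothetical helper, not a Datatrove function, and the offset handling simply follows the excerpt.

```python
import os
from typing import Mapping


def resolve_rank(env: Mapping[str, str] = os.environ) -> int:
    """Derive the task rank from the SLURM array index, as in the excerpt above."""
    task_id = env.get("SLURM_ARRAY_TASK_ID")
    if task_id is None:
        raise RuntimeError(
            "SLURM_ARRAY_TASK_ID is not set; this code is expected to run "
            "inside a SLURM job array task"
        )
    rank = int(task_id)
    # RUN_OFFSET is optional; when present it is subtracted from the array index.
    if "RUN_OFFSET" in env:
        rank -= int(env["RUN_OFFSET"])
    return rank
```

For example, `resolve_rank({"SLURM_ARRAY_TASK_ID": "7", "RUN_OFFSET": "5"})` evaluates to `2`.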

Distributed environment setup from `src/datatrove/executor/base.py:93-100`:

def _set_distributed_environment(self, node_rank: int):
    env_vars = self.get_distributed_env(node_rank)
    os.environ["DATATROVE_NODE_RANK"] = str(node_rank)
    os.environ["DATATROVE_EXECUTOR"] = env_vars["datatrove_executor"]
    os.environ["DATATROVE_NODE_IPS"] = env_vars["datatrove_node_ips"]
    os.environ["DATATROVE_CPUS_PER_TASK"] = env_vars["datatrove_cpus_per_task"]
    os.environ["DATATROVE_MEM_PER_CPU"] = env_vars["datatrove_mem_per_cpu"]
    os.environ["DATATROVE_GPUS_ON_NODE"] = env_vars["datatrove_gpus_on_node"]

Job requeueing from `src/datatrove/executor/slurm.py:44`:

SLURM_JOB_ID = os.environ.get("SLURM_JOB_ID")

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `sbatch: command not found` | SLURM not installed or not in PATH | Ensure SLURM is installed and `sbatch` is accessible |
| `SLURM_ARRAY_TASK_ID not set` (raised as a `KeyError` when the variable is read from `os.environ`) | Script run outside a SLURM job array | Submit via `SlurmPipelineExecutor.run()`, which handles `sbatch` submission |
| Tasks stuck or incomplete | Shared filesystem latency | Increase `randomize_start_duration` to stagger task starts |

Compatibility Notes

  • Completion Tracking: Datatrove uses empty marker files in the `completions/` subdirectory of the logging folder. Ensure the shared filesystem supports many small files efficiently.
  • Master Node: In multi-node jobs, `node_rank == 0` is the master. Only the master node writes completion markers and statistics.
  • Job Arrays: Each SLURM array task maps to a Datatrove rank. The number of tasks equals the `world_size` (number of data shards).
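The completion-marker scheme described above makes it possible to see which ranks still need to run by listing the `completions/` directory. The sketch below assumes the logging folder is a locally mounted path and that each marker file is named after its rank zero-padded to five digits (for example `00042`); that naming is an assumption and should be checked against an actual logging folder. `missing_ranks` is a hypothetical helper name.

```python
from pathlib import Path


def missing_ranks(logging_dir: str, world_size: int) -> list:
    """List ranks in [0, world_size) that have no completion marker yet."""
    completions = Path(logging_dir) / "completions"
    return [
        rank
        for rank in range(world_size)
        # Assumed marker name: the rank zero-padded to five digits.
        if not (completions / f"{rank:05d}").exists()
    ]


if __name__ == "__main__":
    # Hypothetical logging folder and shard count.
    todo = missing_ranks("/shared/logs/example_run", world_size=100)
    print(f"{len(todo)} task(s) not yet completed")
```

On filesystems where many small metadata lookups are slow, listing the directory once and comparing against the expected names may be cheaper than one `exists()` call per rank.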
