Principle:Huggingface Datatrove Inference Progress Monitoring
| Knowledge Sources | |
|---|---|
| Domains | Monitoring, Observability |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Tracking and reporting progress of long-running inference jobs with live dataset card updates and ETA calculations.
Description
Progress monitoring provides visibility into long-running synthetic data generation jobs by periodically counting completed documents, calculating progress percentages and time estimates, and updating the HuggingFace dataset card with a visual progress bar. This enables stakeholders to track job status without requiring direct access to compute infrastructure or log files.
The monitoring system addresses several challenges specific to large-scale inference jobs:
- Visibility gap: Inference jobs can run for hours or days. Without progress monitoring, there is no way for team members or collaborators to know the job's status without accessing the Slurm cluster or reading logs.
- ETA estimation: By tracking documents processed over time, the monitor calculates throughput (documents per second) and projects completion time, helping with resource planning and scheduling.
- Failure detection: The monitor can optionally track a Slurm job ID, detecting when the inference job has stopped running without producing a stats.json completion marker. This prevents the monitor from running indefinitely after a silent failure.
- Live documentation: By embedding the progress bar directly in the dataset card on HuggingFace Hub, the dataset's README serves as both documentation and a live status dashboard.
The progress monitoring step runs as a companion pipeline step alongside the inference step. It operates independently (typically in a separate Slurm job or process) and polls the HuggingFace Hub repository to count uploaded documents rather than relying on inter-process communication.
Usage
Progress monitoring is used:
- As a companion step to inference for long-running generation jobs (hours to days)
- When team visibility into job progress is needed via the HuggingFace Hub dataset page
- When ETA estimation is needed for resource planning
- When monitoring Slurm job health alongside data generation
Theoretical Basis
Document counting approach:
The monitor counts documents by reading Parquet file metadata from the HuggingFace Hub repository. This approach:
- Reads only each file's metadata (a few KB per file, stored in the Parquet footer), not the actual data
- Avoids downloading large Parquet files to local disk
- Invalidates the cached file listing so that each poll sees fresh file listings
- Sums the row count (num_rows) from each Parquet file's metadata to get the total document count
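The counting step can be sketched as follows. This is a minimal illustration, not the monitor's actual code: the helper names parquet_num_rows and count_documents are hypothetical, pyarrow is assumed to be installed, and the paths are assumed to be readable by pyarrow (the file listing and cache handling on the Hub side are left out).

```python
from typing import Callable, Iterable


def parquet_num_rows(path: str) -> int:
    """Return the row count recorded in a Parquet file's metadata.

    Assumption: pyarrow is installed and `path` is readable by it.
    Only the metadata is parsed; no row groups are loaded.
    """
    import pyarrow.parquet as pq  # imported lazily so the sketch loads without pyarrow

    return pq.ParquetFile(path).metadata.num_rows


def count_documents(
    paths: Iterable[str],
    num_rows: Callable[[str], int] = parquet_num_rows,
) -> int:
    """Sum per-file row counts, skipping anything that is not a Parquet file."""
    return sum(num_rows(p) for p in paths if p.endswith(".parquet"))
```

The row-count reader is passed in as a parameter so that the summation can be exercised with a stub, without touching real files.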
Progress calculation:
The total expected documents is determined from the input dataset:
- First attempts to read from the dataset builder's metadata (fastest, no data download)
- Falls back to loading the dataset and counting rows
- Uses an explicitly configured document count as a fallback, if one is specified
Progress percentage is calculated as completed documents divided by total expected documents, multiplied by 100.
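A minimal sketch of the fallback chain and the percentage calculation is shown below. The function names resolve_total and progress_percent are hypothetical, each source is modelled as a zero-argument callable, and clamping the result to the 0 to 100 range is an assumption of this sketch rather than documented behaviour.

```python
from typing import Callable, Optional, Sequence


def resolve_total(sources: Sequence[Callable[[], Optional[int]]]) -> Optional[int]:
    """Return the first positive count produced by the sources, tried in order."""
    for source in sources:
        try:
            value = source()
        except Exception:
            continue  # a failed lookup falls through to the next source
        if value is not None and value > 0:
            return value
    return None


def progress_percent(completed: int, total: Optional[int]) -> float:
    """Completed share of the total as a percentage, clamped to [0, 100]."""
    if total is None or total <= 0:
        return 0.0  # total unknown: report no progress rather than divide by zero
    return max(0.0, min(100.0, 100.0 * completed / total))
```

For example, 500 completed documents out of an expected 2,000 gives 25.0.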
ETA estimation:
ETA uses a simple linear projection:
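- Compute throughput: documents completed / elapsed seconds
- Compute remaining: total expected documents - documents completed
- Project: remaining documents / throughput, which gives the seconds remaining
This linear model works well for inference workloads, which typically have stable throughput after warmup.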
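The linear projection can be written out in a few lines. This is a sketch under the stated model only; the name estimate_eta and the choice to return None when no throughput can be measured yet are assumptions.

```python
from datetime import timedelta
from typing import Optional


def estimate_eta(completed: int, total: int, elapsed_seconds: float) -> Optional[timedelta]:
    """Project the time remaining from the average throughput so far."""
    if completed <= 0 or elapsed_seconds <= 0:
        return None  # no throughput measurement yet
    throughput = completed / elapsed_seconds  # documents per second
    remaining = max(total - completed, 0)     # documents still to process
    return timedelta(seconds=remaining / throughput)
```

With 250 of 1,000 documents done after one hour, the throughput is about 0.069 documents per second and the projected time remaining is about three hours. Adding the result to the current time gives a projected completion datetime.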
Progress bar rendering:
The visual progress bar uses a 20-character dot format, where filled dots represent completion and empty dots represent remaining work. The bar includes document count, percentage, time remaining, and projected completion datetime.
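A bar of this kind could be rendered as in the sketch below. The glyphs, the function name render_bar, and the label layout are all assumptions made for illustration; the time-remaining and completion-datetime fields are omitted for brevity.

```python
def render_bar(
    completed: int,
    total: int,
    width: int = 20,
    filled: str = "●",  # assumed glyph for completed work
    empty: str = "○",   # assumed glyph for remaining work
) -> str:
    """Render a fixed-width dot bar followed by the count and percentage."""
    if total <= 0:
        n_filled, percent = 0, 0.0
    else:
        clamped = min(max(completed, 0), total)
        n_filled = width * clamped // total  # integer maths avoids float rounding
        percent = 100.0 * clamped / total
    bar = filled * n_filled + empty * (width - n_filled)
    return f"{bar} {completed:,}/{total:,} ({percent:.1f}%)"
```

Calling render_bar(500, 2000) yields five filled dots, fifteen empty dots, and the label "500/2,000 (25.0%)".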
Completion detection:
The monitor uses two completion signals:
- stats.json detection: The datatrove executor writes a stats.json file when the job completes successfully. The monitor checks for this file at each polling interval.
- Slurm job status: If a Slurm job ID is provided, the monitor checks whether the job is still in the Slurm queue. If the job has left the queue and stats.json does not exist, the monitor exits, treating this as a failure or cancellation.
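The two signals combine into a small decision rule, sketched below. Both function names are hypothetical. The queue check assumes Slurm's squeue command is on the PATH and treats empty output or a non-zero exit code as "job no longer queued"; the exact command the monitor runs is not shown in this page, so this is an illustrative stand-in.

```python
import subprocess
from typing import Optional


def job_in_queue(job_id: str) -> bool:
    """Assumption: `squeue -h -j <id>` prints one line per job still in the queue."""
    result = subprocess.run(
        ["squeue", "-h", "-j", job_id],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode == 0 and bool(result.stdout.strip())


def monitor_state(stats_exists: bool, job_running: Optional[bool]) -> str:
    """Combine both signals; `job_running` is None when no job ID was given."""
    if stats_exists:
        return "completed"  # success marker found
    if job_running is False:
        return "failed"     # job left the queue without writing the marker
    return "running"        # still queued, or no Slurm information available
```

Checking the completion marker first means a job that finishes and leaves the queue between two polls is still reported as completed rather than failed.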
Update cadence:
The monitor updates at configurable intervals (default: 3600 seconds / 1 hour). Each update cycle: counts documents -> calculates progress -> renders progress bar -> builds and uploads dataset card.
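The update cycle can be pictured as a simple polling loop. The sketch below is illustrative only: run_monitor and its callable parameters are hypothetical names, with the counting, card publishing, and completion checks injected so the loop itself stays independent of the Hub and Slurm.

```python
import time
from typing import Callable


def run_monitor(
    count_documents: Callable[[], int],
    publish: Callable[[int], None],
    is_finished: Callable[[], bool],
    update_interval: float = 3600.0,  # seconds between polls (the documented default)
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until `is_finished` reports True; return the number of update cycles."""
    cycles = 0
    while True:
        publish(count_documents())  # count, then render and upload the card
        cycles += 1
        if is_finished():
            return cycles
        sleep(update_interval)
```

Publishing before the completion check ensures the final document count is written to the card before the loop exits.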