Implementation:Huggingface Datatrove JobsStatus
| Knowledge Sources | |
|---|---|
| Domains | Pipeline Operations, Monitoring |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
JobsStatus is a command-line tool that scans a directory for pipeline logging folders and reports the completion status of all jobs and their individual tasks.
Description
The jobs_status module provides a monitoring tool that scans a parent directory for subdirectories containing executor.json configuration files, treating each as a separate pipeline job. For each discovered job, it reads the world_size (total number of tasks) from the executor configuration and counts the number of completed tasks by scanning the completions subdirectory for completion marker files.
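The discovery and counting steps described above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: the function name `scan_jobs` is hypothetical, it assumes a local filesystem, it assumes the prefix filter is a simple `startswith` match, and it assumes every file inside `completions` marks one finished task.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def scan_jobs(parent: str, log_prefix: str = "") -> dict[str, tuple[int, int]]:
    """Return {job_name: (completed_tasks, world_size)} for each job folder.

    Hypothetical helper: a folder counts as a job only if it contains a
    readable executor.json with a "world_size" entry.
    """
    results: dict[str, tuple[int, int]] = {}
    for job_dir in sorted(Path(parent).iterdir()):
        # Skip plain files and folders that do not match the prefix filter.
        if not job_dir.is_dir() or not job_dir.name.startswith(log_prefix):
            continue
        executor_file = job_dir / "executor.json"
        if not executor_file.is_file():
            continue  # not a logging folder
        try:
            world_size = json.loads(executor_file.read_text())["world_size"]
        except (json.JSONDecodeError, KeyError, TypeError):
            # Invalid configuration: warn and move on to the next folder.
            logger.warning("Skipping invalid logging folder: %s", job_dir)
            continue
        completions = job_dir / "completions"
        # Assumption: one marker file per completed task.
        completed = (
            sum(1 for f in completions.iterdir() if f.is_file())
            if completions.is_dir()
            else 0
        )
        results[job_dir.name] = (completed, world_size)
    return results
```

A job is complete when the two numbers in its tuple are equal, so a caller can derive both the per-job status and the totals from the returned mapping.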
The tool reports per-job completion status with a visual indicator (a checkmark for complete jobs, a cross for incomplete ones), showing the fraction and percentage of completed tasks for each job. It also provides an aggregate summary showing total jobs completed and total tasks completed across all discovered jobs. The tool supports a log_prefix filter to scan only directories matching a specific naming pattern, and a hide_complete flag to suppress fully completed jobs from the output.
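As a rough illustration of the report layout, the sketch below turns a mapping of job name to `(completed_tasks, world_size)` into status lines. The function name `format_report`, the exact wording of each line, and the choice to keep hidden jobs in the summary totals are all assumptions made for this example. The real tool renders its output through rich, and its formatting may differ.

```python
def format_report(
    jobs: dict[str, tuple[int, int]], hide_complete: bool = False
) -> list[str]:
    """Build per-job status lines plus one aggregate summary line.

    Hypothetical helper: `jobs` maps a job name to (completed, world_size).
    """
    lines: list[str] = []
    done_jobs = done_tasks = total_tasks = 0
    for name, (completed, world_size) in jobs.items():
        is_complete = completed == world_size
        done_jobs += int(is_complete)
        done_tasks += completed
        total_tasks += world_size
        # hide_complete suppresses the line but not the job's contribution
        # to the totals (an assumption of this sketch).
        if is_complete and hide_complete:
            continue
        mark = "✔" if is_complete else "✘"
        pct = completed / world_size if world_size else 0.0
        lines.append(f"{mark} {name}: {completed}/{world_size} ({pct:.0%})")
    lines.append(
        f"Summary: {done_jobs}/{len(jobs)} jobs completed, "
        f"{done_tasks}/{total_tasks} tasks completed"
    )
    return lines


if __name__ == "__main__":
    for line in format_report({"tokenize_a": (2, 2), "tokenize_b": (1, 4)}):
        print(line)
```

Passing `hide_complete=True` drops the `tokenize_a` line in this example and leaves the summary line unchanged.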
The tool uses the rich library for formatted console output, including status spinners during directory scanning and styled log messages. It gracefully handles missing or invalid logging directories by logging warnings and continuing to the next directory.
Usage
Use this tool to get a quick overview of all pipeline jobs in a directory, especially when running multiple pipeline stages or experiments that each produce their own logging folder. This is useful for monitoring batch runs and identifying which jobs need attention.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/tools/jobs_status.py
- Lines: 1-97
Signature
def main():
    """
    Takes a `path` as input, gets all valid job folders and their total
    number of tasks from `executor.json` and then gets which ranks are
    incomplete by scanning `path/{LOGGING_DIRS}/completions`.
    """
Import
from datatrove.tools.jobs_status import main
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str (CLI argument) | No | Path to parent directory containing logging folders (default: current directory) |
| --log_prefix / -p | str | No | Prefix filter for logging directory names (default: empty string, matches all) |
| --hide_complete / -hc | flag | No | When set, hides jobs that are fully complete from the output |
Outputs
| Name | Type | Description |
|---|---|---|
| Console output | Rich formatted text | Per-job completion status with counts and percentages, plus aggregate summary |
Usage Examples
Basic Usage
# Show status of all jobs under /path/to/runs/
python -m datatrove.tools.jobs_status /path/to/runs/
# Show only incomplete jobs with a specific prefix
python -m datatrove.tools.jobs_status /path/to/runs/ -p "tokenize_" -hc