Implementation:Huggingface Datatrove JobsStatus

From Leeroopedia
Knowledge Sources
Domains: Pipeline Operations, Monitoring
Last Updated: 2026-02-14 17:00 GMT

Overview

JobsStatus is a command-line tool that scans a directory for pipeline logging folders and reports the completion status of all jobs and their individual tasks.

Description

The jobs_status module provides a monitoring tool that scans a parent directory for subdirectories containing executor.json configuration files, treating each as a separate pipeline job. For each discovered job, it reads the world_size (total number of tasks) from the executor configuration and counts the number of completed tasks by scanning the completions subdirectory for completion marker files.
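
As a rough illustration of the counting step described above, the sketch below reads world_size from executor.json and counts files under completions for a single logging folder. It uses only the standard library; the helper name count_completed is hypothetical, and treating every regular file in completions as one finished task is an assumption, not necessarily how the module itself identifies markers.

```python
import json
from pathlib import Path


def count_completed(job_dir: Path) -> tuple[int, int]:
    """Return (completed_tasks, world_size) for one logging folder.

    Hypothetical helper for illustration; not part of datatrove.
    """
    # executor.json holds the executor configuration, including world_size.
    with open(job_dir / "executor.json", "rt") as f:
        world_size = json.load(f)["world_size"]

    completions = job_dir / "completions"
    if not completions.is_dir():
        # No completions folder yet: nothing has finished.
        return 0, world_size

    # Assumption: each regular file under completions/ marks one finished task.
    done = sum(1 for p in completions.iterdir() if p.is_file())
    return done, world_size
```

A job is complete when the first value equals the second.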

The tool reports per-job completion status with a visual indicator (a checkmark for complete jobs, a cross for incomplete ones), showing the fraction and percentage of completed tasks for each job. It also provides an aggregate summary showing total jobs completed and total tasks completed across all discovered jobs. The tool supports a log_prefix filter to scan only directories matching a specific naming pattern, and a hide_complete flag to suppress fully completed jobs from the output.
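
The reporting logic described above could look roughly like the following sketch. The function names, the exact glyphs, and the line layout are assumptions chosen for illustration; the real tool renders its output through rich and may format it differently. The sketch also assumes the aggregate summary still counts jobs hidden by hide_complete.

```python
def format_job_line(name: str, done: int, total: int) -> str:
    """Format one job as '<mark> name: done/total (pct%)'. Assumes total > 0."""
    mark = "✓" if done == total else "✗"
    pct = 100.0 * done / total
    return f"{mark} {name}: {done}/{total} ({pct:.1f}%)"


def build_report(jobs: dict[str, tuple[int, int]], hide_complete: bool = False) -> list[str]:
    """Build per-job lines plus one aggregate summary line.

    `jobs` maps a job name to (completed_tasks, world_size).
    """
    lines = []
    for name, (done, total) in sorted(jobs.items()):
        if hide_complete and done == total:
            continue  # suppress fully completed jobs from the per-job listing
        lines.append(format_job_line(name, done, total))

    # Aggregate over all discovered jobs, including hidden ones (assumption).
    complete_jobs = sum(1 for done, total in jobs.values() if done == total)
    total_done = sum(done for done, _ in jobs.values())
    total_tasks = sum(total for _, total in jobs.values())
    lines.append(
        f"Jobs: {complete_jobs}/{len(jobs)} complete; "
        f"tasks: {total_done}/{total_tasks} complete"
    )
    return lines
```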

The tool uses the rich library for formatted console output, including status spinners during directory scanning and styled log messages. It gracefully handles missing or invalid logging directories by logging warnings and continuing to the next directory.
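
The skip-and-continue behaviour can be sketched as below. This is a simplified stand-in, not the module's code: it uses the standard logging module instead of rich and datatrove's own logger, and the function name discover_jobs and the warning texts are hypothetical.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("jobs_status_sketch")


def discover_jobs(parent: Path, log_prefix: str = "") -> dict[str, int]:
    """Map each valid logging folder name under `parent` to its world_size."""
    jobs = {}
    for child in sorted(parent.iterdir()):
        # Only consider directories whose name matches the prefix filter.
        if not child.is_dir() or not child.name.startswith(log_prefix):
            continue

        config = child / "executor.json"
        if not config.is_file():
            logger.warning("%s has no executor.json, skipping", child.name)
            continue

        try:
            config_data = json.loads(config.read_text())
        except json.JSONDecodeError:
            logger.warning("%s has an unreadable executor.json, skipping", child.name)
            continue

        world_size = config_data.get("world_size") if isinstance(config_data, dict) else None
        if not world_size:
            logger.warning("%s has no usable world_size, skipping", child.name)
            continue

        jobs[child.name] = world_size
    return jobs
```

Invalid folders produce a warning and are left out of the result, so one broken folder does not stop the scan.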

Usage

Use this tool to get a quick overview of all pipeline jobs in a directory, especially when running multiple pipeline stages or experiments that each produce their own logging folder. This is useful for monitoring batch runs and identifying which jobs need attention.

Code Reference

Source Location

Signature

def main():
    """
    Takes a `path` as input, gets all valid job folders and their total
    number of tasks from `executor.json` and then gets which ranks are
    incomplete by scanning `path/{LOGGING_DIRS}/completions`.
    """

Import

from datatrove.tools.jobs_status import main

I/O Contract

Inputs

Name | Type | Required | Description
path | str (CLI argument) | No | Path to parent directory containing logging folders (default: current directory)
--log_prefix / -p | str | No | Prefix filter for logging directory names (default: empty string, matches all)
--hide_complete / -hc | flag | No | When set, hides jobs that are fully complete from the output
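
The inputs listed above map naturally onto an argparse parser. The sketch below is one plausible way to declare them, using only the names and defaults given in the table; the help strings, the description, and the build_parser wrapper are assumptions for illustration, not the module's own code.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Declare the three documented inputs of the tool (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        description="Report completion status of pipeline jobs."
    )
    # Optional positional argument: falls back to the current directory.
    parser.add_argument(
        "path", type=str, nargs="?", default=os.getcwd(),
        help="Parent directory containing logging folders.",
    )
    # Empty prefix matches every directory name.
    parser.add_argument(
        "-p", "--log_prefix", type=str, default="",
        help="Only scan logging folders whose name starts with this prefix.",
    )
    # Boolean flag: present means True, absent means False.
    parser.add_argument(
        "-hc", "--hide_complete", action="store_true",
        help="Hide jobs that are fully complete.",
    )
    return parser
```

For example, `build_parser().parse_args(["/path/to/runs/", "-p", "tokenize_", "-hc"])` yields a namespace with path, log_prefix, and hide_complete populated from the command line.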

Outputs

Name | Type | Description
Console output | Rich formatted text | Per-job completion status with counts and percentages, plus aggregate summary

Usage Examples

Basic Usage

# Show status of all jobs under /path/to/runs/ (omit the path to scan the current directory)
python -m datatrove.tools.jobs_status /path/to/runs/

# Show only incomplete jobs with a specific prefix
python -m datatrove.tools.jobs_status /path/to/runs/ -p "tokenize_" -hc
