Implementation:Huggingface Datatrove InspectData

From Leeroopedia
Knowledge Sources
Domains: Data Inspection, Quality Assurance
Last Updated: 2026-02-14 17:00 GMT

Overview

InspectData is a command-line tool for interactively browsing, filtering, and labeling data samples from various file formats (JSONL, Parquet, CSV, WARC) using a rich console interface.

Description

The inspect_data module provides an interactive data exploration utility that reads samples from a data folder using Datatrove's reader infrastructure and displays them one at a time in a paginated rich console. It automatically detects the file format by examining file extensions (supporting `.jsonl`, `.json`, `.csv`, `.parquet`, `.warc`, and their compressed variants) and instantiates the appropriate reader class (JsonlReader, CSVReader, ParquetReader, or WarcReader). Users can also explicitly specify the reader type via a command-line flag.
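The extension-based detection described above can be sketched as follows. This is a minimal illustration, not Datatrove's actual code: the function name `guess_reader_type`, the mapping table, and the set of compression suffixes are all assumptions made for the example. It strips any trailing compression suffix and then maps the remaining extension to one of the reader type names accepted by `--reader`.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical mapping from file extension to reader type name.
EXTENSION_TO_READER = {
    ".jsonl": "jsonl",
    ".json": "jsonl",
    ".csv": "csv",
    ".parquet": "parquet",
    ".warc": "warc",
}

# Assumed set of compression suffixes; the real tool may recognize others.
COMPRESSION_SUFFIXES = {".gz", ".zst", ".bz2", ".xz"}


def guess_reader_type(filename: str) -> Optional[str]:
    """Return a reader type name for `filename`, or None if unrecognized."""
    suffixes = [s.lower() for s in PurePosixPath(filename).suffixes]
    # Drop trailing compression suffixes such as ".gz" in "part.jsonl.gz".
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes:
        return None
    return EXTENSION_TO_READER.get(suffixes[-1])
```

When detection returns `None`, a caller would presumably fall back to the explicit `--reader` flag or report an error.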

The tool supports random sampling through integration with SamplerFilter, allowing users to inspect a configurable fraction of the total samples (for example, `-s 0.1` for roughly 10%). It also supports custom filter expressions: users can enter Python expressions that are evaluated against each sample (e.g., `x.metadata['token_count'] > 5000`) to view only matching documents.
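To illustrate how sampling and expression filtering of this kind can work, here is a small standalone sketch. It does not use Datatrove's `SamplerFilter` or `Document` classes; `Sample`, `make_filter`, and `sample_fraction` are hypothetical stand-ins, and the assumption is that the expression is evaluated with the sample bound to the name `x`, as the example expression suggests.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Hypothetical stand-in for a document with text, id, and metadata."""
    text: str
    id: str
    metadata: dict = field(default_factory=dict)


def make_filter(expression: str):
    """Compile a Python expression into a predicate over a sample named `x`."""
    code = compile(expression, "<filter>", "eval")

    def predicate(x) -> bool:
        # eval runs arbitrary code, so only use it with trusted input.
        return bool(eval(code, {}, {"x": x}))

    return predicate


def sample_fraction(samples, rate: float, seed: int = 0):
    """Yield each sample independently with probability `rate`."""
    rng = random.Random(seed)
    return (s for s in samples if rng.random() < rate)


docs = [
    Sample("long document", "doc-1", {"token_count": 6000}),
    Sample("short document", "doc-2", {"token_count": 10}),
]
keep = make_filter("x.metadata['token_count'] > 5000")
matching_ids = [d.id for d in docs if keep(d)]
```

Because each sample is kept independently, the number of samples shown is only approximately `rate` times the total.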

A key feature is the labeling mode: when a label output path is specified, users can interactively mark each sample as "good" or "bad" (or skip/exit). At the end of the session, labeled samples are written to `good_samples.jsonl` and `bad_samples.jsonl` files in the specified label folder using JsonlWriter. This enables human-in-the-loop quality assessment workflows. Additional reader-specific parameters can be passed as extra command-line arguments (e.g., `text_key=text`).
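The labeling flow can be sketched in plain Python as below. This is an assumption-laden illustration: it writes the files with the standard `json` module rather than Datatrove's JsonlWriter, the decision codes `"g"`, `"b"`, `"s"`, and `"e"` are hypothetical, and samples are represented as plain dictionaries.

```python
import json
from pathlib import Path


def label_samples(samples, decisions):
    """Split samples into good/bad lists from per-sample decisions.

    Hypothetical codes: "g" = good, "b" = bad, "e" = exit, anything else = skip.
    """
    good, bad = [], []
    for sample, decision in zip(samples, decisions):
        if decision == "e":
            break  # stop the session early; later samples stay unlabeled
        if decision == "g":
            good.append(sample)
        elif decision == "b":
            bad.append(sample)
    return good, bad


def write_labels(label_folder, good, bad):
    """Write labeled samples as one JSON object per line."""
    folder = Path(label_folder)
    folder.mkdir(parents=True, exist_ok=True)
    for name, rows in (("good_samples.jsonl", good), ("bad_samples.jsonl", bad)):
        with open(folder / name, "w", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row, ensure_ascii=False) + "\n")
```

In an interactive session the decisions would come from user prompts rather than a precomputed list; the list form is used here only to keep the sketch testable.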

Usage

Use this tool to manually inspect data quality, explore dataset contents, or create labeled subsets for evaluation. It is especially useful during data pipeline development to verify that readers, filters, and transformations are producing the expected output.

Code Reference

Source Location

Signature

def main():
    """Interactive data inspection with optional filtering and labeling."""

def reader_factory(data_folder: DataFolder, reader_type: str = None, **kwargs):
    """Create appropriate reader based on file type or explicit specification."""

Import

from datatrove.tools.inspect_data import main, reader_factory

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| path | str (CLI argument) | No | Path to the data folder to inspect (default: current directory) |
| --reader / -r | str | No | Reader type: "jsonl", "parquet", "csv", or "warc" (default: auto-detected) |
| --sample / -s | float | No | Sampling rate as a fraction of total samples, 1.0 for all (default: 1.0) |
| --label / -l | str | No | Path to save labeled good/bad samples (default: empty, no labeling) |
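A command-line interface matching the inputs listed above could be declared with `argparse` roughly as follows. This is a sketch under stated assumptions, not the tool's actual parser: the helper names `build_parser`, `parse_extra`, and `parse_cli` are hypothetical, and the handling of extra `key=value` arguments (kept as strings here) is a guess at how reader-specific parameters such as `text_key=text` might be collected.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Declare the positional path and the three optional flags."""
    parser = argparse.ArgumentParser(description="Inspect data samples.")
    parser.add_argument("path", nargs="?", default=os.getcwd(),
                        help="Path to the data folder to inspect")
    parser.add_argument("-r", "--reader", default=None,
                        choices=["jsonl", "parquet", "csv", "warc"],
                        help="Reader type (auto-detected when omitted)")
    parser.add_argument("-s", "--sample", type=float, default=1.0,
                        help="Sampling rate as a fraction, 1.0 for all")
    parser.add_argument("-l", "--label", type=str, default="",
                        help="Folder for labeled samples (empty disables labeling)")
    return parser


def parse_extra(extra):
    """Turn leftover 'key=value' strings into a dict of reader kwargs."""
    kwargs = {}
    for item in extra:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        kwargs[key] = value  # values are left as strings in this sketch
    return kwargs


def parse_cli(argv):
    """Parse known flags and collect the rest as reader kwargs."""
    args, extra = build_parser().parse_known_args(argv)
    return args, parse_extra(extra)
```

Using `parse_known_args` lets unrecognized arguments pass through instead of raising an error, which is one simple way to support open-ended reader options.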

Outputs

| Name | Type | Description |
|---|---|---|
| Console output | Rich formatted text | Paginated display of each document's ID, metadata, and text content |
| good_samples.jsonl | JSONL file | Documents labeled as "good" (only when labeling is enabled) |
| bad_samples.jsonl | JSONL file | Documents labeled as "bad" (only when labeling is enabled) |

Usage Examples

Basic Usage

# Inspect all samples in a folder
python -m datatrove.tools.inspect_data /path/to/data/

# Inspect 10% of samples with a specific reader
python -m datatrove.tools.inspect_data /path/to/data/ -r parquet -s 0.1

# Inspect with labeling enabled
python -m datatrove.tools.inspect_data /path/to/data/ -l /path/to/labels/
