Implementation:Huggingface Datatrove InspectData

Knowledge Sources	Huggingface_Datatrove
Domains	Data Inspection, Quality Assurance
Last Updated	2026-02-14 17:00 GMT

Overview

InspectData is a command-line tool for interactively browsing, filtering, and labeling data samples from various file formats (JSONL, Parquet, CSV, WARC) using a rich console interface.

Description

The inspect_data module provides an interactive data exploration utility that reads samples from a data folder using Datatrove's reader infrastructure and displays them one at a time in a paginated rich console. It automatically detects the file format by examining file extensions (supporting `.jsonl`, `.json`, `.csv`, `.parquet`, `.warc`, and their compressed variants) and instantiates the appropriate reader class (JsonlReader, CSVReader, ParquetReader, or WarcReader). Users can also explicitly specify the reader type via a command-line flag.
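The extension-based detection described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: `guess_reader_type`, `SUFFIX_TO_READER`, and the set of compression suffixes are hypothetical names and assumptions introduced here.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file suffixes to reader names; the real tool may
# recognise a different set of extensions.
SUFFIX_TO_READER = {
    ".jsonl": "jsonl",
    ".json": "jsonl",
    ".csv": "csv",
    ".parquet": "parquet",
    ".warc": "warc",
}
# Assumed set of compression suffixes, for illustration only.
COMPRESSION_SUFFIXES = {".gz", ".zst", ".bz2", ".xz"}


def guess_reader_type(filename: str) -> str:
    """Return a reader name for `filename`, ignoring one trailing compression suffix."""
    suffixes = PurePosixPath(filename).suffixes
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in SUFFIX_TO_READER:
        raise ValueError(f"No matching reader for {filename!r}")
    return SUFFIX_TO_READER[suffixes[-1]]
```

In this sketch an unrecognised extension raises an error, which mirrors the idea that the reader must then be chosen explicitly with the `--reader` flag.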

The tool supports random sampling through integration with SamplerFilter, allowing users to inspect a configurable fraction of the total samples (for example, `-s 0.1` for roughly 10%). It also supports custom filter expressions: users can enter a Python expression that is evaluated against each sample, referred to as `x` (e.g., `x.metadata['token_count'] > 5000`), so that only matching documents are shown.
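To make the sampling and filtering behaviour concrete, here is a small self-contained sketch. It does not use Datatrove itself: `Doc`, `make_filter`, and `sample` are hypothetical stand-ins, and independent per-document sampling is an assumption about how a sampler of this kind typically works.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a Datatrove document (hypothetical, for illustration only)."""
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


def make_filter(expression: str = ""):
    """Turn a Python expression over `x` into a predicate; empty keeps everything."""
    if not expression:
        return lambda x: True
    # eval runs arbitrary code, so only use expressions you typed yourself.
    return lambda x: bool(eval(expression, {}, {"x": x}))


def sample(docs, rate: float, seed: int = 0):
    """Keep each document independently with probability `rate` (assumed behaviour)."""
    rng = random.Random(seed)
    for doc in docs:
        if rng.random() < rate:
            yield doc


docs = [Doc(str(i), "...", {"token_count": 1000 * i}) for i in range(10)]
keep = make_filter("x.metadata['token_count'] > 5000")
matching = [d.id for d in docs if keep(d)]  # ids of documents above 5000 tokens
```

Because the expression is passed to `eval`, a tool built this way is best suited to local, trusted use.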

A key feature is the labeling mode: when a label output path is specified, users can interactively mark each sample as "good" or "bad" (or skip/exit). At the end of the session, labeled samples are written to `good_samples.jsonl` and `bad_samples.jsonl` files in the specified label folder using JsonlWriter. This enables human-in-the-loop quality assessment workflows. Additional reader-specific parameters can be passed as extra command-line arguments (e.g., `text_key=text`).
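The end-of-session write-out can be pictured with the sketch below. It uses only the standard library rather than JsonlWriter, and `write_labels` plus the `(label, record)` pair format are hypothetical; only the two output file names come from the description above.

```python
import json
from pathlib import Path


def write_labels(labeled, label_folder):
    """Write (label, record) pairs to good_samples.jsonl and bad_samples.jsonl."""
    folder = Path(label_folder)
    folder.mkdir(parents=True, exist_ok=True)
    buckets = {"good": [], "bad": []}
    for label, record in labeled:
        if label in buckets:  # any other label (e.g. a skip) is not saved
            buckets[label].append(record)
    for label, records in buckets.items():
        with open(folder / f"{label}_samples.jsonl", "w", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    # Return how many records were written per label.
    return {label: len(records) for label, records in buckets.items()}
```

Collecting labels in memory and writing once at the end keeps the interactive loop simple, at the cost of losing labels if the session is killed before it finishes.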

Usage

Use this tool to manually inspect data quality, explore dataset contents, or create labeled subsets for evaluation. It is especially useful during data pipeline development to verify that readers, filters, and transformations are producing the expected output.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/tools/inspect_data.py
Lines: 1-179

Signature

def main():
    """Interactive data inspection with optional filtering and labeling."""

def reader_factory(data_folder: DataFolder, reader_type: str = None, **kwargs):
    """Create appropriate reader based on file type or explicit specification."""

Import

from datatrove.tools.inspect_data import main, reader_factory

I/O Contract

Inputs

Name	Type	Required	Description
path	str (CLI argument)	No	Path to the data folder to inspect (default: current directory)
--reader / -r	str	No	Reader type: "jsonl", "parquet", "csv", or "warc" (default: auto-detected)
--sample / -s	float	No	Sampling rate as a fraction of total samples, 1.0 for all (default: 1.0)
--label / -l	str	No	Path to save labeled good/bad samples (default: empty, no labeling)
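
A command-line interface with the inputs listed above, plus extra `key=value` reader parameters, could be assembled along these lines. This is a sketch under assumptions: `build_parser` and `parse_cli` are hypothetical names, and splitting extras on the first `=` is one plausible way to turn them into reader keyword arguments.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(description="Inspect samples from a data folder.")
    parser.add_argument("path", nargs="?", default=os.getcwd())
    parser.add_argument("-r", "--reader", type=str, default=None)
    parser.add_argument("-s", "--sample", type=float, default=1.0)
    parser.add_argument("-l", "--label", type=str, default="")
    return parser


def parse_cli(argv):
    """Return parsed arguments and a dict of extra key=value reader parameters."""
    args, extras = build_parser().parse_known_args(argv)
    # Values stay strings here; converting them is left to the caller.
    reader_kwargs = dict(extra.split("=", 1) for extra in extras)
    return args, reader_kwargs


args, reader_kwargs = parse_cli(["/path/to/data/", "-s", "0.1", "text_key=text"])
```

Using `parse_known_args` lets unrecognised arguments pass through instead of causing an error, which is what makes the free-form `key=value` extras possible.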

Outputs

Name	Type	Description
Console output	Rich formatted text	Paginated display of each document's ID, metadata, and text content
good_samples.jsonl	JSONL file	Documents labeled as "good" (only when labeling is enabled)
bad_samples.jsonl	JSONL file	Documents labeled as "bad" (only when labeling is enabled)

Usage Examples

Basic Usage

# Inspect all samples in a folder
python -m datatrove.tools.inspect_data /path/to/data/

# Inspect 10% of samples with a specific reader
python -m datatrove.tools.inspect_data /path/to/data/ -r parquet -s 0.1

# Inspect with labeling enabled
python -m datatrove.tools.inspect_data /path/to/data/ -l /path/to/labels/

Related Pages

Principle:Huggingface_Datatrove_Data_Inspection
