Implementation:Huggingface Datatrove InspectData
| Knowledge Sources | |
|---|---|
| Domains | Data Inspection, Quality Assurance |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
InspectData is a command-line tool for interactively browsing, filtering, and labeling data samples from various file formats (JSONL, Parquet, CSV, WARC) using a rich console interface.
Description
The inspect_data module provides an interactive data exploration utility that reads samples from a data folder using Datatrove's reader infrastructure and displays them one at a time in a paginated rich console. It automatically detects the file format by examining file extensions (supporting `.jsonl`, `.json`, `.csv`, `.parquet`, `.warc`, and their compressed variants) and instantiates the appropriate reader class (JsonlReader, CSVReader, ParquetReader, or WarcReader). Users can also explicitly specify the reader type via a command-line flag.
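The extension-based detection described above can be sketched roughly as follows. This is a standalone illustration, not the module's code: `guess_reader_type`, `EXTENSION_TO_READER`, and the set of compression suffixes are hypothetical names and assumptions, and the real tool may match extensions differently.

```python
from pathlib import PurePosixPath

# Assumed mapping from file extension to reader name (illustrative only).
EXTENSION_TO_READER = {
    ".jsonl": "jsonl",
    ".json": "jsonl",
    ".csv": "csv",
    ".parquet": "parquet",
    ".warc": "warc",
}
# Assumption: which compression suffixes count as "compressed variants".
COMPRESSION_SUFFIXES = {".gz", ".zst", ".bz2", ".xz"}


def guess_reader_type(filename: str) -> str:
    """Return a reader name for `filename`, ignoring one trailing compression suffix."""
    suffixes = PurePosixPath(filename).suffixes
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in EXTENSION_TO_READER:
        raise ValueError(f"Cannot infer a reader type from {filename!r}")
    return EXTENSION_TO_READER[suffixes[-1]]


reader_name = guess_reader_type("data/part-0001.jsonl.gz")
```

When detection fails or guesses wrong, the `--reader` flag described in the I/O Contract overrides it.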
The tool supports random sampling through integration with SamplerFilter, allowing users to inspect a configurable fraction of the samples (for example, `-s 0.1` for roughly 10%). It also supports custom filtering expressions: users can enter a Python expression that is evaluated against each sample, bound to the name `x` (e.g., `x.metadata['token_count'] > 5000`), so that only matching documents are shown.
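A minimal sketch of how sampling and an expression filter could combine, assuming each sample is kept with probability equal to the sampling rate and that the expression sees the sample as `x`. The helpers `build_filter` and `select` are hypothetical, and `SimpleNamespace` objects stand in for Datatrove documents. Because the expression is passed to `eval`, this approach is only appropriate for expressions typed by a trusted user.

```python
import random
from types import SimpleNamespace
from typing import Any, Callable, Iterable, Iterator


def build_filter(expression: str = "") -> Callable[[Any], bool]:
    """Turn a Python expression over `x` into a predicate; empty keeps everything."""
    if not expression:
        return lambda x: True
    code = compile(expression, "<filter>", "eval")
    return lambda x: bool(eval(code, {}, {"x": x}))


def select(
    docs: Iterable[Any], rate: float = 1.0, expression: str = "", seed: int = 0
) -> Iterator[Any]:
    """Yield documents that survive random sampling and match the expression."""
    rng = random.Random(seed)
    keep = build_filter(expression)
    for doc in docs:
        # rng.random() is in [0, 1), so rate=1.0 keeps every document.
        if rng.random() < rate and keep(doc):
            yield doc


# Stand-in documents (not real Datatrove Document objects).
docs = [
    SimpleNamespace(id="a", text="short page", metadata={"token_count": 120}),
    SimpleNamespace(id="b", text="long page", metadata={"token_count": 9000}),
]
matching = list(select(docs, rate=1.0, expression="x.metadata['token_count'] > 5000"))
```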
A key feature is the labeling mode: when a label output path is specified, users can interactively mark each sample as "good" or "bad" (or skip/exit). At the end of the session, labeled samples are written to `good_samples.jsonl` and `bad_samples.jsonl` files in the specified label folder using JsonlWriter. This enables human-in-the-loop quality assessment workflows. Additional reader-specific parameters can be passed as extra command-line arguments (e.g., `text_key=text`).
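The end-of-session write step can be illustrated with the standard library alone. This sketch uses `json` as a stand-in for JsonlWriter, represents documents as plain dictionaries, and introduces a hypothetical helper `write_labeled`; only the two output file names come from the description above.

```python
import json
import tempfile
from pathlib import Path


def write_labeled(labels, label_folder):
    """Write (document, verdict) pairs to good_samples.jsonl / bad_samples.jsonl.

    Verdicts other than "good" or "bad" (for example "skip") are ignored.
    Returns the number of documents written per verdict.
    """
    folder = Path(label_folder)
    folder.mkdir(parents=True, exist_ok=True)
    buckets = {"good": [], "bad": []}
    for doc, verdict in labels:
        if verdict in buckets:
            buckets[verdict].append(doc)
    for verdict, docs in buckets.items():
        out_path = folder / f"{verdict}_samples.jsonl"
        with open(out_path, "w", encoding="utf-8") as fh:
            for doc in docs:
                fh.write(json.dumps(doc, ensure_ascii=False) + "\n")
    return {verdict: len(docs) for verdict, docs in buckets.items()}


# Example session: three documents, one of them skipped.
session = [
    ({"id": "doc-1", "text": "clean article"}, "good"),
    ({"id": "doc-2", "text": "spam spam spam"}, "bad"),
    ({"id": "doc-3", "text": "unsure"}, "skip"),
]
with tempfile.TemporaryDirectory() as tmp:
    counts = write_labeled(session, tmp)
    good_lines = (Path(tmp) / "good_samples.jsonl").read_text(encoding="utf-8").splitlines()
    bad_lines = (Path(tmp) / "bad_samples.jsonl").read_text(encoding="utf-8").splitlines()
```

Collecting labels in memory and writing once at the end mirrors the behavior described above, with the trade-off that labels are lost if the session is interrupted before the write.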
Usage
Use this tool to manually inspect data quality, explore dataset contents, or create labeled subsets for evaluation. It is especially useful during data pipeline development to verify that readers, filters, and transformations are producing the expected output.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/tools/inspect_data.py
- Lines: 1-179
Signature
def main():
    """Interactive data inspection with optional filtering and labeling."""

def reader_factory(data_folder: DataFolder, reader_type: str = None, **kwargs):
    """Create appropriate reader based on file type or explicit specification."""
Import
from datatrove.tools.inspect_data import main, reader_factory
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str (CLI argument) | No | Path to the data folder to inspect (default: current directory) |
| --reader / -r | str | No | Reader type: "jsonl", "parquet", "csv", or "warc" (default: auto-detected) |
| --sample / -s | float | No | Sampling rate as a fraction of total samples, 1.0 for all (default: 1.0) |
| --label / -l | str | No | Path to save labeled good/bad samples (default: empty, no labeling) |
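The input table can be reconstructed as an `argparse` parser along these lines. This is a sketch derived from the table, not the tool's actual parser: `build_parser` and `parse_cli` are hypothetical names, and restricting `--reader` with `choices` and collecting `key=value` extras via `parse_known_args` are assumptions about how such a CLI could be built.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented inputs (illustrative only)."""
    parser = argparse.ArgumentParser(description="Inspect data samples interactively.")
    parser.add_argument("path", nargs="?", default=os.getcwd(),
                        help="Data folder to inspect (default: current directory)")
    # Assumption: the four documented reader names are enforced via `choices`.
    parser.add_argument("-r", "--reader", choices=["jsonl", "parquet", "csv", "warc"],
                        default=None, help="Reader type (default: auto-detected)")
    parser.add_argument("-s", "--sample", type=float, default=1.0,
                        help="Fraction of samples to show, 1.0 for all")
    parser.add_argument("-l", "--label", default="",
                        help="Folder for labeled good/bad samples (empty: no labeling)")
    return parser


def parse_cli(argv):
    """Parse known flags and turn leftover key=value arguments into reader kwargs."""
    args, extras = build_parser().parse_known_args(argv)
    # An extra argument without "=" raises ValueError here.
    reader_kwargs = dict(item.split("=", 1) for item in extras)
    return args, reader_kwargs


args, reader_kwargs = parse_cli(
    ["/path/to/data/", "-r", "parquet", "-s", "0.1", "text_key=text"]
)
```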
Outputs
| Name | Type | Description |
|---|---|---|
| Console output | Rich formatted text | Paginated display of each document's ID, metadata, and text content |
| good_samples.jsonl | JSONL file | Documents labeled as "good" (only when labeling is enabled) |
| bad_samples.jsonl | JSONL file | Documents labeled as "bad" (only when labeling is enabled) |
Usage Examples
Basic Usage
# Inspect all samples in a folder
python -m datatrove.tools.inspect_data /path/to/data/
# Inspect 10% of samples with a specific reader
python -m datatrove.tools.inspect_data /path/to/data/ -r parquet -s 0.1
# Inspect with labeling enabled
python -m datatrove.tools.inspect_data /path/to/data/ -l /path/to/labels/