

Environment:Huggingface Datatrove IO Dependencies

From Leeroopedia
Knowledge Sources
Domains: Infrastructure, Data_Processing
Last Updated: 2026-02-14 17:00 GMT

Overview

I/O dependency group that adds support for reading WARC archives and for reading and writing Parquet, JSONL (optionally zstd-compressed), and HuggingFace datasets.

Description

This environment extends the base Python runtime with packages needed for reading and writing various data formats. It includes WARC archive handling (warcio), columnar data support (pyarrow), HuggingFace datasets integration, character encoding detection (faust-cchardet), MIME type detection (python-magic), fast JSON serialization (orjson), and Zstandard compression support.

Usage

Use this environment when running any pipeline that reads or writes data files in these formats. This covers the reader steps (`WarcReader`, `JsonlReader`, `HuggingFaceDatasetReader`, `ParquetReader`), the writer steps (`JsonlWriter`, `ParquetWriter`, `HuggingFaceDatasetWriter`), and any other step that handles WARC archives or compressed data.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | `python-magic` requires `libmagic` system library |
| System Libraries | `libmagic` | Required by `python-magic` for MIME type detection |

Dependencies

Python Packages

  • `faust-cchardet` — Character encoding detection
  • `pyarrow` — Apache Arrow / Parquet format support
  • `python-magic` — MIME type detection (requires system `libmagic`)
  • `warcio` — WARC web archive format reading
  • `datasets` >= 3.1.0 — HuggingFace Datasets library
  • `orjson` — Fast JSON serialization
  • `zstandard` — Zstandard compression/decompression

Credentials

  • `HF_TOKEN`: HuggingFace API token (required when accessing gated or private datasets through `HuggingFaceDatasetReader`)
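
A job that needs gated or private datasets can check for the token before the pipeline starts, rather than failing partway through a run. The following is a minimal sketch using only the standard library. `read_hf_token` is a hypothetical helper name, not part of datatrove. It assumes the token is passed through the `HF_TOKEN` environment variable, which the HuggingFace Hub client reads on its own.

```python
import os
from typing import Mapping, Optional


def read_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the HF_TOKEN value from `env`, or None if it is unset or blank.

    Hypothetical helper for illustration; datatrove does not ship it.
    """
    token = env.get("HF_TOKEN", "").strip()
    return token or None


if __name__ == "__main__":
    if read_hf_token() is None:
        print("HF_TOKEN is not set: gated or private datasets will not be readable")
    else:
        print("HF_TOKEN found")
```

Public datasets do not need the token, so a missing value is only a problem for gated or private sources.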

Quick Install

# Install datatrove with IO dependencies
pip install "datatrove[io]"

# Or install packages individually
pip install faust-cchardet pyarrow python-magic warcio "datasets>=3.1.0" orjson zstandard

# System dependency (Ubuntu/Debian)
sudo apt-get install libmagic1
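
After installing, it can be useful to confirm which of the listed distributions are actually present in the active environment. The sketch below uses the standard library `importlib.metadata` and the pip names from the list above. `installed_versions` and `IO_DISTRIBUTIONS` are illustrative names, not datatrove APIs, and the check does not cover the `libmagic` system library.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Distribution (pip) names from the io dependency group, not import names.
IO_DISTRIBUTIONS = (
    "faust-cchardet",
    "pyarrow",
    "python-magic",
    "warcio",
    "datasets",
    "orjson",
    "zstandard",
)


def installed_versions(
    names: Iterable[str] = IO_DISTRIBUTIONS,
) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'MISSING'}")
```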

Code Evidence

IO dependency group from `pyproject.toml:39-47`:

io = [
  "faust-cchardet",
  "pyarrow",
  "python-magic",
  "warcio",
  "datasets>=3.1.0",
  "orjson",
  "zstandard",
]

Optional availability check from `src/datatrove/utils/_import_utils.py:76-77`:

def is_pyarrow_available():
    return _is_package_available("pyarrow")
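
The excerpt above shows only the public check. The body of `_is_package_available` is not reproduced here. As a rough sketch of the general pattern, and not datatrove's actual implementation, an availability check can be paired with a guard that raises a readable `ImportError` when a step needs a missing package. `is_package_available` and `require_package` below are hypothetical names.

```python
import importlib.util


def is_package_available(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def require_package(module_name: str, feature: str) -> None:
    """Raise ImportError with an install hint if `module_name` is missing.

    Hypothetical guard for illustration; the message format mirrors the
    errors listed on this page but is not copied from datatrove's source.
    """
    if not is_package_available(module_name):
        raise ImportError(f"Please install {module_name} to use {feature}")


if __name__ == "__main__":
    try:
        require_package("pyarrow", "ParquetReader")
        print("pyarrow is available")
    except ImportError as exc:
        print(exc)
```

Checking with `find_spec` avoids importing a heavy package such as `pyarrow` just to learn whether it is installed.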

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: Please install pyarrow to use ParquetReader` | pyarrow not installed | `pip install "datatrove[io]"` or `pip install pyarrow` |
| `ImportError: Please install warcio to use WarcReader` | warcio not installed | `pip install "datatrove[io]"` or `pip install warcio` |
| `MagicException: could not find any magic files` | libmagic system library missing | `sudo apt-get install libmagic1` on Ubuntu/Debian |

Compatibility Notes

  • `datasets` >= 3.1.0: The minimum version pinned by the `io` extra in `pyproject.toml`, for compatibility with the current HuggingFace ecosystem.
  • `orjson`: Provides faster JSON serialization than stdlib `json`. If it is not installed, stdlib `json` is used instead.
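
Both notes can be illustrated with a small standard-library sketch. It is an assumption about how such checks could look, not code taken from datatrove: `dumps_line` and `meets_minimum` are hypothetical names, and the version comparison looks only at the numeric release segments, so a pre-release such as `3.1.0rc1` is treated the same as `3.1.0`.

```python
import json
import re
from typing import Tuple

try:
    import orjson  # optional accelerator

    def dumps_line(record) -> str:
        """Serialize one record to a compact JSON string using orjson."""
        return orjson.dumps(record).decode("utf-8")

except ImportError:

    def dumps_line(record) -> str:
        """Serialize one record to a compact JSON string using stdlib json."""
        return json.dumps(record, ensure_ascii=False, separators=(",", ":"))


def _release_tuple(version: str) -> Tuple[int, ...]:
    """Extract leading numeric components, e.g. '3.1.0rc1' -> (3, 1, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if `installed` is at least `minimum` (numeric parts only)."""
    a, b = _release_tuple(installed), _release_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '3.2' compares as (3, 2, 0).
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

The installed `datasets` version can be obtained with `importlib.metadata.version("datasets")` and passed to `meets_minimum(..., "3.1.0")`.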
