Environment:Huggingface Datatrove IO Dependencies
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Data_Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
I/O dependency group providing support for reading and writing WARC, Parquet, JSONL (with zstd compression), and HuggingFace datasets.
Description
This environment extends the base Python runtime with packages needed for reading and writing various data formats. It includes WARC archive handling (warcio), columnar data support (pyarrow), HuggingFace datasets integration, character encoding detection (faust-cchardet), MIME type detection (python-magic), fast JSON serialization (orjson), and Zstandard compression support.
Usage
Use this environment when running any pipeline that reads or writes data files. This includes all reader steps (WarcReader, JsonlReader, HuggingFaceDatasetReader, ParquetReader), all writer steps (JsonlWriter, ParquetWriter, HuggingFaceDatasetWriter), and any step that handles WARC archives or compressed data.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | `python-magic` requires `libmagic` system library |
| System Libraries | `libmagic` | Required by `python-magic` for MIME type detection |
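A quick preflight check for the system library can be sketched with the standard library alone. The function name `find_libmagic` is an illustrative assumption, not part of datatrove or `python-magic`. `python-magic` may search additional locations, so a `None` result is a hint rather than proof that the library is absent.

```python
import ctypes.util
from typing import Optional


def find_libmagic() -> Optional[str]:
    """Return the name or path the loader resolves for libmagic, or None."""
    # find_library expects the bare name, without the "lib" prefix or a suffix.
    return ctypes.util.find_library("magic")


if __name__ == "__main__":
    location = find_libmagic()
    if location is None:
        print("libmagic not found; on Ubuntu/Debian try: sudo apt-get install libmagic1")
    else:
        print(f"libmagic found: {location}")
```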
Dependencies
Python Packages
- `faust-cchardet` — Character encoding detection
- `pyarrow` — Apache Arrow / Parquet format support
- `python-magic` — MIME type detection (requires system `libmagic`)
- `warcio` — WARC web archive format reading
- `datasets` >= 3.1.0 — HuggingFace Datasets library
- `orjson` — Fast JSON serialization
- `zstandard` — Zstandard compression/decompression
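Several of these packages are imported under a different name than the one used to install them. The sketch below reports which ones cannot be located in the current interpreter. The helper name `missing_io_packages` is hypothetical, and the check only locates modules without importing them.

```python
import importlib.util

# Distribution name (as installed by pip) -> top-level module name (as imported).
IO_PACKAGES = {
    "faust-cchardet": "cchardet",
    "pyarrow": "pyarrow",
    "python-magic": "magic",
    "warcio": "warcio",
    "datasets": "datasets",
    "orjson": "orjson",
    "zstandard": "zstandard",
}


def missing_io_packages(packages=IO_PACKAGES):
    """Return the distribution names whose module cannot be found."""
    return [
        dist
        for dist, module in packages.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_io_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All IO packages are importable.")
```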
Credentials
- `HF_TOKEN`: HuggingFace API token (required when accessing gated or private datasets through `HuggingFaceDatasetReader`)
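Before starting a pipeline that reads a gated or private dataset, it can help to confirm that the variable is set. This is a minimal sketch: `get_hf_token` is a hypothetical helper, and it only checks that a non-blank value exists, not that the token is valid.

```python
import os
from typing import Mapping, Optional


def get_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return HF_TOKEN with surrounding whitespace stripped, or None if unset or blank."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


if __name__ == "__main__":
    if get_hf_token() is None:
        print("HF_TOKEN is not set; gated or private datasets will not be accessible.")
    else:
        print("HF_TOKEN is set.")
```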
Quick Install
```bash
# Install datatrove with IO dependencies
pip install "datatrove[io]"

# Or install packages individually
pip install faust-cchardet pyarrow python-magic warcio "datasets>=3.1.0" orjson zstandard

# System dependency (Ubuntu/Debian)
sudo apt-get install libmagic1
```
Code Evidence
IO dependency group from `pyproject.toml:39-47`:
```toml
io = [
  "faust-cchardet",
  "pyarrow",
  "python-magic",
  "warcio",
  "datasets>=3.1.0",
  "orjson",
  "zstandard",
]
```
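The only version pin in this group is `datasets>=3.1.0`. The following sketch checks an installed version against that minimum using only the standard library. All function names are illustrative assumptions. The comparison looks only at the numeric release segment, so a pre-release such as `3.1.0rc1` is treated as `3.1.0`; the third-party `packaging` library would be the stricter choice where that matters.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string):
    """Extract the leading numeric release, e.g. '3.1.0.dev0' -> (3, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"unrecognized version: {version_string!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, minimum):
    """Compare numeric releases, padding with zeros so '3.1' equals '3.1.0'."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def datasets_is_recent_enough(minimum="3.1.0"):
    """Return True or False for the installed `datasets`, or None if it is absent."""
    try:
        return meets_minimum(version("datasets"), minimum)
    except PackageNotFoundError:
        return None
```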
Optional availability check from `src/datatrove/utils/_import_utils.py:76-77`:
```python
def is_pyarrow_available():
    return _is_package_available("pyarrow")
```
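The body of `_is_package_available` is not shown in the excerpt above. One plausible shape for such a helper, paired with a guard that fails early with an install hint, is sketched here. Both `is_package_available` and `require_package` are hypothetical names written for illustration, not datatrove's actual implementation. The error text follows the `Please install ... to use ...` pattern that the optional readers report.

```python
import importlib.util


def is_package_available(module_name: str) -> bool:
    """Return True if the module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def require_package(module_name: str, step_name: str) -> None:
    """Raise ImportError with an install hint if the module is missing."""
    if not is_package_available(module_name):
        raise ImportError(f"Please install {module_name} to use {step_name}")
```

Calling `require_package("pyarrow", "ParquetReader")` at the start of a step surfaces a missing dependency before any data is read.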
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: Please install pyarrow to use ParquetReader` | pyarrow not installed | `pip install "datatrove[io]"` or `pip install pyarrow` |
| `ImportError: Please install warcio to use WarcReader` | warcio not installed | `pip install "datatrove[io]"` or `pip install warcio` |
| `MagicException: could not find any magic files` | libmagic system library or its magic database files missing | `sudo apt-get install libmagic1` on Ubuntu/Debian |
Compatibility Notes
- `datasets` >= 3.1.0: Minimum version pinned by the `io` extra, required for compatibility with the current HuggingFace ecosystem.
- `orjson`: Provides faster JSON serialization than the standard-library `json` module, which is used as a fallback when `orjson` is not installed.
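An optional-import fallback of the kind described for JSON serialization can be written as follows. This is a generic sketch, not taken from the datatrove source, and `dumps_line` is a hypothetical helper. The two backends differ in whitespace (`orjson` emits compact output), but both produce a JSON line that parses to the same record.

```python
import json

try:
    import orjson  # optional, faster backend
except ImportError:
    orjson = None


def dumps_line(record: dict) -> bytes:
    """Serialize one record as a UTF-8 JSON line ending in a newline."""
    if orjson is not None:
        return orjson.dumps(record, option=orjson.OPT_APPEND_NEWLINE)
    # Standard-library fallback; ensure_ascii=False keeps non-ASCII text as UTF-8.
    return (json.dumps(record, ensure_ascii=False) + "\n").encode("utf-8")
```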