Implementation:Huggingface Datatrove CsvReader

Knowledge Sources	Huggingface_Datatrove
Domains	Data Processing, ETL
Last Updated	2026-02-14 17:00 GMT

Overview

CsvReader is a pipeline reader that reads data from CSV files, converting each row into a separate Document object for downstream processing.

Description

CsvReader extends BaseDiskReader to provide CSV-specific data ingestion capabilities within the Datatrove pipeline framework. It uses Python's built-in csv.DictReader to parse CSV files, treating each row as a dictionary that is then converted into a Document via the inherited get_document_from_dict method.
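The row-to-document step described above can be illustrated with the standard library alone. This is a minimal sketch, not Datatrove's own code: row_to_doc_dict is a hypothetical helper, and the assumption that all remaining columns end up as metadata is made for illustration only.

```python
import csv
import io


def row_to_doc_dict(row: dict, text_key: str = "text", id_key: str = "id") -> dict:
    """Hypothetical helper: split one CSV row into text, id, and metadata."""
    row = dict(row)  # copy so the caller's dict is not mutated
    return {
        "text": row.pop(text_key, ""),
        "id": row.pop(id_key, None),
        "metadata": row,  # assumption: every remaining column becomes metadata
    }


# csv.DictReader uses the header line as keys and yields one dict per row.
sample = io.StringIO("id,text,lang\n1,hello world,en\n2,bonjour,fr\n")
docs = [row_to_doc_dict(row) for row in csv.DictReader(sample)]
```

Note that csv.DictReader returns every value as a string, so numeric IDs arrive as "1" rather than 1.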

The reader supports compression options including gzip and zstd (with an "infer" default that automatically detects compression from the file extension). It inherits all standard reader features such as file sharding, progress tracking, document limiting and skipping, and custom adapter functions for transforming raw CSV rows into the expected Document format.
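To make the "infer" behaviour concrete, the following sketch picks a codec from the file extension and opens the file in text mode. It is an illustration under stated assumptions: infer_compression and open_csv_text are hypothetical names, the extension mapping (.gz, .zst) is assumed, and zstd is deliberately left unimplemented because it needs a third-party package.

```python
import csv
import gzip
import os
import tempfile


def infer_compression(path: str):
    """Hypothetical helper: guess the codec from the file extension."""
    if path.endswith(".gz"):
        return "gzip"
    if path.endswith(".zst"):
        return "zstd"
    return None


def open_csv_text(path: str, compression="infer"):
    """Open a CSV file as text, decompressing gzip when requested or inferred."""
    if compression == "infer":
        compression = infer_compression(path)
    if compression == "gzip":
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    if compression is None:
        return open(path, "r", encoding="utf-8", newline="")
    # zstd would require an extra dependency, so this sketch stops here.
    raise NotImplementedError(f"{compression} is not handled in this sketch")


# Write a small gzip-compressed CSV, then read it back through the helper.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv.gz")
    with gzip.open(path, "wt", encoding="utf-8", newline="") as f:
        f.write("id,text\n1,hello\n")
    with open_csv_text(path) as f:
        rows = list(csv.DictReader(f))
```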

A convenience alias, CSVReader, is also provided. Because the class reads files through Datatrove's DataFolder abstraction, it can read from local filesystems, S3, or other supported storage backends.

Usage

Use CsvReader when your input data is stored in CSV format and you need to ingest it into a Datatrove processing pipeline. It suits structured tabular data where each row represents one document; the text_key and id_key parameters select the columns that hold the document text and ID.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/readers/csv.py
Lines: 1-80

Signature

class CsvReader(BaseDiskReader):
    name = "🔢 Csv"

    def __init__(
        self,
        data_folder: DataFolderLike,
        paths_file: DataFileLike | None = None,
        compression: Literal["infer", "gzip", "zstd"] | None = "infer",
        limit: int = -1,
        skip: int = 0,
        file_progress: bool = False,
        doc_progress: bool = False,
        adapter: Callable = None,
        text_key: str = "text",
        id_key: str = "id",
        default_metadata: dict = None,
        recursive: bool = True,
        glob_pattern: str | None = None,
        shuffle_files: bool = False,
    )

Import

from datatrove.pipeline.readers.csv import CsvReader

I/O Contract

Inputs

Name	Type	Required	Description
data_folder	DataFolderLike	Yes	Path or filesystem object pointing to the folder containing CSV files
paths_file	DataFileLike or None	No	Optional file listing specific paths to read (one per line)
compression	Literal["infer", "gzip", "zstd"] or None	No	Compression format; defaults to "infer" which auto-detects from file extension
limit	int	No	Maximum number of documents to read; -1 means no limit
skip	int	No	Number of initial documents (CSV rows) to skip (default: 0)
file_progress	bool	No	Whether to show a progress bar for files
doc_progress	bool	No	Whether to show a progress bar for documents
adapter	Callable	No	Custom function to transform raw data dicts into Document-compatible dicts
text_key	str	No	Column name containing document text (default: "text")
id_key	str	No	Column name containing document ID (default: "id")
default_metadata	dict	No	Default metadata added to all documents
recursive	bool	No	Whether to search for files recursively (default: True)
glob_pattern	str or None	No	Glob pattern to filter which files are included
shuffle_files	bool	No	Whether to shuffle files within the returned shard
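As an illustration of the adapter parameter, the sketch below builds the document text from two columns. It assumes the adapter is called with the reader instance, the raw row dict, the file path, and the row index, and that it returns a dict with "text", "id", and "metadata" keys; verify this against the installed Datatrove version. The function name and the column names (title, body, doc_id) are hypothetical.

```python
def title_body_adapter(self, data: dict, path: str, id_in_file: int) -> dict:
    """Hypothetical adapter: join 'title' and 'body' columns into the text."""
    title = data.pop("title", "")
    body = data.pop("body", "")
    return {
        "text": f"{title}\n\n{body}".strip(),
        # Fall back to a path-based ID when the row has no 'doc_id' column.
        "id": data.pop("doc_id", f"{path}/{id_in_file}"),
        "metadata": data,  # whatever columns are left over
    }


# Call the adapter directly; no reader instance is needed for this check.
row = {"doc_id": "a1", "title": "Hello", "body": "World", "lang": "en"}
doc = title_body_adapter(None, row, "data/file.csv", 0)
```

Under the same assumption, such a function would be passed to the reader as adapter=title_body_adapter.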

Outputs

Name	Type	Description
documents	Generator[Document]	Yields Document objects, one per CSV row, with text and metadata extracted from columns
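The one-item-per-row output, together with the skip and limit inputs, can be sketched as a small generator. This is a single-file illustration using only the standard library; iter_rows is a hypothetical name, and how the actual reader applies skip and limit across several files is not covered here.

```python
import csv
import io
from itertools import islice
from typing import Iterator


def iter_rows(handle, skip: int = 0, limit: int = -1) -> Iterator[dict]:
    """Yield one dict per CSV row, dropping `skip` rows and stopping after `limit`."""
    reader = csv.DictReader(handle)
    stop = None if limit < 0 else skip + limit  # -1 means no limit
    yield from islice(reader, skip, stop)


sample = "id,text\n1,a\n2,b\n3,c\n4,d\n"
window = list(iter_rows(io.StringIO(sample), skip=1, limit=2))
everything = list(iter_rows(io.StringIO(sample)))
```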

Usage Examples

Basic Usage

from datatrove.pipeline.readers.csv import CsvReader

# Read all CSV files from a local directory
reader = CsvReader(
    data_folder="path/to/csv/files",
    text_key="content",
    id_key="doc_id",
)

With Compression

from datatrove.pipeline.readers.csv import CsvReader

# Read gzip-compressed CSV files
reader = CsvReader(
    data_folder="s3://my-bucket/csv-data/",
    compression="gzip",
    glob_pattern="*.csv.gz",
)

Related Pages

Principle:Huggingface_Datatrove_CSV_Data_Reading
