Implementation:Huggingface Datatrove CsvReader
| Knowledge Sources | |
|---|---|
| Domains | Data Processing, ETL |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
CsvReader is a pipeline reader that reads data from CSV files, converting each row into a separate Document object for downstream processing.
Description
CsvReader extends BaseDiskReader to provide CSV-specific data ingestion capabilities within the Datatrove pipeline framework. It uses Python's built-in csv.DictReader to parse CSV files, treating each row as a dictionary that is then converted into a Document via the inherited get_document_from_dict method.
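The row-to-document flow described above can be sketched with the standard library alone. This is a minimal illustration, not Datatrove's implementation: `SimpleDocument` and `rows_to_documents` are hypothetical names, and both the ID fallback and the handling of leftover columns as metadata are assumptions made for the example.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for datatrove's Document (not the real class)."""
    text: str
    id: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(csv_file, text_key="text", id_key="id"):
    """Yield one SimpleDocument per CSV row parsed by csv.DictReader."""
    for row_index, row in enumerate(csv.DictReader(csv_file)):
        row = dict(row)  # work on a copy before popping keys
        text = row.pop(text_key, "")
        # Assumption: fall back to the row index when no ID column is present.
        doc_id = row.pop(id_key, str(row_index))
        # Assumption: any remaining columns are kept as metadata.
        yield SimpleDocument(text=text, id=doc_id, metadata=row)


sample = io.StringIO("id,text,lang\na1,hello world,en\na2,bonjour,fr\n")
docs = list(rows_to_documents(sample))
```

Each entry in `docs` corresponds to exactly one data row of the input; the header row only supplies the dictionary keys.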
The reader supports gzip and zstd compression. The default setting, "infer", detects the compression from the file extension. It also inherits the standard reader features: file sharding, progress tracking, document limiting and skipping, and custom adapter functions that transform raw CSV rows into the expected Document format.
A CSVReader alias is also provided as an alternative spelling. Because files are accessed through Datatrove's DataFolder abstraction, the reader can read from local filesystems, S3, or other supported storage backends.
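To make the idea of extension-based detection concrete, here is a small standard-library sketch. It is not how Datatrove opens files internally; `open_csv_text` is a hypothetical helper, and it handles only gzip because zstd requires a third-party package.

```python
import csv
import gzip
import os
import tempfile


def open_csv_text(path, compression="infer"):
    """Open a CSV file as text, decompressing gzip when requested or inferred.

    Hypothetical helper for illustration; zstd is deliberately left out.
    """
    if compression == "infer":
        # Infer from the extension; anything not ending in .gz is read as plain text.
        compression = "gzip" if path.endswith(".gz") else None
    if compression == "gzip":
        return gzip.open(path, mode="rt", encoding="utf-8", newline="")
    if compression is None:
        return open(path, mode="rt", encoding="utf-8", newline="")
    raise ValueError(f"unsupported compression in this sketch: {compression}")


with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "sample.csv.gz")
    # Write a small gzip-compressed CSV file to read back.
    with gzip.open(gz_path, mode="wt", encoding="utf-8", newline="") as f:
        f.write("id,text\n1,first row\n2,second row\n")
    with open_csv_text(gz_path) as f:
        rows = list(csv.DictReader(f))
```

Passing an explicit value instead of "infer" is useful when file names do not carry a conventional extension.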
Usage
Use CsvReader when your input data is stored in CSV format and you need to ingest it into a Datatrove processing pipeline. It is suitable for structured tabular data where each row represents a document, with configurable text and ID column mappings.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/readers/csv.py
- Lines: 1-80
Signature
class CsvReader(BaseDiskReader):
    name = "🔢 Csv"

    def __init__(
        self,
        data_folder: DataFolderLike,
        paths_file: DataFileLike | None = None,
        compression: Literal["infer", "gzip", "zstd"] | None = "infer",
        limit: int = -1,
        skip: int = 0,
        file_progress: bool = False,
        doc_progress: bool = False,
        adapter: Callable = None,
        text_key: str = "text",
        id_key: str = "id",
        default_metadata: dict = None,
        recursive: bool = True,
        glob_pattern: str | None = None,
        shuffle_files: bool = False,
    )
Import
from datatrove.pipeline.readers.csv import CsvReader
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_folder | DataFolderLike | Yes | Path or filesystem object pointing to the folder containing CSV files |
| paths_file | DataFileLike or None | No | Optional file listing specific paths to read (one per line) |
| compression | Literal["infer", "gzip", "zstd"] or None | No | Compression format; defaults to "infer" which auto-detects from file extension |
| limit | int | No | Maximum number of documents to read; -1 means no limit |
| skip | int | No | Number of initial rows to skip |
| file_progress | bool | No | Whether to show a progress bar for files |
| doc_progress | bool | No | Whether to show a progress bar for documents |
| adapter | Callable | No | Custom function to transform raw data dicts into Document-compatible dicts |
| text_key | str | No | Column name containing document text (default: "text") |
| id_key | str | No | Column name containing document ID (default: "id") |
| default_metadata | dict | No | Default metadata added to all documents |
| recursive | bool | No | Whether to search for files recursively (default: True) |
| glob_pattern | str or None | No | Glob pattern to filter which files are included |
| shuffle_files | bool | No | Whether to shuffle files within the returned shard |
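As an illustration of the adapter parameter, the sketch below defines a function that merges two columns into the document text. The call signature `(self, data, path, id_in_file)` and the returned keys (`text`, `id`, `metadata`) are assumptions about Datatrove's adapter convention and should be checked against the installed version; the column names `title`, `body`, and `doc_id` are hypothetical.

```python
def content_adapter(self, data: dict, path: str, id_in_file: int) -> dict:
    """Hypothetical adapter: build a document dict from 'title' and 'body' columns."""
    return {
        # Join title and body; strip() removes the separator if either is empty.
        "text": f"{data.get('title', '')}\n\n{data.get('body', '')}".strip(),
        # Assumption: fall back to a path-based ID when no 'doc_id' column exists.
        "id": data.get("doc_id", f"{path}/{id_in_file}"),
        "metadata": {"source_file": path},
    }


# Call the adapter directly on one parsed row; `self` is unused in this sketch.
row = {"doc_id": "x-7", "title": "Intro", "body": "CSV rows become documents."}
result = content_adapter(None, row, "data/part-0.csv", 0)
```

Under these assumptions, such a function would be passed as `adapter=content_adapter` when constructing the reader, in place of the `text_key` and `id_key` mappings.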
Outputs
| Name | Type | Description |
|---|---|---|
| documents | Generator[Document] | Yields Document objects, one per CSV row, with text and metadata extracted from columns |
Usage Examples
Basic Usage
from datatrove.pipeline.readers.csv import CsvReader

# Read all CSV files from a local directory
reader = CsvReader(
    data_folder="path/to/csv/files",
    text_key="content",
    id_key="doc_id",
)
With Compression
from datatrove.pipeline.readers.csv import CsvReader

# Read gzip-compressed CSV files
reader = CsvReader(
    data_folder="s3://my-bucket/csv-data/",
    compression="gzip",
    glob_pattern="*.csv.gz",
)