Implementation:Huggingface Datasets Dataset To Csv
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Dataset.to_csv is the method the HuggingFace Datasets library provides for exporting a Dataset to a CSV file.
Description
Dataset.to_csv is a method that serializes the dataset's Arrow data to a CSV file. Internally, it delegates to CsvDatasetWriter, which processes the data in batches of a configurable size, converting each batch to a pandas DataFrame and calling DataFrame.to_csv. The method supports multiprocessing for faster writes and remote storage via fsspec storage options, and it forwards any additional keyword arguments to pandas DataFrame.to_csv (sep, quoting, encoding, etc.). By default, the header is included and the index is omitted.
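The batch-wise write described above can be sketched with pandas alone. This is a simplified illustration, not the library's code: write_csv_in_batches is a hypothetical helper, the sample DataFrame is made up, and the real CsvDatasetWriter works on Arrow batches and can use several processes.

```python
import io

import pandas as pd


def write_csv_in_batches(df, file_obj, batch_size=2, **to_csv_kwargs):
    """Hypothetical helper: write df to a binary file object batch by batch."""
    # Mirror the defaults described above: header included, index omitted.
    header = to_csv_kwargs.pop("header", True)
    index = to_csv_kwargs.pop("index", False)
    written = 0
    for offset in range(0, len(df), batch_size):
        batch = df.iloc[offset : offset + batch_size]
        # Only the first batch carries the header row.
        text = batch.to_csv(
            header=header if offset == 0 else False,
            index=index,
            **to_csv_kwargs,
        )
        # write() on a binary buffer returns the number of bytes written.
        written += file_obj.write(text.encode("utf-8"))
    return written


frame = pd.DataFrame({"id": [1, 2, 3], "text": ["a", "b", "c"]})
buffer = io.BytesIO()
total = write_csv_in_batches(frame, buffer, batch_size=2)
```

Writing the header only with the first batch keeps the output a single valid CSV file regardless of the batch size.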
Usage
Use Dataset.to_csv when you need to export a processed dataset to CSV for consumption by external tools, sharing with collaborators, or archival storage in a universally readable format.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: L4995-L5042
Signature
def to_csv(
    self,
    path_or_buf: Union[PathLike, BinaryIO],
    batch_size: Optional[int] = None,
    num_proc: Optional[int] = None,
    storage_options: Optional[dict] = None,
    **to_csv_kwargs,
) -> int:
Import
from datasets import Dataset
# to_csv is a method on Dataset instances
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path_or_buf | Union[PathLike, BinaryIO] | Yes | Path to a file, a remote URI, or a binary file object where the CSV will be written. |
| batch_size | Optional[int] | No | Number of rows to load in memory and write at once. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE. |
| num_proc | Optional[int] | No | Number of processes for multiprocessing. Defaults to None (single process). |
| storage_options | Optional[dict] | No | Key/value pairs for the fsspec file-system backend. |
| **to_csv_kwargs | keyword arguments | No | Additional keyword arguments forwarded to pandas.DataFrame.to_csv (e.g., sep, index, header). |
Outputs
| Name | Type | Description |
|---|---|---|
| written | int | The number of characters or bytes written. |
Usage Examples
Basic Usage
from datasets import load_dataset
ds = load_dataset("csv", data_files="input.csv", split="train")
# Export to CSV
num_bytes = ds.to_csv("output.csv")
# Export with custom separator and multiprocessing
num_bytes = ds.to_csv("output.tsv", sep="\t", num_proc=4)
# Export to a remote location
num_bytes = ds.to_csv(
    "hf://datasets/username/my_dataset/data.csv",
    storage_options={"token": "hf_..."},
)