
Implementation:Huggingface Datasets Dataset To Csv

From Leeroopedia
Revision as of 12:59, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Huggingface_Datasets_Dataset_To_Csv.md)
Knowledge Sources
Domains: Data_Engineering, NLP
Last Updated: 2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the HuggingFace Datasets library, for exporting a Dataset to the CSV file format.

Description

Dataset.to_csv serializes the dataset's Arrow data to a CSV file. Internally, it delegates to CsvDatasetWriter, which processes the data in batches of configurable size, converting each batch to a pandas DataFrame and calling DataFrame.to_csv on it. The method supports multiprocessing for faster writes and remote storage via fsspec storage options, and it forwards additional keyword arguments to pandas DataFrame.to_csv (sep, quoting, encoding, etc.). By default, the header is included and the index is omitted, whereas pandas on its own writes the index by default.
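The batch-wise flow described above can be sketched with pandas alone. This is a minimal illustration under stated assumptions, not the library's implementation: write_csv_in_batches is a hypothetical helper, a plain DataFrame stands in for the Arrow data, and UTF-8 is assumed as the encoding. It shows the header being emitted only for the first batch and the index being omitted unless requested.

```python
import io

import pandas as pd


def write_csv_in_batches(df, file_obj, batch_size=2, **to_csv_kwargs):
    """Hypothetical helper: write df to a binary file object one batch at a time.

    Returns the total number of bytes written.
    """
    # Assumed defaults mirroring the description: header on, index off.
    header = to_csv_kwargs.pop("header", True)
    index = to_csv_kwargs.pop("index", False)
    written = 0
    for offset in range(0, len(df), batch_size):
        batch = df.iloc[offset : offset + batch_size]
        # With path_or_buf=None, DataFrame.to_csv returns the CSV text as a string.
        csv_str = batch.to_csv(
            path_or_buf=None,
            header=header if offset == 0 else False,  # header only once
            index=index,
            **to_csv_kwargs,
        )
        written += file_obj.write(csv_str.encode("utf-8"))
    return written


df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "text": ["a", "b", "c", "d", "e"]})
buffer = io.BytesIO()
written = write_csv_in_batches(df, buffer, batch_size=2)
```

Because only the first batch carries the header, the concatenated output matches what a single DataFrame.to_csv(index=False) call would produce, while only one batch needs to be converted at a time.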

Usage

Use Dataset.to_csv when you need to export a processed dataset to CSV for consumption by external tools, sharing with collaborators, or archival storage in a universally readable format.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_dataset.py
  • Lines: L4995-L5042

Signature

def to_csv(
    self,
    path_or_buf: Union[PathLike, BinaryIO],
    batch_size: Optional[int] = None,
    num_proc: Optional[int] = None,
    storage_options: Optional[dict] = None,
    **to_csv_kwargs,
) -> int:

Import

from datasets import Dataset
# to_csv is a method on Dataset instances

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| path_or_buf | Union[PathLike, BinaryIO] | Yes | Path to a file, a remote URI, or a binary file object where the CSV will be written. |
| batch_size | Optional[int] | No | Number of rows to load in memory and write at once. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE. |
| num_proc | Optional[int] | No | Number of processes for multiprocessing. Defaults to None (single process). |
| storage_options | Optional[dict] | No | Key/value pairs for the fsspec file-system backend. |
| **to_csv_kwargs | (keyword arguments) | No | Additional keyword arguments forwarded to pandas.DataFrame.to_csv (e.g., sep, index, header). |

Outputs

| Name | Type | Description |
|---|---|---|
| written | int | The number of characters or bytes written. |

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("csv", data_files="input.csv", split="train")

# Export to CSV
num_bytes = ds.to_csv("output.csv")

# Export with custom separator and multiprocessing
num_bytes = ds.to_csv("output.tsv", sep="\t", num_proc=4)

# Export to a remote location
num_bytes = ds.to_csv(
    "hf://datasets/username/my_dataset/data.csv",
    storage_options={"token": "hf_..."},
)

Related Pages

Implements Principle
