Implementation:Huggingface Datasets Concatenate Datasets

From Leeroopedia
Knowledge Sources
Domains: Data_Engineering, NLP
Last Updated: 2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the Hugging Face Datasets library, for combining multiple datasets by stacking rows or joining columns.

Description

concatenate_datasets is a function that takes a list of Dataset or IterableDataset objects and returns a single concatenated dataset. With axis=0 (the default), it stacks rows vertically, requiring all datasets to have compatible schemas. With axis=1, it joins columns horizontally, requiring all datasets to have the same number of rows. The function handles both map-style and iterable datasets (but not a mix of both). For map-style datasets, it delegates to _concatenate_map_style_datasets; for iterable datasets, it delegates to _concatenate_iterable_datasets. Optional info and split parameters allow overriding the metadata of the resulting dataset.

Usage

Use concatenate_datasets when you need to merge multiple datasets into one. Common use cases include combining training splits from different sources, adding computed feature columns to an existing dataset, and creating multi-domain training corpora.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/combine.py
  • Lines: L168-L232

Signature

def concatenate_datasets(
    dsets: list[DatasetType],
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    axis: int = 0,
) -> DatasetType:

Import

from datasets import concatenate_datasets

I/O Contract

Inputs

Name | Type | Required | Description
dsets | list[Dataset] or list[IterableDataset] | Yes | List of datasets to concatenate. All must be the same type (all Dataset or all IterableDataset).
info | Optional[DatasetInfo] | No | Dataset information (description, citation, etc.) to assign to the result.
split | Optional[NamedSplit] | No | Name of the dataset split to assign to the result.
axis | int | No | Concatenation axis: 0 for vertical (rows), 1 for horizontal (columns). Defaults to 0.

Outputs

Name | Type | Description
dataset | Dataset or IterableDataset | The concatenated dataset. Type matches the input datasets.

Usage Examples

Basic Usage

from datasets import Dataset, concatenate_datasets

# Vertical concatenation (stacking rows)
ds1 = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})
ds2 = Dataset.from_dict({"text": ["foo", "bar"], "label": [1, 0]})
combined = concatenate_datasets([ds1, ds2])
print(len(combined))  # 4

# Horizontal concatenation (joining columns)
ds_features = Dataset.from_dict({"input_ids": [[1, 2], [3, 4]]})
ds_labels = Dataset.from_dict({"label": [0, 1]})
combined = concatenate_datasets([ds_features, ds_labels], axis=1)
print(combined.column_names)  # ['input_ids', 'label']

Related Pages

Implements Principle
