Principle:Huggingface Datasets Dataset Concatenation

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

Dataset Concatenation is the principle of combining multiple datasets into a single dataset by stacking rows vertically (axis=0) or columns horizontally (axis=1).

Description

Training corpora are often assembled from multiple sources, so datasets frequently need to be merged. The Dataset Concatenation principle covers vertical concatenation (appending rows from multiple datasets with the same schema) and horizontal concatenation (joining columns from datasets with the same number of rows). The concatenate_datasets function works with both map-style Dataset objects and streaming IterableDataset objects. For iterable datasets, the result of vertical concatenation has as many shards as all of its inputs combined, while horizontal concatenation yields a single shard so that rows from different inputs cannot become misaligned. The optional info and split parameters override the metadata of the resulting dataset.

Usage

Use Dataset Concatenation when you need to merge multiple datasets that share the same schema (vertical) or the same number of rows (horizontal). Common use cases include combining training splits from different sources, merging feature columns computed by separate processing steps, and creating multi-task training datasets.

Theoretical Basis

Vertical concatenation (axis=0) creates a virtual table containing the rows of every input table in input order; duplicate rows are kept, so the result is a concatenation rather than a set union. For map-style datasets this is implemented by concatenating the underlying Arrow tables and merging their index mappings. Horizontal concatenation (axis=1) creates a table whose columns are the union of the columns of all input tables, which requires every dataset to have the same row count. Schema compatibility is enforced in both cases: vertical concatenation requires matching column types across datasets, and horizontal concatenation requires non-overlapping column names.

Related Pages

Implemented By

Implementation:Huggingface_Datasets_Concatenate_Datasets
