Principle:Huggingface Datasets Dataset Concatenation

From Leeroopedia
Knowledge Sources
Domains: Data_Engineering, NLP
Last Updated: 2026-02-14 18:00 GMT

Overview

Dataset Concatenation is the principle of combining multiple datasets into a single dataset by stacking rows vertically (axis=0) or columns horizontally (axis=1).

Description

Training corpora are often assembled from several sources, which then have to be merged into one dataset. The Dataset Concatenation principle covers vertical concatenation (appending rows from multiple datasets with the same schema) and horizontal concatenation (joining columns from datasets with the same number of rows). The concatenate_datasets function works with both map-style Dataset objects and streaming IterableDataset objects. For iterable datasets, vertical concatenation yields a dataset whose shard count is the sum of the inputs' shard counts, while horizontal concatenation reduces the result to a single shard so that rows from different inputs cannot become misaligned. The optional info and split parameters override the metadata of the resulting dataset.

Usage

Use Dataset Concatenation when you need to merge multiple datasets that share the same schema (vertical) or the same number of rows (horizontal). Common use cases include combining training splits from different sources, merging feature columns computed by separate processing steps, and creating multi-task training datasets.

Theoretical Basis

Vertical concatenation (axis=0) creates a virtual table containing the rows of all input tables, appended in input order; duplicate rows are kept, so this is a concatenation rather than a set union. For map-style datasets this is implemented by concatenating the underlying Arrow tables and merging their index mappings. Horizontal concatenation (axis=1) creates a table whose columns are the columns of all input tables placed side by side, which requires that all datasets have the same row count. Schema compatibility is enforced: all datasets must have matching column types for vertical concatenation or non-overlapping column names for horizontal concatenation.

Related Pages

Implemented By
