Principle:Huggingface Datasets Dataset Concatenation
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Dataset Concatenation is the principle of combining multiple datasets into a single dataset by stacking rows vertically (axis=0) or columns horizontally (axis=1).
Description
Building a training corpus from several sources often requires merging datasets. The Dataset Concatenation principle covers vertical concatenation (appending rows from datasets that share a schema) and horizontal concatenation (placing the columns of datasets with the same number of rows side by side, matched by row position rather than by key). The library exposes both through the concatenate_datasets function, which works with map-style Dataset objects and streaming IterableDataset objects. For iterable datasets, vertical concatenation sums the shard counts, while horizontal concatenation reduces to a single shard so that rows paired by position cannot drift out of alignment. Optional info and split parameters allow overriding the metadata of the resulting dataset.
Usage
Use Dataset Concatenation when you need to merge multiple datasets that share the same schema (vertical) or the same number of rows (horizontal). Common use cases include combining training splits from different sources, merging feature columns computed by separate processing steps, and creating multi-task training datasets.
Theoretical Basis
Vertical concatenation (axis=0) creates a virtual table that contains the rows of all input tables in input order; duplicate rows are kept, so the result has as many rows as the inputs combined. For map-style datasets this is implemented by concatenating the underlying Arrow tables and merging their index mappings. Horizontal concatenation (axis=1) creates a table that contains the columns of all input tables side by side, and requires that all datasets have the same row count. Schema compatibility is enforced in both cases: vertical concatenation requires matching column types across datasets, and horizontal concatenation requires non-overlapping column names.