Implementation:Huggingface Datasets Concatenate Datasets
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the HuggingFace Datasets library, for combining multiple datasets by stacking rows or joining columns.
Description
concatenate_datasets is a function that takes a list of Dataset or IterableDataset objects and returns a single concatenated dataset. With axis=0 (the default), it stacks rows vertically, requiring all datasets to have compatible schemas. With axis=1, it joins columns horizontally, requiring all datasets to have the same number of rows. The function handles both map-style and iterable datasets (but not a mix of both). For map-style datasets, it delegates to _concatenate_map_style_datasets; for iterable datasets, it delegates to _concatenate_iterable_datasets. Optional info and split parameters allow overriding the metadata of the resulting dataset.
Usage
Use concatenate_datasets when you need to merge multiple datasets into one. Common use cases include combining training splits from different sources, adding computed feature columns to an existing dataset, and creating multi-domain training corpora.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/combine.py
- Lines: L168-L232
Signature
def concatenate_datasets(
    dsets: list[DatasetType],
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    axis: int = 0,
) -> DatasetType:
Import
from datasets import concatenate_datasets
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dsets | list[Dataset] or list[IterableDataset] | Yes | List of datasets to concatenate. All must be the same type (all Dataset or all IterableDataset). |
| info | Optional[DatasetInfo] | No | Dataset information (description, citation, etc.) to assign to the result. |
| split | Optional[NamedSplit] | No | Name of the dataset split to assign to the result. |
| axis | int | No | Concatenation axis: 0 for vertical (rows), 1 for horizontal (columns). Defaults to 0. |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset or IterableDataset | The concatenated dataset. Type matches the input datasets. |
Usage Examples
Basic Usage
from datasets import Dataset, concatenate_datasets
# Vertical concatenation (stacking rows)
ds1 = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})
ds2 = Dataset.from_dict({"text": ["foo", "bar"], "label": [1, 0]})
combined = concatenate_datasets([ds1, ds2])
print(len(combined)) # 4
# Horizontal concatenation (joining columns)
ds_features = Dataset.from_dict({"input_ids": [[1, 2], [3, 4]]})
ds_labels = Dataset.from_dict({"label": [0, 1]})
combined = concatenate_datasets([ds_features, ds_labels], axis=1)
print(combined.column_names) # ['input_ids', 'label']