Implementation:Huggingface Datasets IterableDataset Take
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the HuggingFace Datasets library, for taking a fixed number of elements from the beginning of a streaming dataset.
Description
IterableDataset.take wraps the dataset's internal example iterable with a TakeExamplesIterable. This iterable counts elements as they are yielded and stops iteration after n elements have been produced.
Internally, the method:
- Creates a TakeExamplesIterable wrapping the existing _ex_iterable with the count n.
- Sets split_when_sharding=True if the dataset is not in distributed mode (self._distributed is None), allowing the take count to be split across data-loading workers.
- Returns a new IterableDataset with the wrapped iterable, preserving info, split, formatting, and distributed settings.
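The steps above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: TakeSketch and split_count are hypothetical names introduced here, and the even split across workers is an assumption about how a take count could be divided, not a statement of what the library does internally.

```python
from itertools import islice
from typing import Iterable, Iterator, List


class TakeSketch:
    """Simplified stand-in for a take-style wrapper (not the library class)."""

    def __init__(self, source: Iterable[dict], n: int):
        self.source = source
        self.n = n

    def __iter__(self) -> Iterator[dict]:
        # islice stops after n items, or earlier if the source runs out.
        yield from islice(self.source, self.n)


def split_count(n: int, num_workers: int) -> List[int]:
    """Hypothetical helper: divide n as evenly as possible across workers."""
    base, extra = divmod(n, num_workers)
    return [base + 1 if i < extra else base for i in range(num_workers)]


# A lazy generator stands in for a streaming dataset; only 3 items are consumed.
stream = ({"id": i} for i in range(1_000_000))
first_three = list(TakeSketch(stream, n=3))

# Assumed behaviour: a take count of 10 shared by 4 workers.
per_worker = split_count(10, 4)
```

Because the wrapper is lazy, the million-element generator is never materialized. A source shorter than n simply yields everything it has, which matches the "at most n elements" wording of the output contract.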
The take operation fixes the shard order, meaning subsequent shuffle operations will not reorder shards (only buffer shuffling applies).
Usage
Use IterableDataset.take when you need to limit the number of elements consumed from a streaming dataset, whether for quick inspection, creating evaluation subsets, or bounding training iterations.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/iterable_dataset.py
- Lines: L3171-L3204
Signature
def take(self, n: int) -> "IterableDataset":
Import
from datasets import load_dataset
ds = load_dataset("my_dataset", split="train", streaming=True)
# take is a method on the returned IterableDataset
small_ds = ds.take(100)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| n | int | Yes | Number of elements to take from the beginning of the stream. |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | IterableDataset | A new streaming dataset that yields at most n elements. |
Usage Examples
Basic Usage
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)
small_ds = ds.take(2)
list(small_ds)
# [{'label': 1, 'text': 'the rock is destined to be the 21st century\'s new "conan" ...'},
# {'label': 1, 'text': 'the gorgeously elaborate continuation of "the lord of the rings" ...'}]