
Implementation:Huggingface Datasets IterableDataset Take

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the Hugging Face Datasets library, for taking the first n elements of a streaming dataset.

Description

IterableDataset.take wraps the dataset's internal example iterable with a TakeExamplesIterable. This iterable counts elements as they are yielded and stops iteration after n elements have been produced.

Internally, the method:

  1. Creates a TakeExamplesIterable wrapping the existing _ex_iterable with the count n.
  2. Sets split_when_sharding=True if the dataset is not in distributed mode (self._distributed is None), allowing the take count to be split across data-loading workers.
  3. Returns a new IterableDataset with the wrapped iterable, preserving info, split, formatting, and distributed settings.
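
The steps above can be illustrated with a minimal pure-Python sketch. It is not the library's code: TakeSketch and split_count are hypothetical names introduced here, and the even-split policy for dividing the count across workers is an assumption made for illustration.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List


class TakeSketch:
    """Hypothetical stand-in for a take-style wrapper; not the library's class."""

    def __init__(self, source: Iterable[Any], n: int) -> None:
        self.source = source
        self.n = n

    def __iter__(self) -> Iterator[Any]:
        # Stop after n elements, or earlier if the source is exhausted.
        return islice(self.source, self.n)


def split_count(n: int, num_workers: int) -> List[int]:
    """Divide n across workers as evenly as possible (assumed policy)."""
    base, extra = divmod(n, num_workers)
    return [base + 1 if i < extra else base for i in range(num_workers)]


first_three = list(TakeSketch(range(10), 3))   # wrapper stops after 3 elements
short_source = list(TakeSketch(range(2), 5))   # source ends before n is reached
per_worker = split_count(10, 4)                # per-worker counts summing to 10
```

The wrapper leaves the source untouched and only bounds how much of it is consumed, which is why the result can be returned as a new dataset object without copying any data.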

Calling take fixes the shard order: a later shuffle does not reorder the shards and only shuffles elements through the shuffle buffer. Otherwise, the taken elements could come from different shards after shuffling.
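
To see why a fixed shard order matters, the following pure-Python simulation may help. It does not use the library: the shard lists and the buffer_shuffle helper are hypothetical, and the buffer algorithm is a generic approximation rather than the library's own implementation.

```python
import random
from itertools import chain, islice
from typing import Any, Iterable, Iterator, List


def buffer_shuffle(items: Iterable[Any], buffer_size: int, seed: int) -> Iterator[Any]:
    """Approximate shuffle that keeps a bounded buffer and emits random entries."""
    rng = random.Random(seed)
    buffer: List[Any] = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Flush what is left in the buffer in random order.
    rng.shuffle(buffer)
    yield from buffer


# Hypothetical shards, read in a fixed order.
shards = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

# Taking 4 elements with the shard order fixed always selects the same elements.
taken = list(islice(chain.from_iterable(shards), 4))

# Shuffling afterwards only permutes those elements within the buffer.
shuffled = list(buffer_shuffle(taken, buffer_size=2, seed=0))
```

In this sketch the shuffled output is a permutation of the same four elements. If the shards were reordered after the take, a different set of elements would be selected.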

Usage

Use IterableDataset.take when you need to limit the number of elements consumed from a streaming dataset, whether for quick inspection, creating evaluation subsets, or bounding training iterations.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/iterable_dataset.py
  • Lines: L3171-L3204

Signature

def take(self, n: int) -> "IterableDataset":

Import

from datasets import load_dataset

ds = load_dataset("my_dataset", split="train", streaming=True)
# take is a method on the returned IterableDataset
small_ds = ds.take(100)

I/O Contract

Inputs

Name | Type | Required | Description
n | int | Yes | Number of elements to take from the beginning of the stream.

Outputs

Name | Type | Description
dataset | IterableDataset | A new streaming dataset that yields at most n elements.

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)

small_ds = ds.take(2)
list(small_ds)
# [{'label': 1, 'text': 'the rock is destined to be the 21st century\'s new "conan" ...'},
#  {'label': 1, 'text': 'the gorgeously elaborate continuation of "the lord of the rings" ...'}]

Related Pages

Implements Principle
