
Implementation:Huggingface Datasets IterableDataset Skip

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the Hugging Face Datasets library, for skipping a fixed number of elements at the start of a streaming dataset.

Description

IterableDataset.skip wraps the dataset's internal example iterable with a SkipExamplesIterable. This iterable consumes and discards the first n elements from the source during iteration, then yields all subsequent elements normally.

Internally, the method:

  1. Creates a SkipExamplesIterable wrapping the existing _ex_iterable with the count n.
  2. Sets split_when_sharding=True if the dataset is not in distributed mode (self._distributed is None), allowing the skip count to be divided across data-loading workers.
  3. Returns a new IterableDataset with the wrapped iterable, preserving info, split, formatting, and distributed settings.

Like take, the skip operation locks in the shard order so that the set of skipped elements stays deterministic. For that reason, any shuffling should be applied before calling skip.
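
The consume-and-discard behavior described above can be illustrated with a plain Python generator. This is only a minimal sketch of the semantics, not the library's SkipExamplesIterable: the function name skip_examples and the toy stream are hypothetical, and sharding and distributed handling are left out entirely.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def skip_examples(source: Iterable[T], n: int) -> Iterator[T]:
    """Hypothetical stand-in for the skip step: drop the first n items, yield the rest."""
    # islice(source, n, None) reads and discards the first n items,
    # then yields everything that follows.
    yield from islice(source, n, None)


# Toy stream standing in for a streaming dataset (assumption for illustration).
stream = ({"id": i} for i in range(5))
remaining = list(skip_examples(stream, 2))
print(remaining)  # [{'id': 2}, {'id': 3}, {'id': 4}]
```

Note that the skipped items are still read from the source; they are simply not yielded, so a large n still costs the time needed to iterate over those items.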

Usage

Use IterableDataset.skip when you need to bypass the first n elements of a streaming dataset, whether for resuming from a checkpoint, creating train/validation splits, or implementing offset-based access patterns.
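
The split and resume patterns mentioned above can be sketched with the standard library alone. The helpers make_stream, take, and skip below are hypothetical stand-ins written for this illustration; they mimic the semantics of the IterableDataset methods on a toy source and assume the source yields items in the same order on every pass.

```python
from itertools import islice


def make_stream():
    # Hypothetical source: each call starts a fresh pass over the same data,
    # much as iterating a streaming dataset again restarts from the beginning.
    return iter(range(10))


def take(stream, n):
    # Keep only the first n items.
    return islice(stream, n)


def skip(stream, n):
    # Discard the first n items and keep the rest.
    return islice(stream, n, None)


# Train/validation split: the first 3 items go to validation, the rest to training.
val = list(take(make_stream(), 3))
train = list(skip(make_stream(), 3))

# Resuming: if 4 examples were already processed before a restart,
# skipping 4 on a fresh pass continues from the fifth example.
examples_seen = 4
resumed = list(skip(make_stream(), examples_seen))

print(val)      # [0, 1, 2]
print(train)    # [3, 4, 5, 6, 7, 8, 9]
print(resumed)  # [4, 5, 6, 7, 8, 9]
```

Using the same count for take and skip is what keeps the two splits disjoint and complete: every item lands in exactly one of them.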

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/iterable_dataset.py
  • Lines: L3087-L3127

Signature

def skip(self, n: int) -> "IterableDataset":

Import

from datasets import load_dataset

ds = load_dataset("my_dataset", split="train", streaming=True)
# skip is a method on the returned IterableDataset
ds = ds.skip(1000)

I/O Contract

Inputs

Name | Type | Required | Description
n | int | Yes | Number of elements to skip from the beginning of the stream.

Outputs

Name | Type | Description
dataset | IterableDataset | A new streaming dataset that begins yielding after the first n elements.

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)

# Skip the first element; keep the original stream in ds for later use
skipped = ds.skip(1)
list(skipped.take(3))
# [{'label': 1, 'text': 'the gorgeously elaborate continuation of "the lord of the rings" ...'},
#  {'label': 1, 'text': 'effective but too-tepid biopic'},
#  {'label': 1, 'text': 'if you sometimes like to go to the movies to have fun ...'}]

# Create a train/validation split from the original stream:
# the first 500 examples form the validation set, the rest the training set
val_ds = ds.take(500)
train_ds = ds.skip(500)

Related Pages

Implements Principle
