Implementation:Huggingface Datasets IterableDataset Skip
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
A concrete tool, provided by the HuggingFace Datasets library, for skipping a fixed number of elements at the start of a streaming dataset.
Description
IterableDataset.skip wraps the dataset's internal example iterable with a SkipExamplesIterable. This iterable consumes and discards the first n elements from the source during iteration, then yields all subsequent elements normally.
Internally, the method:
- Creates a SkipExamplesIterable wrapping the existing _ex_iterable with the count n.
- Sets split_when_sharding=True if the dataset is not in distributed mode (self._distributed is None), allowing the skip count to be divided across data-loading workers.
- Returns a new IterableDataset with the wrapped iterable, preserving info, split, formatting, and distributed settings.
Like take, the skip operation fixes the shard order to maintain deterministic behavior.
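The discard-then-yield behavior described above can be sketched in plain Python. This is a minimal illustration only, not the library's SkipExamplesIterable: the helper name skip_examples is hypothetical, and the sketch ignores sharding, formatting, and distributed settings.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def skip_examples(source: Iterable[T], n: int) -> Iterator[T]:
    """Hypothetical helper: discard the first n items of source, then yield the rest."""
    iterator = iter(source)
    # Consume and throw away up to n items; stops early if the source is shorter.
    for _ in islice(iterator, n):
        pass
    # Everything after the skipped prefix passes through unchanged.
    yield from iterator


stream = ({"id": i} for i in range(5))
remaining = list(skip_examples(stream, 2))
print(remaining)  # [{'id': 2}, {'id': 3}, {'id': 4}]
```

In this sketch the skipped elements are still read from the source before being dropped, which matches the "consumes and discards" wording above, so skipping a long prefix is not free.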
Usage
Use IterableDataset.skip when you need to bypass the first n elements of a streaming dataset, whether for resuming from a checkpoint, creating train/validation splits, or implementing offset-based access patterns.
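The checkpoint-resume pattern can be sketched as follows. This is a plain-Python stand-in, assuming the stream yields examples in the same order each time it is re-opened; make_stream and examples_seen are hypothetical names, and itertools.islice plays the role that ds.skip(examples_seen) would play with the real API.

```python
from itertools import islice
from typing import Iterator


def make_stream() -> Iterator[int]:
    """Hypothetical stand-in for re-opening a streaming dataset from the start."""
    return iter(range(10))


# First run: process some examples and remember how many were consumed.
examples_seen = 0
for example in islice(make_stream(), 4):
    examples_seen += 1  # in practice, this counter is saved with the checkpoint

# Resumed run: re-open the stream and bypass what was already processed.
# With a streaming dataset this step would be roughly ds.skip(examples_seen).
resumed = list(islice(make_stream(), examples_seen, None))
print(resumed)  # [4, 5, 6, 7, 8, 9]
```

The approach only resumes correctly if the order of examples is reproducible between runs, which is an assumption of this sketch.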
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/iterable_dataset.py
- Lines: L3087-L3127
Signature
def skip(self, n: int) -> "IterableDataset":
Import
from datasets import load_dataset
ds = load_dataset("my_dataset", split="train", streaming=True)
# skip is a method on the returned IterableDataset
ds = ds.skip(1000)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| n | int | Yes | Number of elements to skip from the beginning of the stream. |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | IterableDataset | A new streaming dataset that begins yielding after the first n elements. |
Usage Examples
Basic Usage
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)
# Skip the first element
ds = ds.skip(1)
list(ds.take(3))
# [{'label': 1, 'text': 'the gorgeously elaborate continuation of "the lord of the rings" ...'},
# {'label': 1, 'text': 'effective but too-tepid biopic'},
# {'label': 1, 'text': 'if you sometimes like to go to the movies to have fun ...'}]
# Create a train/validation split: the first 500 examples of ds for validation, the rest for training
val_ds = ds.take(500)
train_ds = ds.skip(500)