Principle:Huggingface Datasets Streaming Skip
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Skipping a fixed number of elements from a streaming dataset enables offset-based access into an ordered data stream without materializing the skipped portion.
Description
The skip operation creates a new streaming dataset that discards the first n elements of the underlying stream and begins yielding from the (n+1)-th element (index n when counting from zero). This is the streaming equivalent of slicing a list with [n:].
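The list-slicing analogy can be illustrated with a minimal sketch built on the standard library. The skip helper below is hypothetical and only models the semantics with itertools.islice; it is not the library's implementation.

```python
from itertools import islice

def skip(stream, n):
    """Lazily yield every element of `stream` after the first `n` (hypothetical helper)."""
    # islice(iterable, n, None) reads and drops the first n items on demand,
    # then passes the remaining items through unchanged.
    return islice(stream, n, None)

data = ["a", "b", "c", "d", "e"]
skipped = list(skip(iter(data), 2))
print(skipped)
```

For an in-memory list the result matches data[2:], but the streaming version never needs the whole sequence in memory.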
Key characteristics:
- Lazy offset: The skip operation does not consume elements at definition time. The first n elements are consumed and discarded only when iteration begins.
- Complementary to take: Together, skip(n) and take(n) partition a stream into two non-overlapping segments. ds.take(n) yields the first n elements, and ds.skip(n) yields everything after. This enables patterns like train/validation splitting.
- Shard order preservation: Like take, skip fixes the shard order, preventing subsequent shuffle operations from reordering shards.
- Distributed awareness: When used in a distributed context, the skip count can be split across nodes if split_when_sharding is enabled.
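The lazy-offset and complementarity properties above can be sketched as follows. This is a standard-library model under stated assumptions: source is a hypothetical six-element stream that logs every read, and islice stands in for take and skip rather than calling the datasets library.

```python
from itertools import islice

pulled = []  # every element actually read from the source is logged here

def source():
    """Hypothetical stream of six examples that logs each read."""
    for i in range(6):
        pulled.append(i)
        yield i

# Defining the skipped stream reads nothing from the source.
tail = islice(source(), 2, None)      # models ds.skip(2)
pulled_at_definition = list(pulled)   # still empty at this point

head_items = list(islice(source(), 2))  # models ds.take(2)
pulled.clear()
tail_items = list(tail)  # iteration reads and discards the first two elements

print(head_items, tail_items)
```

The two segments do not overlap and together cover the whole stream, while the read log shows that the skipped elements were still pulled from the source once iteration began.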
Common use cases include:
- Resuming iteration from a known checkpoint (skip the examples already processed).
- Creating complementary train/validation splits: use take(n) for validation and skip(n) for training.
- Implementing pagination over a streaming dataset.
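The pagination use case can be sketched as a skip followed by a take. The names page and make_stream are hypothetical, and the stream is a stand-in built from range; the sketch assumes the stream yields elements in the same order every time it is opened.

```python
from itertools import islice

def make_stream():
    """Hypothetical re-openable stream of ten examples."""
    return iter(range(10))

def page(open_stream, page_index, page_size):
    """Return one page: skip all earlier pages, then take `page_size` items."""
    offset = page_index * page_size
    return list(islice(open_stream(), offset, offset + page_size))

pages = [page(make_stream, i, 4) for i in range(3)]
print(pages)
```

Each page request rescans the stream from the beginning, so fetching page k costs time proportional to k times the page size.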
Usage
Use streaming skip when:
- You need to resume processing from a specific offset in the stream.
- You want to create a complementary split alongside a take operation.
- You are implementing checkpoint-based training where already-seen examples should be bypassed.
- You need offset-based pagination over a streaming data source.
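Checkpoint-based resumption can be sketched as follows, assuming the stream order is deterministic across runs. The process function, its doubling "work" step, and make_stream are all hypothetical placeholders; only the skip-to-offset pattern is the point.

```python
from itertools import islice

def make_stream():
    """Hypothetical re-openable stream of seven examples."""
    return iter(range(7))

def process(open_stream, start_offset, max_items):
    """Process up to `max_items` examples starting at `start_offset`.

    Returns the results and the offset to store as the next checkpoint.
    """
    results = []
    offset = start_offset
    for example in islice(open_stream(), start_offset, start_offset + max_items):
        results.append(example * 2)  # stand-in for real per-example work
        offset += 1
    return results, offset

first, checkpoint = process(make_stream, 0, 3)            # run interrupted after 3 examples
second, final_offset = process(make_stream, checkpoint, 10)  # resume by skipping those 3
print(first, checkpoint, second, final_offset)
```

If the order of the stream can change between runs (for example, after reshuffling), the stored offset no longer identifies the same examples, so the checkpoint is only meaningful for a fixed order.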
Theoretical Basis
The skip operation corresponds to the suffix operation on sequences: given a stream S and a count n, skip(n) produces the sequence S[n], S[n+1], S[n+2], .... In combination with take, it provides a complete decomposition of the stream into prefix and suffix.
From a computational perspective, skip requires O(n) time to discard the first n elements but only O(1) additional memory, because the discarded elements are never stored. This is a fundamental trade-off in streaming systems: random access is not available, so offset-based access requires linear scanning. The scanning cost is paid once per pass, at the start of iteration, after which elements flow at full throughput; restarting iteration from the beginning pays it again.
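The cost profile can be made concrete by counting reads in a small model. CountingSource and Skip are hypothetical classes written for this illustration; they are not the library's classes and make no claim about its internals.

```python
from itertools import islice

class CountingSource:
    """Hypothetical re-iterable source that counts how many elements are read."""
    def __init__(self, size):
        self.size = size
        self.reads = 0

    def __iter__(self):
        for i in range(self.size):
            self.reads += 1
            yield i

class Skip:
    """Model of a skipped stream: drops the first n elements on every pass."""
    def __init__(self, source, n):
        self.source = source
        self.n = n

    def __iter__(self):
        # Nothing is buffered: skipped items are read and immediately dropped.
        return islice(iter(self.source), self.n, None)

source = CountingSource(1000)
skipped = Skip(source, 990)

first_item = next(iter(skipped))
reads_for_first = source.reads  # 990 discarded reads plus the one yielded

reads_per_pass = []
for _ in range(2):
    source.reads = 0
    items = list(skipped)
    reads_per_pass.append(source.reads)

print(first_item, reads_for_first, reads_per_pass)
```

In this model, reaching the first yielded element requires reading n + 1 items, and every fresh pass over the skipped stream repeats the full scan.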