Implementation:Huggingface Datasets IterableDataset Filter
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the HuggingFace Datasets library, for lazily filtering the elements of a streaming dataset with a predicate function.
Description
IterableDataset.filter wraps the dataset's internal example iterable with a FilteredExamplesIterable. The predicate function is stored but not evaluated until the dataset is iterated. At iteration time, each element (or batch) is passed through the predicate, and only elements for which the predicate returns True are yielded.
Internally, the method:
- Normalizes input_columns from a string to a list.
- If features or formatting are present, wraps the iterable with FormattedExamplesIterable to ensure the predicate receives properly decoded data (important for feature types like Image and Audio).
- Wraps the iterable in FilteredExamplesIterable with the predicate function and configuration.
- Returns a new IterableDataset preserving the original info, split, formatting, and distributed settings.
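The steps above can be illustrated with a minimal plain-Python sketch. It is a conceptual model only, not the library's implementation: lazy_filter is a hypothetical helper that normalizes input_columns and defers every predicate call until the result is iterated. Decoding, formatting, and batching are left out.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Union


def lazy_filter(
    examples: Iterable[dict],
    function: Callable[..., bool],
    input_columns: Optional[Union[str, List[str]]] = None,
) -> Iterator[dict]:
    """Hypothetical stand-in for the wrapping step; not part of datasets."""
    # Normalize a single column name to a list, as the method does.
    if isinstance(input_columns, str):
        input_columns = [input_columns]

    def generate() -> Iterator[dict]:
        for example in examples:
            # Pass either the whole example or the selected columns positionally.
            if input_columns is None:
                args = [example]
            else:
                args = [example[column] for column in input_columns]
            if function(*args):
                yield example

    # Calling generate() only creates the generator; no predicate call happens yet.
    return generate()


calls = []


def is_negative(label: int) -> bool:
    calls.append(label)  # record each predicate call to make laziness visible
    return label == 0


rows = [
    {"text": "a", "label": 1},
    {"text": "b", "label": 0},
    {"text": "c", "label": 0},
]

filtered = lazy_filter(rows, is_negative, input_columns="label")
calls_before_iteration = len(calls)  # the predicate has not run at this point
kept = list(filtered)  # iteration triggers the predicate calls
```

The call counter shows the point of the design: building the filtered object costs nothing, and the predicate runs once per element only while the stream is consumed.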
If the predicate function is asynchronous, the filter operation runs it in parallel with up to one thousand simultaneous calls.
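As a rough illustration of what a cap on simultaneous calls means, the sketch below bounds concurrency with an asyncio.Semaphore. The helper filter_async and its max_concurrency parameter are assumptions made for this example. The sketch collects all items eagerly and does not reproduce how the library schedules asynchronous predicates over a stream.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


async def filter_async(
    items: List[T],
    predicate: Callable[[T], Awaitable[bool]],
    max_concurrency: int = 1000,
) -> List[T]:
    """Hypothetical helper: run an async predicate with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: T) -> bool:
        # At most max_concurrency predicate calls are in flight at once.
        async with semaphore:
            return await predicate(item)

    # gather preserves input order, so results line up with items.
    keep_flags = await asyncio.gather(*(guarded(item) for item in items))
    return [item for item, keep in zip(items, keep_flags) if keep]


async def is_even(value: int) -> bool:
    await asyncio.sleep(0)  # stand-in for real I/O such as a remote API call
    return value % 2 == 0


result = asyncio.run(filter_async(list(range(10)), is_even, max_concurrency=3))
```

Bounded concurrency of this kind pays off when the predicate is I/O-bound, for example when it waits on a remote service.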
Usage
Use IterableDataset.filter when you need to select a subset of streaming examples based on their content, labels, or other properties without downloading the full dataset.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/iterable_dataset.py
- Lines: L2930-L3013
Signature
def filter(
    self,
    function: Optional[Callable] = None,
    with_indices=False,
    input_columns: Optional[Union[str, list[str]]] = None,
    batched: bool = False,
    batch_size: Optional[int] = 1000,
    fn_kwargs: Optional[dict] = None,
) -> "IterableDataset":
Import
from datasets import load_dataset
ds = load_dataset("my_dataset", split="train", streaming=True)
# filter is a method on the returned IterableDataset
ds = ds.filter(my_predicate)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| function | Optional[Callable] | No | Predicate function returning bool (or list of bools if batched). Defaults to always True. |
| with_indices | bool | No | If True, passes element indices to the function. Defaults to False. |
| input_columns | Optional[Union[str, list[str]]] | No | Columns to pass as positional arguments. If None, passes entire example dict. |
| batched | bool | No | If True, provides batches of examples to the function. Defaults to False. |
| batch_size | Optional[int] | No | Number of examples per batch when batched=True. Defaults to 1000. |
| fn_kwargs | Optional[dict] | No | Additional keyword arguments passed to the function. |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | IterableDataset | A new streaming dataset that yields only elements passing the predicate. |
Usage Examples
Basic Usage
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)
# Keep only negative reviews (label == 0)
ds = ds.filter(lambda x: x["label"] == 0)
list(ds.take(3))
# [{'label': 0, 'text': 'simplistic , silly and tedious .'}, ...]