Implementation:Huggingface Datasets IterableDataset Filter

From Leeroopedia
Knowledge Sources
Domains: Data_Engineering, NLP
Last Updated: 2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the HuggingFace Datasets library, for lazily filtering the elements of a streaming dataset with a predicate function.

Description

IterableDataset.filter wraps the dataset's internal example iterable with a FilteredExamplesIterable. The predicate function is stored but not evaluated until the dataset is iterated. At iteration time, each element (or batch) is passed through the predicate, and only elements for which the predicate returns True are yielded.

Internally, the method:

  1. Normalizes input_columns from string to list.
  2. If features or formatting are present, wraps the iterable with FormattedExamplesIterable to ensure the predicate receives properly decoded data (important for feature types like Image and Audio).
  3. Wraps the iterable in FilteredExamplesIterable with the predicate function and configuration.
  4. Returns a new IterableDataset preserving the original info, split, formatting, and distributed settings.
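Steps 1 and 3 above can be modeled in plain Python. This is a conceptual sketch only: normalize_input_columns and LazyFilter are hypothetical names, not the library's classes, and the formatting and metadata handling of steps 2 and 4 is left out.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Union


def normalize_input_columns(
    input_columns: Optional[Union[str, List[str]]],
) -> Optional[List[str]]:
    # Step 1: a single column name becomes a one-element list.
    if isinstance(input_columns, str):
        return [input_columns]
    return input_columns


class LazyFilter:
    """Hypothetical stand-in for a filtering wrapper around an example iterable."""

    def __init__(
        self,
        examples: Iterable[dict],
        predicate: Callable[..., bool],
        input_columns: Optional[Union[str, List[str]]] = None,
    ) -> None:
        # Step 3: store the predicate and configuration; nothing is evaluated here.
        self.examples = examples
        self.predicate = predicate
        self.input_columns = normalize_input_columns(input_columns)

    def __iter__(self) -> Iterator[dict]:
        for example in self.examples:
            if self.input_columns is None:
                args = [example]  # pass the whole example dict
            else:
                args = [example[col] for col in self.input_columns]
            if self.predicate(*args):
                yield example


rows = [{"text": "good", "label": 1}, {"text": "bad", "label": 0}]
kept = list(LazyFilter(rows, lambda label: label == 0, input_columns="label"))
```

The wrapper yields whole examples even when the predicate only sees selected columns, which mirrors how a filter selects rows without changing their contents.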

If the predicate function is asynchronous, the filter operation runs it in parallel with up to one thousand simultaneous calls.
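The idea of capping the number of simultaneous predicate calls can be illustrated with the standard library. This is a sketch of bounded concurrency using asyncio.Semaphore, not the library's internal implementation; filter_concurrently and is_short are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List


async def filter_concurrently(
    examples: List[dict],
    predicate: Callable[[dict], Awaitable[bool]],
    max_concurrency: int = 1000,
) -> List[dict]:
    # At most max_concurrency predicate calls are in flight at any time.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(example: dict) -> bool:
        async with semaphore:
            return await predicate(example)

    # gather preserves input order, so the mask lines up with examples.
    mask = await asyncio.gather(*(guarded(example) for example in examples))
    return [example for example, keep in zip(examples, mask) if keep]


async def is_short(example: dict) -> bool:
    await asyncio.sleep(0)  # stands in for real asynchronous work, e.g. an API call
    return len(example["text"]) < 5


rows = [{"text": "ok"}, {"text": "far too long"}, {"text": "fine"}]
kept = asyncio.run(filter_concurrently(rows, is_short, max_concurrency=2))
```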

Usage

Use IterableDataset.filter when you need to select a subset of streaming examples based on their content, labels, or other properties without downloading the full dataset.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/iterable_dataset.py
  • Lines: L2930-L3013

Signature

def filter(
    self,
    function: Optional[Callable] = None,
    with_indices=False,
    input_columns: Optional[Union[str, list[str]]] = None,
    batched: bool = False,
    batch_size: Optional[int] = 1000,
    fn_kwargs: Optional[dict] = None,
) -> "IterableDataset":

Import

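from datasets import load_dataset

# "my_dataset" is a placeholder for a dataset repository id or local path
ds = load_dataset("my_dataset", split="train", streaming=True)
# filter is a method on the returned IterableDataset;
# my_predicate is a placeholder for any callable that returns a bool
ds = ds.filter(my_predicate)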

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| function | Optional[Callable] | No | Predicate function returning bool (or list of bools if batched). Defaults to always True. |
| with_indices | bool | No | If True, passes element indices to the function. Defaults to False. |
| input_columns | Optional[Union[str, list[str]]] | No | Columns to pass as positional arguments. If None, passes entire example dict. |
| batched | bool | No | If True, provides batches of examples to the function. Defaults to False. |
| batch_size | Optional[int] | No | Number of examples per batch when batched=True. Defaults to 1000. |
| fn_kwargs | Optional[dict] | No | Additional keyword arguments passed to the function. |

Outputs

| Name | Type | Description |
|------|------|-------------|
| dataset | IterableDataset | A new streaming dataset that yields only elements passing the predicate. |

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train", streaming=True)

# Keep only negative reviews (label == 0)
ds = ds.filter(lambda x: x["label"] == 0)
list(ds.take(3))
# [{'label': 0, 'text': 'simplistic , silly and tedious .'}, ...]

Related Pages

Implements Principle
