Implementation:Huggingface Datasets Dataset Filter

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, ML_Preprocessing
Last Updated	2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the HuggingFace Datasets library, for selecting dataset rows with a predicate function.

Description

The filter method applies a boolean predicate function to every example in the dataset and returns a new dataset containing only the examples for which the predicate returns True. Internally, it uses map to compute the indices of matching rows and then builds an indices mapping over the original data, so the underlying data is not copied. The method supports both element-wise and batched predicates, multiprocessing, caching, and asynchronous functions. If no function is provided, it defaults to an always-true predicate, so every row is kept.
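The indices-mapping idea described above can be sketched in plain Python. This is a conceptual illustration only, not the library's implementation: `IndexedView` and `filter_view` are hypothetical names, and a list of dicts stands in for the dataset's underlying table.

```python
from typing import Callable, Sequence


class IndexedView:
    """Hypothetical read-only view: shared rows plus the positions that were kept."""

    def __init__(self, rows: Sequence[dict], indices: list[int]):
        self.rows = rows        # reference to the original rows, never copied
        self.indices = indices  # positions of the rows that passed the predicate

    def __len__(self) -> int:
        return len(self.indices)

    def __getitem__(self, i: int) -> dict:
        # Translate a position in the view into a position in the original rows.
        return self.rows[self.indices[i]]


def filter_view(rows: Sequence[dict], predicate: Callable[[dict], bool]) -> IndexedView:
    # Step 1: compute the indices of the matching rows.
    kept = [i for i, row in enumerate(rows) if predicate(row)]
    # Step 2: wrap the original rows with that index list instead of copying them.
    return IndexedView(rows, kept)


rows = [{"label": 1}, {"label": 0}, {"label": 1}]
view = filter_view(rows, lambda row: row["label"] == 1)
print(len(view), view.indices)
```

Because the view only stores positions, filtering costs memory proportional to the number of kept rows rather than to the size of the data itself.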

Usage

Use Dataset.filter when you need to remove examples that do not meet quality criteria, select examples of a particular class, or create focused subsets based on any boolean condition over the data.

Code Reference

Source Location

Repository: datasets
File: src/datasets/arrow_dataset.py
Lines: L3810-L3956

Signature

@transmit_format
@fingerprint_transform(
    inplace=False, ignore_kwargs=["load_from_cache_file", "cache_file_name", "desc"], version="2.0.1"
)
def filter(
    self,
    function: Optional[Callable] = None,
    with_indices: bool = False,
    with_rank: bool = False,
    input_columns: Optional[Union[str, list[str]]] = None,
    batched: bool = False,
    batch_size: Optional[int] = 1000,
    keep_in_memory: bool = False,
    load_from_cache_file: Optional[bool] = None,
    cache_file_name: Optional[str] = None,
    writer_batch_size: Optional[int] = 1000,
    fn_kwargs: Optional[dict] = None,
    num_proc: Optional[int] = None,
    suffix_template: str = "_{rank:05d}_of_{num_proc:05d}",
    new_fingerprint: Optional[str] = None,
    desc: Optional[str] = None,
) -> "Dataset":

Import

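from datasets import load_dataset

# filter is a method on Dataset, so obtain a Dataset first
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
# Keep only the rows whose label equals 1
ds = ds.filter(lambda x: x["label"] == 1)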

I/O Contract

Inputs

Name	Type	Required	Description
function	`Optional[Callable]`	No	Predicate function returning `bool` (element-wise) or `List[bool]` (batched). Defaults to always `True`.
with_indices	`bool`	No	Provide example indices to `function`. Defaults to `False`.
with_rank	`bool`	No	Provide process rank to `function`. Defaults to `False`.
input_columns	`Optional[Union[str, list[str]]]`	No	Columns to pass as positional arguments.
batched	`bool`	No	Whether to provide batches of examples to `function`. Defaults to `False`.
batch_size	`Optional[int]`	No	Number of examples per batch passed to `function` when `batched=True`. Defaults to 1000.
keep_in_memory	`bool`	No	Keep the result in memory instead of writing it to a cache file. Defaults to `False`.
load_from_cache_file	`Optional[bool]`	No	Use cached result if available. Defaults to `True` if caching is enabled.
cache_file_name	`Optional[str]`	No	Path for the cache file, used instead of an automatically generated name.
writer_batch_size	`Optional[int]`	No	Number of rows per write operation for the cache file writer. Defaults to 1000.
fn_kwargs	`Optional[dict]`	No	Keyword arguments passed to `function`.
num_proc	`Optional[int]`	No	Number of processes for multiprocessing. Defaults to `None` (no multiprocessing).
suffix_template	`str`	No	Suffix template appended to `cache_file_name` for each process's cache file.
new_fingerprint	`Optional[str]`	No	The new fingerprint of the dataset after the transform. Computed automatically if `None`.
desc	`Optional[str]`	No	Description displayed alongside the progress bar.

Outputs

Name	Type	Description
return	`Dataset`	A new dataset containing only the rows where the predicate returned `True`.

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")

# Filter for positive reviews only
ds_positive = ds.filter(lambda x: x["label"] == 1)
print(ds_positive.num_rows)
# 533

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Dataset_Filtering

Requires Environment

Environment:Huggingface_Datasets_Python_PyArrow_Core
