Implementation:Huggingface Datasets Dataset Filter

From Leeroopedia
Revision as of 12:58, 16 February 2026 by Admin (Auto-imported from implementations/Huggingface_Datasets_Dataset_Filter.md)
Knowledge Sources
Domains Data_Engineering, ML_Preprocessing
Last Updated 2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the HuggingFace Datasets library, for selecting dataset rows based on a predicate function.

Description

The filter method applies a boolean predicate function to all examples in the dataset and returns a new dataset containing only the examples for which the predicate returns True. Internally, it uses map to compute indices of matching rows and then creates an indices mapping over the original data, so the underlying data is not copied. The method supports both element-wise and batched predicates, multiprocessing, caching, and asynchronous functions. If no function is provided, it defaults to an always-true predicate.
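
The indices-mapping idea described above can be illustrated with a small pure-Python sketch. This is not the library's code: the IndexedView class and its attributes are hypothetical names invented for illustration, and the real implementation operates on Arrow tables rather than Python lists. The sketch only shows why filtering can avoid copying rows: the filtered object shares the original storage and records which row positions remain visible.

```python
from typing import Any, Callable, Optional

Row = dict[str, Any]


class IndexedView:
    """Hypothetical toy stand-in for a dataset: shared rows plus a list of visible row indices."""

    def __init__(self, rows: list[Row], indices: Optional[list[int]] = None):
        self._rows = rows  # shared storage, never copied by filter()
        self._indices = list(range(len(rows))) if indices is None else indices

    def __len__(self) -> int:
        return len(self._indices)

    def __getitem__(self, i: int) -> Row:
        # Translate the position in the view to a position in the shared storage.
        return self._rows[self._indices[i]]

    def filter(self, function: Optional[Callable[[Row], bool]] = None) -> "IndexedView":
        if function is None:
            # Mirrors the documented default: an always-true predicate keeps every row.
            function = lambda row: True
        kept = [idx for idx in self._indices if function(self._rows[idx])]
        # The new view reuses the same row storage and only carries a new index list.
        return IndexedView(self._rows, kept)


rows = [
    {"text": "good", "label": 1},
    {"text": "bad", "label": 0},
    {"text": "great", "label": 1},
]
view = IndexedView(rows)
positive = view.filter(lambda row: row["label"] == 1)
print(len(positive), positive[1]["text"])  # 2 great
```

In this sketch the original view is left untouched, which matches the description of filter returning a new dataset rather than modifying the existing one.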

Usage

Use Dataset.filter when you need to remove examples that do not meet quality criteria, select examples of a particular class, or create focused subsets based on any boolean condition over the data.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_dataset.py
  • Lines: L3810-L3956

Signature

@transmit_format
@fingerprint_transform(
    inplace=False, ignore_kwargs=["load_from_cache_file", "cache_file_name", "desc"], version="2.0.1"
)
def filter(
    self,
    function: Optional[Callable] = None,
    with_indices: bool = False,
    with_rank: bool = False,
    input_columns: Optional[Union[str, list[str]]] = None,
    batched: bool = False,
    batch_size: Optional[int] = 1000,
    keep_in_memory: bool = False,
    load_from_cache_file: Optional[bool] = None,
    cache_file_name: Optional[str] = None,
    writer_batch_size: Optional[int] = 1000,
    fn_kwargs: Optional[dict] = None,
    num_proc: Optional[int] = None,
    suffix_template: str = "_{rank:05d}_of_{num_proc:05d}",
    new_fingerprint: Optional[str] = None,
    desc: Optional[str] = None,
) -> "Dataset":

Import

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
ds = ds.filter(lambda x: x["label"] == 1)

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| function | Optional[Callable] | No | Predicate function returning bool (element-wise) or List[bool] (batched). Defaults to always True. |
| with_indices | bool | No | Provide example indices to function. Defaults to False. |
| with_rank | bool | No | Provide process rank to function. Defaults to False. |
| input_columns | Optional[Union[str, list[str]]] | No | Columns to pass as positional arguments. |
| batched | bool | No | Whether to provide batches of examples to function. Defaults to False. |
| batch_size | Optional[int] | No | Number of examples per batch. Defaults to 1000. |
| keep_in_memory | bool | No | Keep result in memory. Defaults to False. |
| load_from_cache_file | Optional[bool] | No | Use cached result if available. Defaults to True if caching is enabled. |
| cache_file_name | Optional[str] | No | Path for the cache file. |
| writer_batch_size | Optional[int] | No | Rows per write operation. Defaults to 1000. |
| fn_kwargs | Optional[dict] | No | Keyword arguments passed to function. |
| num_proc | Optional[int] | No | Number of processes for multiprocessing. |
| suffix_template | str | No | Suffix template for shard cache files. |
| new_fingerprint | Optional[str] | No | The new fingerprint after transform. |
| desc | Optional[str] | No | Description displayed alongside the progress bar. |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| return | Dataset | A new dataset containing only the rows where the predicate returned True. |

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")

# Filter for positive reviews only
ds_positive = ds.filter(lambda x: x["label"] == 1)
print(ds_positive.num_rows)
# 533
