Implementation:Huggingface Datasets Dataset Filter
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ML_Preprocessing |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the HuggingFace Datasets library, for selecting dataset rows with a predicate function.
Description
The filter method applies a boolean predicate function to all examples in the dataset and returns a new dataset containing only the examples for which the predicate returns True. Internally, it uses map to compute indices of matching rows and then creates an indices mapping over the original data, so the underlying data is not copied. The method supports both element-wise and batched predicates, multiprocessing, caching, and asynchronous functions. If no function is provided, it defaults to an always-true predicate.
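The indices-mapping idea described above can be illustrated with a minimal pure-Python sketch. The `IndexedView` class and its attributes are hypothetical names invented for this illustration; this is not the library's implementation, which operates on Arrow tables. It only shows how a filter can record matching positions instead of copying rows.

```python
from typing import Any, Callable, Optional


class IndexedView:
    """Hypothetical stand-in for a dataset that exposes rows through an index list."""

    def __init__(self, rows: list[dict[str, Any]], indices: Optional[list[int]] = None):
        self._rows = rows  # shared storage, never copied
        self._indices = list(range(len(rows))) if indices is None else indices

    def __len__(self) -> int:
        return len(self._indices)

    def __getitem__(self, i: int) -> dict[str, Any]:
        # Resolve the visible position through the indices mapping.
        return self._rows[self._indices[i]]

    def filter(self, predicate: Optional[Callable[[dict[str, Any]], bool]] = None) -> "IndexedView":
        if predicate is None:
            # Mirrors the documented default: keep every row.
            predicate = lambda row: True
        kept = [idx for idx in self._indices if predicate(self._rows[idx])]
        # The new view shares the same row storage and only differs in its indices.
        return IndexedView(self._rows, kept)


rows = [{"label": 1}, {"label": 0}, {"label": 1}]
view = IndexedView(rows)
positives = view.filter(lambda row: row["label"] == 1)
```

Here `positives` exposes two rows, yet both are the same objects held in `rows`, and `view` still exposes all three.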
Usage
Use Dataset.filter when you need to remove examples that do not meet quality criteria, select examples of a particular class, or create focused subsets based on any boolean condition over the data.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: L3810-L3956
Signature
@transmit_format
@fingerprint_transform(
    inplace=False, ignore_kwargs=["load_from_cache_file", "cache_file_name", "desc"], version="2.0.1"
)
def filter(
    self,
    function: Optional[Callable] = None,
    with_indices: bool = False,
    with_rank: bool = False,
    input_columns: Optional[Union[str, list[str]]] = None,
    batched: bool = False,
    batch_size: Optional[int] = 1000,
    keep_in_memory: bool = False,
    load_from_cache_file: Optional[bool] = None,
    cache_file_name: Optional[str] = None,
    writer_batch_size: Optional[int] = 1000,
    fn_kwargs: Optional[dict] = None,
    num_proc: Optional[int] = None,
    suffix_template: str = "_{rank:05d}_of_{num_proc:05d}",
    new_fingerprint: Optional[str] = None,
    desc: Optional[str] = None,
) -> "Dataset":
Import
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
ds = ds.filter(lambda x: x["label"] == 1)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| function | Optional[Callable] | No | Predicate function returning bool (element-wise) or List[bool] (batched). Defaults to always True. |
| with_indices | bool | No | Provide example indices to function. Defaults to False. |
| with_rank | bool | No | Provide process rank to function. Defaults to False. |
| input_columns | Optional[Union[str, list[str]]] | No | Columns to pass as positional arguments. |
| batched | bool | No | Whether to provide batches of examples to function. Defaults to False. |
| batch_size | Optional[int] | No | Number of examples per batch. Defaults to 1000. |
| keep_in_memory | bool | No | Keep result in memory. Defaults to False. |
| load_from_cache_file | Optional[bool] | No | Use cached result if available. Defaults to True if caching is enabled. |
| cache_file_name | Optional[str] | No | Path for the cache file. |
| writer_batch_size | Optional[int] | No | Rows per write operation. Defaults to 1000. |
| fn_kwargs | Optional[dict] | No | Keyword arguments passed to function. |
| num_proc | Optional[int] | No | Number of processes for multiprocessing. |
| suffix_template | str | No | Suffix template for shard cache files. |
| new_fingerprint | Optional[str] | No | The new fingerprint after transform. |
| desc | Optional[str] | No | Description displayed alongside the progress bar. |
Outputs
| Name | Type | Description |
|---|---|---|
| return | Dataset | A new dataset containing only the rows where the predicate returned True. |
Usage Examples
Basic Usage
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
# Filter for positive reviews only
ds_positive = ds.filter(lambda x: x["label"] == 1)
print(ds_positive.num_rows)
# 533