Implementation:Huggingface Datatrove LambdaFilter

Knowledge Sources	Huggingface_Datatrove
Domains	Data Processing, Text Filtering
Last Updated	2026-02-14 17:00 GMT

Overview

LambdaFilter is a lightweight document filter that delegates its keep/drop decision to a user-supplied callable, enabling inline filter logic without a dedicated subclass.

Description

LambdaFilter extends BaseFilter to provide the simplest possible mechanism for custom filtering. Instead of requiring users to define a new class and implement the filter method, LambdaFilter accepts any callable (lambda function, regular function, or callable object) that takes a Document and returns a boolean. The callable is stored as the filter_function attribute and invoked directly in the filter method.

This design follows the Strategy pattern, where the filtering algorithm is injected at construction time rather than defined through class inheritance. It is particularly useful for quick, one-off filters based on metadata checks, text length thresholds, or other simple conditions that do not warrant their own dedicated filter class.
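The delegation described above can be sketched without datatrove installed. The classes below (`Doc`, `SketchLambdaFilter`) are hypothetical stand-ins that only mirror the shape of the documented signature; they are not the library's code and omit everything BaseFilter provides.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Doc:
    """Hypothetical stand-in for datatrove's Document (text plus metadata)."""
    text: str
    metadata: dict = field(default_factory=dict)


class SketchLambdaFilter:
    """Illustrative only: stores a predicate and delegates filter() to it."""

    def __init__(self, filter_function: Callable[[Doc], bool]):
        # The strategy is injected at construction time.
        self.filter_function = filter_function

    def filter(self, doc: Doc) -> bool:
        # The keep/drop decision is whatever the injected callable returns.
        return self.filter_function(doc)


short_text = SketchLambdaFilter(lambda doc: len(doc.text) < 10)
kept = short_text.filter(Doc(text="hello"))
dropped = short_text.filter(Doc(text="a much longer piece of text"))
```

Swapping the lambda for any other callable changes the filtering rule without touching the class, which is the point of injecting the strategy rather than subclassing.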

The filter inherits all standard BaseFilter behavior, including statistics tracking, exclusion writing, and pipeline integration. The one limitation compared to a full subclass is that the callable must be serializable when the pipeline is distributed across processes. Plain lambdas may not serialize, so named functions or callable objects are the safer choice in distributed settings.
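The serialization caveat can be checked ahead of time. The sketch below uses the standard-library `pickle` purely as an assumption for illustration; the serializer your executor actually uses may treat lambdas differently. `is_picklable`, `MinLength`, and `has_text` are hypothetical helper names, not part of datatrove.

```python
import pickle
from dataclasses import dataclass


def is_picklable(obj) -> bool:
    """Return True if the standard-library pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


@dataclass
class MinLength:
    """Hypothetical callable object: keeps documents with enough text."""
    min_chars: int

    def __call__(self, doc) -> bool:
        return len(doc.text) >= self.min_chars


def has_text(doc) -> bool:
    """Hypothetical named predicate defined at module level."""
    return bool(doc.text.strip())


# A lambda has no importable name, so pickle cannot store a reference to it.
lambda_ok = is_picklable(lambda doc: len(doc.text) >= 500)
# Module-level functions and instances of module-level classes are pickled by reference.
named_ok = is_picklable(has_text)
object_ok = is_picklable(MinLength(min_chars=500))
```

When this is run as a script, the lambda is rejected while the named function and the callable instance pass. Either of the latter can be passed as `filter_function` in place of a lambda.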

Usage

Use LambdaFilter for ad-hoc or prototype filtering logic where creating a dedicated filter subclass would be excessive. It is ideal for quick experiments, metadata-based filtering, or any simple boolean condition on documents.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/filters/lambda_filter.py
Lines: 1-29

Signature

class LambdaFilter(BaseFilter):
    name = "👤 Lambda"

    def __init__(
        self,
        filter_function: Callable[[Document], bool],
        exclusion_writer: DiskWriter = None,
    ):
        ...

    def filter(self, doc: Document) -> bool:
        ...

Import

from datatrove.pipeline.filters.lambda_filter import LambdaFilter

I/O Contract

Inputs

Name	Type	Required	Description
filter_function	Callable[[Document], bool]	Yes	A callable that receives a Document and returns True to keep or False to drop
exclusion_writer	DiskWriter	No	Optional writer for saving dropped documents

Outputs

Name	Type	Description
data	DocumentsPipeline (generator)	Yields documents for which the filter_function returned True
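The contract above can be illustrated with a minimal generator, assuming a plain list in place of the exclusion writer. `Doc` and `run_filter` are hypothetical names for this sketch; the real BaseFilter also records statistics and writes dropped documents through a DiskWriter rather than a list.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for datatrove's Document."""
    text: str


def run_filter(
    docs: Iterable[Doc],
    filter_function: Callable[[Doc], bool],
    excluded: Optional[list] = None,
) -> Iterator[Doc]:
    """Yield documents the predicate keeps; collect the rest if a sink is given."""
    for doc in docs:
        if filter_function(doc):
            yield doc
        elif excluded is not None:
            # Stand-in for the optional exclusion writer.
            excluded.append(doc)


docs = [Doc("short"), Doc("long enough to keep"), Doc("tiny")]
dropped: list = []
kept = list(run_filter(docs, lambda d: len(d.text) >= 10, excluded=dropped))
```

Because the result is a generator, documents are evaluated lazily, one at a time, as the downstream step consumes them.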

Usage Examples

Basic Usage

from datatrove.pipeline.filters.lambda_filter import LambdaFilter

# Keep only documents with at least 500 characters
length_filter = LambdaFilter(filter_function=lambda doc: len(doc.text) >= 500)

# Keep only documents whose "source" metadata value is "wikipedia"
metadata_filter = LambdaFilter(
    filter_function=lambda doc: doc.metadata.get("source") == "wikipedia"
)
