Implementation:Huggingface Datatrove LambdaFilter

From Leeroopedia
Knowledge Sources
Domains: Data Processing, Text Filtering
Last Updated: 2026-02-14 17:00 GMT

Overview

LambdaFilter is a lightweight document filter that delegates its keep/drop decision to a user-supplied callable, enabling inline filter logic without a dedicated subclass.

Description

LambdaFilter extends BaseFilter to provide the simplest possible mechanism for custom filtering. Instead of requiring users to define a new class and implement the filter method, LambdaFilter accepts any callable (lambda function, regular function, or callable object) that takes a Document and returns a boolean. The callable is stored as the filter_function attribute and invoked directly in the filter method.

This design follows the Strategy pattern, where the filtering algorithm is injected at construction time rather than defined through class inheritance. It is particularly useful for quick, one-off filters based on metadata checks, text length thresholds, or other simple conditions that do not warrant their own dedicated filter class.
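The injected-strategy shape described above can be sketched without the library itself. SketchDocument and SketchLambdaFilter below are hypothetical stand-ins, not datatrove classes; they only mirror the idea that the callable is stored at construction time and invoked by the filter method.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SketchDocument:
    """Hypothetical stand-in for datatrove's Document (text plus metadata only)."""
    text: str
    metadata: dict = field(default_factory=dict)


class SketchLambdaFilter:
    """Hypothetical stand-in showing the injected-strategy shape, not the library class."""

    def __init__(self, filter_function: Callable[[SketchDocument], bool]):
        # The strategy is supplied at construction time and stored as an attribute.
        self.filter_function = filter_function

    def filter(self, doc: SketchDocument) -> bool:
        # The keep/drop decision is delegated entirely to the injected callable.
        return self.filter_function(doc)


short_text = SketchDocument(text="too short")
long_text = SketchDocument(text="x" * 600)

min_length = SketchLambdaFilter(lambda doc: len(doc.text) >= 500)
kept = [min_length.filter(d) for d in (short_text, long_text)]
```

Swapping the lambda for any other callable changes the filtering behavior without touching the class, which is the point of the pattern.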

The filter inherits all standard BaseFilter behavior including statistics tracking, exclusion writing, and pipeline integration. The only limitation compared to a full subclass is that the callable must be serializable if the pipeline is distributed across processes (plain lambdas may not serialize; named functions or callable objects are safer in distributed settings).
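The serialization caveat can be checked ahead of time. The sketch below uses only the standard-library pickle module as a conservative test; it is an assumption that this matches the serializer a given executor uses, since other serializers such as dill can handle lambdas. The helper can_pickle and the predicate has_text are hypothetical names introduced for illustration.

```python
import pickle


def can_pickle(obj) -> bool:
    """Return True if the standard-library pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def has_text(doc) -> bool:
    # Hypothetical named predicate; module-level functions are pickled by reference.
    return bool(doc.text)


lambda_ok = can_pickle(lambda doc: bool(doc.text))
named_ok = can_pickle(has_text)
print("lambda:", lambda_ok, "named function:", named_ok)
```

A callable that passes this check is a safer choice for filter_function when the pipeline runs across processes.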

Usage

Use LambdaFilter for ad-hoc or prototype filtering logic where creating a dedicated filter subclass would be excessive. It is ideal for quick experiments, metadata-based filtering, or any simple boolean condition on documents.

Code Reference

Source Location

Signature

class LambdaFilter(BaseFilter):
    name = "👤 Lambda"

    def __init__(
        self,
        filter_function: Callable[[Document], bool],
        exclusion_writer: DiskWriter = None,
    ):
        ...

    def filter(self, doc: Document) -> bool:
        ...

Import

from datatrove.pipeline.filters.lambda_filter import LambdaFilter

I/O Contract

Inputs

Name              Type                        Required  Description
filter_function   Callable[[Document], bool]  Yes       A callable that receives a Document and returns True to keep or False to drop
exclusion_writer  DiskWriter                  No        Optional writer for saving dropped documents

Outputs

Name  Type                           Description
data  DocumentsPipeline (generator)  Yields documents for which the filter_function returned True
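The contract above amounts to a generator that yields kept documents and routes the rest elsewhere. The following is a minimal sketch under that reading, not the library's run loop: keep_matching is a hypothetical helper, a plain list stands in for the optional exclusion writer, and SimpleNamespace objects stand in for Document.

```python
from types import SimpleNamespace
from typing import Callable, Iterable, Iterator


def keep_matching(docs: Iterable, predicate: Callable, dropped: list) -> Iterator:
    """Hypothetical helper: yield documents the predicate keeps, collect the rest."""
    for doc in docs:
        if predicate(doc):
            yield doc
        else:
            # Stand-in for the optional exclusion writer.
            dropped.append(doc)


docs = [SimpleNamespace(text="a" * 10), SimpleNamespace(text="b" * 700)]
dropped = []
kept = list(keep_matching(docs, lambda d: len(d.text) >= 500, dropped))
```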

Usage Examples

Basic Usage

from datatrove.pipeline.filters.lambda_filter import LambdaFilter

# Keep only documents with at least 500 characters
length_filter = LambdaFilter(filter_function=lambda doc: len(doc.text) >= 500)

# Keep only documents whose "source" metadata field equals "wikipedia"
metadata_filter = LambdaFilter(
    filter_function=lambda doc: doc.metadata.get("source") == "wikipedia"
)
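When several conditions are combined, or the pipeline is distributed, a named function may be preferable to a lambda. The sketch below defines a hypothetical predicate, is_long_wikipedia_doc, and exercises it on stand-in objects so that it runs without datatrove installed; it assumes only that documents expose text and metadata as in the examples above.

```python
from types import SimpleNamespace


def is_long_wikipedia_doc(doc) -> bool:
    """Hypothetical predicate: keep Wikipedia documents of at least 500 characters."""
    return doc.metadata.get("source") == "wikipedia" and len(doc.text) >= 500


# Stand-in objects exposing only the attributes the predicate reads.
sample_keep = SimpleNamespace(text="w" * 500, metadata={"source": "wikipedia"})
sample_drop = SimpleNamespace(text="w" * 500, metadata={"source": "forum"})

results = [is_long_wikipedia_doc(sample_keep), is_long_wikipedia_doc(sample_drop)]

# With datatrove installed, the predicate is passed like the lambdas above:
# combined_filter = LambdaFilter(filter_function=is_long_wikipedia_doc)
```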
