Implementation:Huggingface Datatrove LambdaFilter
| Knowledge Sources | |
|---|---|
| Domains | Data Processing, Text Filtering |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
LambdaFilter is a lightweight document filter that delegates its keep/drop decision to a user-supplied callable, enabling inline filter logic without a dedicated subclass.
Description
LambdaFilter extends BaseFilter to provide the simplest possible mechanism for custom filtering. Instead of requiring users to define a new class and implement the filter method, LambdaFilter accepts any callable (lambda function, regular function, or callable object) that takes a Document and returns a boolean. The callable is stored as the filter_function attribute and invoked directly in the filter method.
This design follows the Strategy pattern, where the filtering algorithm is injected at construction time rather than defined through class inheritance. It is particularly useful for quick, one-off filters based on metadata checks, text length thresholds, or other simple conditions that do not warrant their own dedicated filter class.
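The contrast between injecting the rule and subclassing for it can be sketched in plain Python. This is a minimal illustration, not datatrove's code: `Doc`, `SketchFilter`, `SketchLambdaFilter`, and `MinLengthFilter` are hypothetical stand-ins, and only the `filter_function` attribute name and the `filter` method come from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Doc:
    """Hypothetical stand-in for datatrove's Document (text plus metadata)."""
    text: str
    metadata: dict = field(default_factory=dict)


class SketchFilter:
    """Hypothetical stand-in for BaseFilter: subclasses decide via filter()."""

    def filter(self, doc: Doc) -> bool:
        raise NotImplementedError


class SketchLambdaFilter(SketchFilter):
    """Strategy injected at construction time instead of via a subclass."""

    def __init__(self, filter_function: Callable[[Doc], bool]):
        self.filter_function = filter_function

    def filter(self, doc: Doc) -> bool:
        # Delegate the keep/drop decision directly to the stored callable.
        return self.filter_function(doc)


class MinLengthFilter(SketchFilter):
    """The same rule written the inheritance way, for comparison."""

    def filter(self, doc: Doc) -> bool:
        return len(doc.text) >= 5


injected = SketchLambdaFilter(lambda doc: len(doc.text) >= 5)
subclassed = MinLengthFilter()
docs = [Doc("short"), Doc("tiny"), Doc("long enough")]

# Both styles reach the same decision for every document.
decisions = [(injected.filter(d), subclassed.filter(d)) for d in docs]
```

Both filters agree on every document; the injected version simply avoids writing a class for a one-line rule.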
The filter inherits all standard BaseFilter behavior, including statistics tracking, exclusion writing, and pipeline integration. The one limitation compared to a full subclass is that the callable must be serializable if the pipeline is distributed across processes. Plain lambdas may not serialize, so named functions or callable objects are safer in distributed settings.
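One way to check a callable before handing it to a distributed run is to try serializing it. The sketch below assumes the standard library `pickle` module as the serializer, which may differ from what a given executor actually uses; `MinLength` and `can_pickle` are hypothetical helpers introduced only for illustration.

```python
import pickle


class MinLength:
    """Hypothetical callable object: keeps documents with enough text."""

    def __init__(self, min_chars: int):
        self.min_chars = min_chars

    def __call__(self, doc) -> bool:
        return len(doc.text) >= self.min_chars


def can_pickle(obj) -> bool:
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A lambda has no importable name, so standard pickle rejects it.
lambda_ok = can_pickle(lambda doc: len(doc.text) >= 500)

# A callable object is pickled by reference to its class, which works
# when that class is importable from the module where it is defined.
callable_ok = can_pickle(MinLength(500))
```

A callable object such as `MinLength(500)` also carries its threshold as state, so the same class can be reused with different limits as the `filter_function` argument.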
Usage
Use LambdaFilter for ad-hoc or prototype filtering logic where creating a dedicated filter subclass would be excessive. It is ideal for quick experiments, metadata-based filtering, or any simple boolean condition on documents.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/filters/lambda_filter.py
- Lines: 1-29
Signature
class LambdaFilter(BaseFilter):
    name = "👤 Lambda"

    def __init__(
        self,
        filter_function: Callable[[Document], bool],
        exclusion_writer: DiskWriter = None,
    ):
        ...

    def filter(self, doc: Document) -> bool:
        ...
Import
from datatrove.pipeline.filters.lambda_filter import LambdaFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| filter_function | Callable[[Document], bool] | Yes | A callable that receives a Document and returns True to keep or False to drop |
| exclusion_writer | DiskWriter | No | Optional writer for saving dropped documents |
Outputs
| Name | Type | Description |
|---|---|---|
| data | DocumentsPipeline (generator) | Yields documents for which the filter_function returned True |
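The contract in the tables above can be approximated with a plain generator. This is a simplified sketch under stated assumptions rather than the BaseFilter implementation: `run_filter` is a hypothetical function, the `excluded` list stands in for the optional exclusion_writer, `SimpleNamespace` objects stand in for Document, and statistics tracking is omitted.

```python
from types import SimpleNamespace
from typing import Callable, Iterable, Iterator, Optional


def run_filter(
    data: Iterable,
    filter_function: Callable[[object], bool],
    excluded: Optional[list] = None,
) -> Iterator:
    """Yield documents the callable keeps; optionally collect the rest."""
    for doc in data:
        if filter_function(doc):
            yield doc
        elif excluded is not None:
            # Stand-in for handing a dropped document to an exclusion writer.
            excluded.append(doc)


docs = [SimpleNamespace(text="a" * 600), SimpleNamespace(text="too short")]
dropped = []
kept = list(run_filter(docs, lambda doc: len(doc.text) >= 500, excluded=dropped))
```

Because the output is a generator, nothing is filtered until a downstream step consumes it; `list(...)` forces that here.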
Usage Examples
Basic Usage
from datatrove.pipeline.filters.lambda_filter import LambdaFilter
# Keep only documents with at least 500 characters
length_filter = LambdaFilter(filter_function=lambda doc: len(doc.text) >= 500)
# Keep only documents that have a specific metadata field
metadata_filter = LambdaFilter(
    filter_function=lambda doc: doc.metadata.get("source") == "wikipedia"
)