Implementation:Datajuicer Data juicer GeneralFieldFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Filtering |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for filtering data samples based on a general field filter condition.
Description
GeneralFieldFilter is a filter operator that keeps the samples satisfying a general field filter condition expressed as a string. The condition can combine logical operators (and/or) with chained comparisons, e.g., "10 < num <= 30 and text != 'nothing here'". The condition is parsed with Python's ast module and evaluated for each sample, and the result is stored under the general_field_filter_condition stats key. The operator extends the Filter base class and implements the two-phase compute_stats/process pattern.
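To make the parsing step more concrete, the following is a minimal sketch of how a condition string with and/or and chained comparisons could be evaluated against a single sample using Python's ast module. It is not the operator's actual code: the names evaluate_condition and _eval_node are hypothetical, and the sketch assumes field names in the condition map directly to top-level keys of the sample dict.

```python
import ast
import operator

# Comparison operators supported by this sketch.
_CMP_OPS = {
    ast.Lt: operator.lt,
    ast.LtE: operator.le,
    ast.Gt: operator.gt,
    ast.GtE: operator.ge,
    ast.Eq: operator.eq,
    ast.NotEq: operator.ne,
}


def _eval_node(node, sample):
    """Recursively evaluate one AST node against a sample (hypothetical helper)."""
    if isinstance(node, ast.BoolOp):
        # `and` / `or` over any number of operands.
        values = (_eval_node(v, sample) for v in node.values)
        return all(values) if isinstance(node.op, ast.And) else any(values)
    if isinstance(node, ast.Compare):
        # Chained comparison: a < b <= c means (a < b) and (b <= c).
        left = _eval_node(node.left, sample)
        for op, comparator in zip(node.ops, node.comparators):
            fn = _CMP_OPS.get(type(op))
            if fn is None:
                raise ValueError(f"Unsupported operator: {type(op).__name__}")
            right = _eval_node(comparator, sample)
            if not fn(left, right):
                return False
            left = right
        return True
    if isinstance(node, ast.Name):
        # Assumption: a bare name refers to a top-level field of the sample.
        return sample[node.id]
    if isinstance(node, ast.Constant):
        return node.value
    raise ValueError(f"Unsupported expression: {ast.dump(node)}")


def evaluate_condition(condition: str, sample: dict) -> bool:
    """Parse the condition string and evaluate it for one sample."""
    tree = ast.parse(condition, mode="eval")
    return bool(_eval_node(tree.body, sample))


condition = "10 < num <= 30 and text != 'nothing here'"
keep = evaluate_condition(condition, {"num": 20, "text": "hello"})
drop = evaluate_condition(condition, {"num": 20, "text": "nothing here"})
```

Walking the tree explicitly, rather than calling eval on the string, restricts the condition to the node types handled here, so anything else is rejected with an error.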
Usage
Import this operator when you need to filter dataset samples based on arbitrary field conditions using logical expressions. Configure it in your Data-Juicer YAML config or instantiate directly.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/filter/general_field_filter.py
- Lines: 1-125
Signature
@OPERATORS.register_module("general_field_filter")
class GeneralFieldFilter(Filter):
    def __init__(self, filter_condition: str = "", *args, **kwargs):
        ...
Import
from data_juicer.ops.filter.general_field_filter import GeneralFieldFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| filter_condition | str | No | The filter condition as a string; supports logical operators (and/or) and chained comparisons. Default: "" |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Filtered samples with stats field updated (general_field_filter_condition) |
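The contract above follows the two-phase compute_stats/process pattern: the first phase records the evaluation result in the sample's stats, and the second phase reads that result to decide whether to keep the sample. The sketch below illustrates the idea on plain dicts. It makes several assumptions: TwoPhaseFilterSketch is a hypothetical class, not Data-Juicer's Filter; the "stats" key is a stand-in for whatever field name Data-Juicer uses internally; and a Python callable replaces the parsed condition string.

```python
from typing import Callable

STATS_KEY = "stats"  # stand-in for Data-Juicer's stats field name (assumption)
RESULT_KEY = "general_field_filter_condition"


class TwoPhaseFilterSketch:
    """Hypothetical illustration of the compute_stats/process split."""

    def __init__(self, predicate: Callable[[dict], bool]):
        self.predicate = predicate

    def compute_stats(self, sample: dict) -> dict:
        # Phase 1: evaluate the condition once and cache the result in stats.
        stats = sample.setdefault(STATS_KEY, {})
        if RESULT_KEY not in stats:
            stats[RESULT_KEY] = bool(self.predicate(sample))
        return sample

    def process(self, sample: dict) -> bool:
        # Phase 2: keep the sample only if the cached result is True.
        return sample[STATS_KEY][RESULT_KEY]


op = TwoPhaseFilterSketch(lambda s: 10 < s["num"] <= 30)
samples = [{"num": 5}, {"num": 20}, {"num": 30}, {"num": 31}]

with_stats = [op.compute_stats(s) for s in samples]
kept = [s for s in with_stats if op.process(s)]
kept_nums = [s["num"] for s in kept]
```

Keeping the two phases separate means the stored result stays available on each sample for later inspection, independently of the keep/drop decision.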
Usage Examples
YAML Configuration
process:
  - general_field_filter:
      filter_condition: "10 < num <= 30 and text != 'nothing here'"
Python API
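from data_juicer.ops.filter.general_field_filter import GeneralFieldFilter

# Keep only samples whose `num` field lies in the range (10, 30]
op = GeneralFieldFilter(filter_condition="10 < num <= 30")

# Apply to a dataset; `dataset` is assumed to be an already loaded Data-Juicer dataset
result = dataset.process(op)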