Implementation:Datajuicer Data juicer AverageLineLengthFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Filtering |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for filtering data samples by their average line length.
Description
AverageLineLengthFilter is a filter operator that keeps samples whose average line length falls within a specified range. It calculates the average line length as the total text length divided by the number of lines. If a context is provided, it reuses the precomputed lines rather than splitting the text again. The computed value is stored under the avg_line_length stats key. The operator extends the Filter base class and implements the two-phase compute_stats/process pattern: compute_stats records the statistic, and process uses it to decide whether a sample is kept.
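The calculation described above can be sketched in plain Python. This is a minimal illustration and not the operator's actual source: the function name avg_line_length and its optional lines argument (standing in for the context's precomputed lines) are hypothetical, and returning 0.0 for text with no lines is an assumption. Note that under a literal reading of "total text length", newline characters count toward the total.

```python
from typing import List, Optional


def avg_line_length(text: str, lines: Optional[List[str]] = None) -> float:
    """Average characters per line: total text length / number of lines."""
    # Reuse precomputed lines when the caller supplies them (a stand-in for
    # the context mechanism); otherwise split the text here.
    if lines is None:
        lines = text.splitlines()
    # Assumption: text with no lines scores 0.0 instead of raising
    # ZeroDivisionError.
    if not lines:
        return 0.0
    return len(text) / len(lines)


# "ab\ncd" has 5 characters (including the newline) spread over 2 lines.
print(avg_line_length("ab\ncd"))  # 2.5
```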
Usage
Import this operator when you need to filter dataset samples based on the average number of characters per line. Configure it in your Data-Juicer YAML config or instantiate directly.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/filter/average_line_length_filter.py
- Lines: 1-67
Signature
@OPERATORS.register_module("average_line_length_filter")
@INTER_LINES.register_module("average_line_length_filter")
class AverageLineLengthFilter(Filter):
    def __init__(self, min_len: int = 10, max_len: int = sys.maxsize, *args, **kwargs):
        ...
Import
from data_juicer.ops.filter.average_line_length_filter import AverageLineLengthFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| min_len | int | No | The minimum average line length; samples below this are filtered out. Default: 10 |
| max_len | int | No | The maximum average line length; samples above this are filtered out. Default: sys.maxsize |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Filtered samples with stats field updated (avg_line_length) |
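To make the contract concrete, the following sketch mimics the two-phase pattern on plain dictionaries without importing Data-Juicer. It is an illustration under stated assumptions: the "text" and "stats" field names are stand-ins for the library's own field constants, the compute_stats and process functions here are simplified free functions and not the operator's methods, and the inclusive bounds follow the Inputs table (values below min_len or above max_len are filtered out).

```python
import sys
from typing import Any, Dict

TEXT_FIELD = "text"    # assumed text field name
STATS_FIELD = "stats"  # hypothetical stand-in for the library's stats field
STATS_KEY = "avg_line_length"


def compute_stats(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Phase 1: record the average line length in the sample's stats."""
    stats = sample.setdefault(STATS_FIELD, {})
    # Skip recomputation if the statistic is already present.
    if STATS_KEY not in stats:
        text = sample[TEXT_FIELD]
        lines = text.splitlines()
        # Assumption: text with no lines scores 0.0.
        stats[STATS_KEY] = len(text) / len(lines) if lines else 0.0
    return sample


def process(sample: Dict[str, Any], min_len: int = 10,
            max_len: int = sys.maxsize) -> bool:
    """Phase 2: keep the sample only if the stat lies in [min_len, max_len]."""
    return min_len <= sample[STATS_FIELD][STATS_KEY] <= max_len


samples = [
    {"text": "a\nb\nc"},  # very short lines, below the default min_len
    {"text": "This line is long enough.\nSo is this second line."},
]
kept = [s for s in map(compute_stats, samples) if process(s)]
print(len(kept))  # 1
```

Separating the two phases means the statistic stays attached to every sample, so it can be inspected or reused with different thresholds without recomputing it.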
Usage Examples
YAML Configuration
process:
  - average_line_length_filter:
      min_len: 10
      max_len: 10000
Python API
from data_juicer.ops.filter.average_line_length_filter import AverageLineLengthFilter
op = AverageLineLengthFilter(min_len=10, max_len=10000)
# Apply to a dataset (assumes `dataset` is an already-loaded Data-Juicer dataset)
result = dataset.process(op)