Implementation:Datajuicer Data juicer AverageLineLengthFilter

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Quality, Filtering
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for filtering data samples by their average line length.

Description

AverageLineLengthFilter is a filter operator that keeps samples whose average line length falls within a specified range. The average line length is computed as the total text length divided by the number of lines. If a context is provided, the operator uses the precomputed lines from that context. The computed value is stored under the avg_line_length stats key. The class extends the Filter base class and implements the two-phase compute_stats/process pattern.
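The calculation described above can be sketched in plain Python. This is an illustrative sketch, not the operator's source: the helper names average_line_length and within_range are hypothetical, splitting with str.splitlines and returning 0.0 for empty text are assumptions, and the bounds are assumed to be inclusive.

```python
import sys


def average_line_length(text: str) -> float:
    """Total character count divided by the number of lines."""
    lines = text.splitlines()  # assumption: lines are split on line boundaries
    if not lines:
        return 0.0  # assumption: text without any lines gets an average of 0
    return len(text) / len(lines)


def within_range(value: float, min_len: int = 10, max_len: int = sys.maxsize) -> bool:
    """Range check with assumed inclusive bounds, using the documented defaults."""
    return min_len <= value <= max_len


sample_text = "short\na considerably longer second line"
avg = average_line_length(sample_text)
print(avg, within_range(avg))
```

Note that the numerator here is the length of the whole text, so newline characters count toward the total, which matches the "total text length" wording above.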

Usage

Import this operator when you need to filter dataset samples based on the average number of characters per line. Configure it in your Data-Juicer YAML config or instantiate directly.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/filter/average_line_length_filter.py
Lines: 1-67

Signature

@OPERATORS.register_module("average_line_length_filter")
@INTER_LINES.register_module("average_line_length_filter")
class AverageLineLengthFilter(Filter):
    def __init__(self, min_len: int = 10, max_len: int = sys.maxsize, *args, **kwargs):
        ...

Import

from data_juicer.ops.filter.average_line_length_filter import AverageLineLengthFilter

I/O Contract

Inputs

Name	Type	Required	Description
min_len	int	No	The minimum average line length; samples below this are filtered out. Default: 10
max_len	int	No	The maximum average line length; samples above this are filtered out. Default: sys.maxsize

Outputs

Name	Type	Description
samples	Dict	Filtered samples with stats field updated (avg_line_length)
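To make the contract concrete, the two-phase pattern can be imitated with plain dictionaries. This is a sketch under stated assumptions: the "stats" field name and the "text" key are placeholders (the real field names are defined by Data-Juicer), the two functions are stand-ins for the operator's methods rather than its actual code, and skipping recomputation when the stat already exists is assumed behavior.

```python
STATS_FIELD = "stats"          # placeholder for the real stats field name
STAT_KEY = "avg_line_length"   # stats key named in the description


def compute_stats(sample: dict, text_key: str = "text") -> dict:
    """Phase 1: attach the statistic to the sample."""
    stats = sample.setdefault(STATS_FIELD, {})
    if STAT_KEY in stats:
        return sample  # assumption: an existing value is reused, not recomputed
    text = sample[text_key]
    lines = text.splitlines()
    stats[STAT_KEY] = len(text) / len(lines) if lines else 0.0
    return sample


def process(sample: dict, min_len: int = 10, max_len: int = 10000) -> bool:
    """Phase 2: decide whether to keep the sample (bounds assumed inclusive)."""
    return min_len <= sample[STATS_FIELD][STAT_KEY] <= max_len


samples = [
    {"text": "a\nb\nc"},
    {"text": "this line is comfortably longer than ten characters"},
]
kept = [s for s in map(compute_stats, samples) if process(s)]
print([s["text"] for s in kept])
```

Both samples end up with the statistic recorded; only the second passes the range check, since the first averages fewer than 10 characters per line.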

Usage Examples

YAML Configuration

process:
  - average_line_length_filter:
      min_len: 10
      max_len: 10000

Python API

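from data_juicer.ops.filter.average_line_length_filter import AverageLineLengthFilter

# Keep samples whose average line length is between 10 and 10000 characters
op = AverageLineLengthFilter(min_len=10, max_len=10000)
# Apply to a dataset (`dataset` is assumed to be an already-loaded Data-Juicer dataset)
result = dataset.process(op)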

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
