

Implementation:Datajuicer Data juicer AverageLineLengthFilter

From Leeroopedia
Knowledge Sources
Domains Data_Quality, Filtering
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for filtering data samples based on their average line length.

Description

AverageLineLengthFilter is a filter operator that keeps samples whose average line length falls within a specified range. It calculates the average line length as the total text length divided by the number of lines. If a context is provided, it reuses the precomputed lines. The computed value is stored under the avg_line_length stats key. The operator extends the Filter base class and implements the two-phase compute_stats/process pattern.
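The calculation described above can be sketched in plain Python. This is a minimal illustration, not Data-Juicer's source: the helper name avg_line_length is hypothetical, and both the use of str.splitlines() to split lines and the 0.0 result for empty text are assumptions.

```python
def avg_line_length(text: str) -> float:
    """Total text length divided by the number of lines (hypothetical helper)."""
    lines = text.splitlines()
    if not lines:
        # Assumption: empty text gets an average line length of 0.0.
        return 0.0
    # The total length counts every character in the text,
    # including the newline characters that separate the lines.
    return len(text) / len(lines)


sample_text = "short\na much longer second line"
print(avg_line_length(sample_text))
```

Because the numerator is the length of the whole text, newline characters contribute to the average in this sketch. A variant that sums the lengths of the split lines would give slightly smaller values.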

Usage

Import this operator when you need to filter dataset samples based on the average number of characters per line. Configure it in your Data-Juicer YAML config or instantiate it directly in Python.

Code Reference

Source Location

Signature

@OPERATORS.register_module("average_line_length_filter")
@INTER_LINES.register_module("average_line_length_filter")
class AverageLineLengthFilter(Filter):
    def __init__(self, min_len: int = 10, max_len: int = sys.maxsize, *args, **kwargs):
        ...

Import

from data_juicer.ops.filter.average_line_length_filter import AverageLineLengthFilter

I/O Contract

Inputs

Name Type Required Description
min_len int No The minimum average line length; samples below this are filtered out. Default: 10
max_len int No The maximum average line length; samples above this are filtered out. Default: sys.maxsize

Outputs

Name Type Description
samples Dict Filtered samples with stats field updated (avg_line_length)
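
The I/O contract above, combined with the two-phase compute_stats/process pattern, can be sketched without Data-Juicer installed. This is an illustration under stated assumptions rather than the operator's actual code: the standalone functions, the "text" and "stats" field names, and the inclusive treatment of both bounds are all assumptions. Only the avg_line_length stats key and the default values come from this page.

```python
import sys

STATS_KEY = "avg_line_length"  # stats key named in the description


def compute_stats(sample: dict) -> dict:
    """Phase 1: store the average line length in the sample's stats."""
    # Assumption: stats live under a "stats" field and text under "text".
    stats = sample.setdefault("stats", {})
    if STATS_KEY in stats:
        return sample  # already computed, nothing to do
    text = sample["text"]
    lines = text.splitlines()
    stats[STATS_KEY] = len(text) / len(lines) if lines else 0.0
    return sample


def process(sample: dict, min_len: int = 10, max_len: int = sys.maxsize) -> bool:
    """Phase 2: return True to keep the sample, False to filter it out."""
    # Assumption: both bounds are inclusive.
    return min_len <= sample["stats"][STATS_KEY] <= max_len


samples = [
    {"text": "tiny\nrows\nhere"},
    {"text": "a reasonably long first line\nand another line of similar size"},
]
kept = [s for s in map(compute_stats, samples) if process(s)]
print(len(kept), "of", len(samples), "samples kept")
```

Separating the stat computation from the keep/drop decision means the stored avg_line_length value stays available on every sample, so thresholds can be adjusted later without recomputing it.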

Usage Examples

YAML Configuration

process:
  - average_line_length_filter:
      min_len: 10
      max_len: 10000

Python API

from data_juicer.ops.filter.average_line_length_filter import AverageLineLengthFilter

op = AverageLineLengthFilter(min_len=10, max_len=10000)
# Apply to a dataset; `dataset` is assumed to be an already-loaded Data-Juicer dataset
result = dataset.process(op)
