Implementation:Datajuicer Data juicer MaximumLineLengthFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Filtering |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for filtering data samples based on the length of their longest line.
Description
MaximumLineLengthFilter is a filter operator that keeps samples whose longest line has a length within a specified range. It extends Filter and follows the two-phase compute_stats/process pattern. In compute_stats_batched, it splits the text on newlines (or reuses precomputed lines from the context via INTER_LINES fusion), measures each line's length, takes the maximum, and caches it under the max_line_length stats key. The process phase then keeps or drops each sample based on that cached value. The filter helps identify text with abnormal line structure, such as minified code or data dumps with excessively long lines.
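The stat computation described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual code: the `stats` dictionary key, the `compute_max_line_length` helper, and its `lines` parameter (standing in for lines precomputed by another operator) are assumptions made for this example, as is the use of `str.splitlines()` for splitting.

```python
from typing import Dict, List, Optional

STATS_KEY = "stats"  # hypothetical stand-in for the sample's stats field
MAX_LINE_LENGTH = "max_line_length"


def compute_max_line_length(sample: Dict, lines: Optional[List[str]] = None) -> Dict:
    """Cache the length of the longest line in the sample's stats."""
    stats = sample.setdefault(STATS_KEY, {})

    # Skip the work if the stat was already computed for this sample.
    if MAX_LINE_LENGTH in stats:
        return sample

    # Reuse precomputed lines when they are supplied; otherwise split the text.
    if lines is None:
        lines = sample["text"].splitlines()

    # An empty text has no lines, so fall back to a length of 0.
    stats[MAX_LINE_LENGTH] = max((len(line) for line in lines), default=0)
    return sample


sample = compute_max_line_length({"text": "short\na much longer line\nmid"})
print(sample[STATS_KEY][MAX_LINE_LENGTH])
```

Caching the value in the sample means the later keep/drop decision only has to read a number and never re-splits the text.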
Usage
Import this operator when samples should be filtered by the length of their longest line. It can be configured in YAML or instantiated directly in Python.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/filter/maximum_line_length_filter.py
Signature
@OPERATORS.register_module("maximum_line_length_filter")
class MaximumLineLengthFilter(Filter):
    def __init__(self, min_len: int = 10, max_len: int = sys.maxsize, *args, **kwargs):
Import
from data_juicer.ops.filter.maximum_line_length_filter import MaximumLineLengthFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| min_len | int | No | Lower bound on the length of a sample's longest line; samples below it are dropped (default: 10) |
| max_len | int | No | Upper bound on the length of a sample's longest line; samples above it are dropped (default: sys.maxsize) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Filtered samples with max_line_length stat computed |
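To make the contract concrete, the keep/drop decision can be sketched as a simple range check on the cached stat. The helpers `keep_sample` and `filter_samples`, the `stats` key, and the inclusive treatment of both bounds are assumptions of this sketch rather than confirmed details of the operator.

```python
import sys
from typing import Dict, List


def keep_sample(max_line_length: int, min_len: int = 10, max_len: int = sys.maxsize) -> bool:
    """Return True when the longest-line length lies within [min_len, max_len]."""
    # Inclusive bounds are an assumption of this sketch.
    return min_len <= max_line_length <= max_len


def filter_samples(samples: List[Dict], min_len: int = 10, max_len: int = sys.maxsize) -> List[Dict]:
    """Keep only samples whose cached max_line_length passes the range check."""
    return [
        s for s in samples
        if keep_sample(s["stats"]["max_line_length"], min_len, max_len)
    ]


# Hypothetical samples with the stat already computed.
samples = [
    {"text": "hi", "stats": {"max_line_length": 2}},             # longest line too short
    {"text": "ordinary prose", "stats": {"max_line_length": 80}},
    {"text": "minified code", "stats": {"max_line_length": 12000}},  # longest line too long
]

kept = filter_samples(samples, min_len=10, max_len=5000)
print([s["stats"]["max_line_length"] for s in kept])
```

With the default max_len of sys.maxsize, only the lower bound has any practical effect, so an explicit max_len is needed to catch samples with excessively long lines.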
Usage Examples
YAML Configuration
process:
  - maximum_line_length_filter:
      min_len: 10
      max_len: 5000
Python API
from data_juicer.ops.filter.maximum_line_length_filter import MaximumLineLengthFilter
op = MaximumLineLengthFilter(min_len=10, max_len=5000)