

Implementation:Datajuicer Data juicer RangeSpecifiedFieldSelector

From Leeroopedia
Knowledge Sources
Domains Data_Processing, Selection
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for selecting samples whose field values fall within a specified range.

Description

RangeSpecifiedFieldSelector extends Selector and keeps only those samples whose value for the specified field falls within a given rank or percentile range. It computes a lower and an upper bound from the percentile and/or rank parameters, using the more inclusive bound when both are provided. Field values are extracted with support for dot-separated multi-level keys; heapq.nsmallest and heapq.nlargest are then applied in two passes to select the samples within the computed range. If no bounds are provided, the original dataset is returned unchanged.
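The two-pass selection described above can be illustrated with a minimal standalone sketch. This is not the Data-Juicer source: the helper name `select_range`, the conversion of a percentile to an index via `int(p * n)`, and the half-open `[lower_bound, upper_bound)` convention are assumptions made for illustration, and plain Python lists stand in for a dataset.

```python
import heapq

def select_range(values, lower_bound, upper_bound):
    """Hypothetical helper: return the indices of the items whose position in
    ascending order lies in [lower_bound, upper_bound)."""
    # Guard against an inverted range (assumption: treat it as empty).
    upper_bound = max(lower_bound, upper_bound)
    # Pass 1: indices of the `upper_bound` smallest values.
    smallest = heapq.nsmallest(upper_bound, range(len(values)),
                               key=values.__getitem__)
    # Pass 2: among those, keep the (upper_bound - lower_bound) largest,
    # which drops the `lower_bound` smallest values.
    return heapq.nlargest(upper_bound - lower_bound, smallest,
                          key=values.__getitem__)

values = [50, 10, 40, 20, 30, 60, 90, 70, 80, 100]
n = len(values)
lower = int(0.2 * n)  # assumed percentile-to-index conversion
upper = int(0.8 * n)
picked = select_range(values, lower, upper)
selected_values = sorted(values[i] for i in picked)
print(selected_values)
```

With ten values and percentiles 0.2 and 0.8, the sketch drops the two smallest and the two largest values and keeps the six in between.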

Usage

Use this selector when you need a range-based data selection strategy, such as restricting training data to a specific quality or metric band, or removing outliers by keeping only the middle portion of a distribution.

Code Reference

Source Location

Signature

@OPERATORS.register_module("range_specified_field_selector")
class RangeSpecifiedFieldSelector(Selector):
    def __init__(self, field_key: str = "",
                 lower_percentile: Optional[float] = None,
                 upper_percentile: Optional[float] = None,
                 lower_rank: Optional[PositiveInt] = None,
                 upper_rank: Optional[PositiveInt] = None,
                 *args, **kwargs):

Import

from data_juicer.ops.selector.range_specified_field_selector import RangeSpecifiedFieldSelector

I/O Contract

Inputs

Name | Type | Required | Description
--- | --- | --- | ---
field_key | str | No | Target field key. Multi-level fields are separated by '.'. Default: ""
lower_percentile | float | No | Lower bound percentile (0 to 1). Default: None
upper_percentile | float | No | Upper bound percentile (0 to 1). Default: None
lower_rank | PositiveInt | No | Lower bound rank (absolute position). Default: None
upper_rank | PositiveInt | No | Upper bound rank (absolute position). Default: None
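To make the dot-separated field_key convention concrete, here is a small sketch of how a multi-level key could be resolved against a nested sample. The helper `get_nested_value` is hypothetical and operates on plain dictionaries; the actual operator works on dataset objects and may resolve keys differently.

```python
def get_nested_value(sample, field_key):
    """Hypothetical helper: follow a dot-separated key through nested dicts."""
    value = sample
    for part in field_key.split("."):
        # Descend one level per key segment; raises KeyError if a segment is missing.
        value = value[part]
    return value

sample = {"text": "hello", "__dj__stats__": {"text_len": 5}}
text_len = get_nested_value(sample, "__dj__stats__.text_len")
print(text_len)
```

A single-level key such as "text" takes the same path, since splitting it on '.' yields just one segment.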

Outputs

Name | Type | Description
--- | --- | ---
dataset | Dataset | Filtered dataset containing only samples within the specified range

Usage Examples

process:
  - range_specified_field_selector:
      field_key: "__dj__stats__.text_len"
      lower_percentile: 0.2
      upper_percentile: 0.8
