Implementation:Datajuicer Data juicer RangeSpecifiedFieldSelector

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Processing, Selection
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete Data-Juicer operator that selects the samples whose values for a given field fall within a specified rank or percentile range.

Description

RangeSpecifiedFieldSelector extends Selector and keeps only the samples whose values for the specified field fall within a given rank or percentile range. It computes a lower and an upper bound from the percentile and/or rank parameters; when both a percentile and a rank are given for the same bound, the more inclusive one is used. Field values are read via field_key, which supports dot-separated multi-level keys. heapq.nsmallest and heapq.nlargest are then applied in two passes to select the samples within the computed range. If no bounds are provided, the original dataset is returned.
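
The bound resolution and two-pass selection described above can be sketched on a plain Python list. This is a minimal illustration rather than the operator's actual code: the helper names resolve_bounds and select_range are hypothetical, and the sketch assumes that a percentile maps to the position int(percentile * n), that a side with no parameters is left open-ended, and that the nsmallest pass runs before the nlargest pass.

```python
import heapq
from typing import List, Optional, Sequence, Tuple


def resolve_bounds(n: int,
                   lower_percentile: Optional[float] = None,
                   upper_percentile: Optional[float] = None,
                   lower_rank: Optional[int] = None,
                   upper_rank: Optional[int] = None) -> Tuple[int, int]:
    """Map percentile/rank parameters to positions in the ascending order."""
    lowers = [int(lower_percentile * n)] if lower_percentile is not None else []
    if lower_rank is not None:
        lowers.append(lower_rank)
    uppers = [int(upper_percentile * n)] if upper_percentile is not None else []
    if upper_rank is not None:
        uppers.append(upper_rank)
    # Assumption of this sketch: a side with no parameters is open-ended.
    lower = min(lowers) if lowers else 0  # smaller lower bound keeps more
    upper = max(uppers) if uppers else n  # larger upper bound keeps more
    lower = max(0, min(lower, n))
    upper = max(lower, min(upper, n))
    return lower, upper


def select_range(values: Sequence[float], lower: int, upper: int) -> List[int]:
    """Return indices of the values ranked in [lower, upper) in ascending order."""
    # Pass 1: indices of the `upper` smallest values.
    smallest = heapq.nsmallest(upper, range(len(values)),
                               key=values.__getitem__)
    # Pass 2: among those, the `upper - lower` largest values.
    return heapq.nlargest(upper - lower, smallest, key=values.__getitem__)


values = [50, 10, 40, 20, 30, 90, 70, 60, 80, 100]
lo, hi = resolve_bounds(len(values), lower_percentile=0.2, upper_percentile=0.8)
kept = select_range(values, lo, hi)
print(sorted(values[i] for i in kept))  # [30, 40, 50, 60, 70, 80]
```

Using the heap-based helpers avoids sorting the whole dataset when only a band of it is needed; the indices returned could then be passed to whatever row-selection method the dataset object offers.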

Usage

Use this operator when you need range-based data selection, such as restricting training data to a specific quality or metric band, or removing outliers by keeping only the middle portion of a distribution.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/selector/range_specified_field_selector.py

Signature

@OPERATORS.register_module("range_specified_field_selector")
class RangeSpecifiedFieldSelector(Selector):
    def __init__(self, field_key: str = "",
                 lower_percentile: Optional[float] = None,
                 upper_percentile: Optional[float] = None,
                 lower_rank: Optional[PositiveInt] = None,
                 upper_rank: Optional[PositiveInt] = None,
                 *args, **kwargs):

Import

from data_juicer.ops.selector.range_specified_field_selector import RangeSpecifiedFieldSelector

I/O Contract

Inputs

Name	Type	Required	Description
field_key	str	No	Target field key. Multi-level fields separated by '.'. Default: ""
lower_percentile	float	No	Lower bound percentile (0 to 1). Default: None
upper_percentile	float	No	Upper bound percentile (0 to 1). Default: None
lower_rank	PositiveInt	No	Lower bound rank (absolute position). Default: None
upper_rank	PositiveInt	No	Upper bound rank (absolute position). Default: None
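
To illustrate what a dot-separated field_key means, the sketch below resolves a multi-level key against a nested dictionary. The function get_nested_value is a hypothetical helper written for this page, not part of Data-Juicer, and it assumes each sample behaves like a nested mapping.

```python
from collections.abc import Mapping
from typing import Any


def get_nested_value(sample: Mapping, field_key: str) -> Any:
    """Walk a nested mapping along the '.'-separated parts of field_key."""
    value: Any = sample
    for part in field_key.split("."):
        if not isinstance(value, Mapping) or part not in value:
            raise KeyError(f"'{part}' not found while resolving '{field_key}'")
        value = value[part]
    return value


sample = {"text": "hello", "__dj__stats__": {"text_len": 5}}
print(get_nested_value(sample, "__dj__stats__.text_len"))  # 5
```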

Outputs

Name	Type	Description
dataset	Dataset	Filtered dataset containing only samples within the specified range

Usage Examples

process:
  - range_specified_field_selector:
      field_key: "__dj__stats__.text_len"
      lower_percentile: 0.2
      upper_percentile: 0.8
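
As a rough check of what this configuration keeps, the band can be reproduced by slicing a sorted list. The text_len values below are made up for illustration, and the mapping from a percentile to the position int(percentile * n) is an assumption of this sketch; the operator itself is described as using heapq rather than a full sort.

```python
# Hypothetical text_len statistics for ten samples.
text_lens = [120, 15, 980, 300, 45, 610, 75, 2200, 410, 8]

n = len(text_lens)
lower = int(0.2 * n)  # position 2: the two shortest samples are dropped
upper = int(0.8 * n)  # position 8: the two longest samples are dropped

band = sorted(text_lens)[lower:upper]
print(band)  # [45, 75, 120, 300, 410, 610]
```

With these settings, six of the ten samples remain: the middle 60% of the distribution.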

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
