Implementation:Datajuicer Data juicer RangeSpecifiedFieldSelector
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Selection |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A concrete Data-Juicer operator that selects the samples whose values in a specified field fall within a given rank or percentile range.
Description
RangeSpecifiedFieldSelector extends Selector and keeps only the dataset samples whose values for the specified field fall within a given rank or percentile range. It computes a lower and an upper bound from the percentile and/or rank parameters, using the more inclusive bound when both are provided. Field values are read through field_key, which supports dot-separated multi-level keys. Selection then runs in two passes, using heapq.nsmallest and heapq.nlargest to pick the samples inside the computed range. If no bounds are provided, the original dataset is returned unchanged.
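The two-pass selection described above can be illustrated on a plain Python list. This is a minimal sketch, not the operator's actual code: `select_rank_range` and `percentile_to_bound` are hypothetical helpers, and mapping a percentile to a rank by truncating `percentile * n` is an assumption made for the example.

```python
import heapq


def percentile_to_bound(percentile, n):
    # Assumption: a percentile in [0, 1] maps to a rank by truncation.
    return int(percentile * n)


def select_rank_range(values, lower_bound, upper_bound):
    """Return indices whose ascending rank lies in [lower_bound, upper_bound)."""
    n = len(values)
    # Clamp the bounds so that 0 <= lower_bound <= upper_bound <= n.
    lower_bound = max(0, min(lower_bound, n))
    upper_bound = max(lower_bound, min(upper_bound, n))

    # Pass 1: indices of the `upper_bound` smallest values.
    smallest = heapq.nsmallest(upper_bound, range(n), key=values.__getitem__)

    # Pass 2: among those, keep the `upper_bound - lower_bound` largest values.
    return heapq.nlargest(upper_bound - lower_bound, smallest,
                          key=values.__getitem__)


# Hypothetical field values, e.g. text lengths of ten samples.
text_lens = [50, 10, 40, 20, 30, 90, 70, 60, 80, 100]
n = len(text_lens)
kept = select_rank_range(text_lens,
                         percentile_to_bound(0.2, n),
                         percentile_to_bound(0.8, n))
selected = sorted(text_lens[i] for i in kept)
print(selected)  # [30, 40, 50, 60, 70, 80]
```

Working with indices rather than values keeps the result usable for selecting rows from the original dataset, and the two heap passes avoid a full sort when the kept range is small.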
Usage
Use this operator for range-based data selection, such as restricting training data to a specific quality or metric band, or removing outliers by keeping only the middle portion of a distribution.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/selector/range_specified_field_selector.py
Signature
@OPERATORS.register_module("range_specified_field_selector")
class RangeSpecifiedFieldSelector(Selector):
    def __init__(self, field_key: str = "",
                 lower_percentile: Optional[float] = None,
                 upper_percentile: Optional[float] = None,
                 lower_rank: Optional[PositiveInt] = None,
                 upper_rank: Optional[PositiveInt] = None,
                 *args, **kwargs):
Import
from data_juicer.ops.selector.range_specified_field_selector import RangeSpecifiedFieldSelector
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| field_key | str | No | Target field key. Multi-level fields separated by '.'. Default: "" |
| lower_percentile | float | No | Lower bound percentile (0 to 1). Default: None |
| upper_percentile | float | No | Upper bound percentile (0 to 1). Default: None |
| lower_rank | PositiveInt | No | Lower bound rank (absolute position). Default: None |
| upper_rank | PositiveInt | No | Upper bound rank (absolute position). Default: None |
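To make the dot-separated `field_key` convention concrete, the sketch below resolves a multi-level key against a nested dictionary. `get_nested_value` is a hypothetical helper written for illustration only, and the sample record is made up; Data-Juicer's own extraction logic may differ.

```python
def get_nested_value(sample, field_key):
    """Walk a nested mapping following a dot-separated key."""
    value = sample
    for part in field_key.split("."):
        # Each segment of the key descends one level into the sample.
        value = value[part]
    return value


# Hypothetical sample with a nested stats field.
sample = {"text": "hello", "__dj__stats__": {"text_len": 5}}
text_len = get_nested_value(sample, "__dj__stats__.text_len")
print(text_len)  # 5
```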
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | Filtered dataset containing only samples within the specified range |
Usage Examples
process:
  - range_specified_field_selector:
      field_key: "__dj__stats__.text_len"
      lower_percentile: 0.2
      upper_percentile: 0.8