Implementation:Datajuicer Data juicer FrequencySpecifiedFieldSelector
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Selection |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool for filtering samples based on field value frequency provided by Data-Juicer.
Description
FrequencySpecifiedFieldSelector extends Selector and filters dataset samples based on the frequency (occurrence count) of values in a specified field. It builds a dictionary mapping each unique field value to the indices of samples containing that value, then sorts value groups by frequency and selects the top groups based on either a top_ratio (percentage of unique values) or a fixed topk count, whichever yields fewer samples. The field can be multi-level with keys separated by dots. The sorting order can be controlled with the reverse parameter (default descending).
Usage
Use when you need frequency-based data curation strategies, such as keeping only samples with the most common categories or labels, which is useful for balancing or focusing training datasets.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/selector/frequency_specified_field_selector.py
Signature
@OPERATORS.register_module("frequency_specified_field_selector")
class FrequencySpecifiedFieldSelector(Selector):
def __init__(self, field_key: str = "",
top_ratio: Optional[float] = None,
topk: Optional[PositiveInt] = None,
reverse: bool = True,
*args, **kwargs):
Import
from data_juicer.ops.selector.frequency_specified_field_selector import FrequencySpecifiedFieldSelector
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| field_key | str | No | Target field key. Multi-level fields separated by '.'. Default: "" |
| top_ratio | float | No | Ratio of top field values to select (0 to 1). Default: None |
| topk | PositiveInt | No | Number of top field values to select. Default: None |
| reverse | bool | No | Sort order: True for descending (most frequent first). Default: True |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | Filtered dataset containing only samples whose field values are among the most frequent |
Usage Examples
process:
- frequency_specified_field_selector:
field_key: "__dj__stats__.lang"
topk: 5
reverse: true