Implementation:Datajuicer Data juicer FrequencySpecifiedFieldSelector

From Leeroopedia
Knowledge Sources
Domains: Data_Processing, Selection
Last Updated: 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for filtering samples based on how frequently each value of a specified field occurs.

Description

FrequencySpecifiedFieldSelector extends Selector and filters dataset samples based on the frequency (occurrence count) of values in a specified field. It builds a dictionary that maps each unique field value to the indices of the samples containing that value, sorts these value groups by size, and keeps the samples in the top groups. The number of groups kept is set by either top_ratio (a fraction of the unique values, between 0 and 1) or a fixed topk count; when both are given, the one that yields fewer samples applies. The field can be multi-level, with keys separated by dots. The reverse parameter controls the sort order: True (the default) sorts in descending order, so the most frequent values come first, and False sorts in ascending order.
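
The grouping and selection steps described above can be sketched in plain Python. This is a minimal illustration and not Data-Juicer's actual implementation: the function name select_top_frequency_indices is hypothetical, the field values are assumed to be already extracted into a list, and two behaviours are assumptions of this sketch (when both limits are set, the smaller group count wins; when neither is set, every sample is kept).

```python
from collections import defaultdict
from typing import Hashable, List, Optional, Sequence


def select_top_frequency_indices(
    values: Sequence[Hashable],
    top_ratio: Optional[float] = None,
    topk: Optional[int] = None,
    reverse: bool = True,
) -> List[int]:
    """Return indices of samples whose value falls in the selected groups."""
    # Map each unique value to the indices of the samples that carry it.
    groups = defaultdict(list)
    for index, value in enumerate(values):
        groups[value].append(index)

    # Assumption of this sketch: with neither limit set, nothing is filtered.
    if top_ratio is None and topk is None:
        return list(range(len(values)))

    # Number of value groups to keep; top_ratio is a fraction of unique values.
    limits = []
    if top_ratio is not None:
        limits.append(int(top_ratio * len(groups)))
    if topk is not None:
        limits.append(topk)
    num_groups = min(limits)  # assumption: the stricter limit wins

    # Sort groups by size (stable, so ties keep first-seen order) and flatten.
    ranked = sorted(groups.values(), key=len, reverse=reverse)
    return [index for group in ranked[:num_groups] for index in group]


langs = ["en", "zh", "en", "fr", "en", "zh"]
print(select_top_frequency_indices(langs, topk=1))                 # [0, 2, 4]
print(select_top_frequency_indices(langs, topk=1, reverse=False))  # [3]
```

With reverse left at True the single most common value ("en") is kept; with reverse set to False the rarest value ("fr") is kept instead.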

Usage

Use this selector for frequency-based data curation, such as keeping only the samples that carry the most common categories or labels in order to focus a training dataset. With reverse set to false it keeps the least common values instead, which can help when rebalancing a dataset.

Code Reference

Source Location

Signature

@OPERATORS.register_module("frequency_specified_field_selector")
class FrequencySpecifiedFieldSelector(Selector):
    def __init__(self, field_key: str = "",
                 top_ratio: Optional[float] = None,
                 topk: Optional[PositiveInt] = None,
                 reverse: bool = True,
                 *args, **kwargs):

Import

from data_juicer.ops.selector.frequency_specified_field_selector import FrequencySpecifiedFieldSelector

I/O Contract

Inputs

Name | Type | Required | Description
field_key | str | No | Target field key. Multi-level fields separated by '.'. Default: ""
top_ratio | float | No | Ratio of top field values to select (0 to 1). Default: None
topk | PositiveInt | No | Number of top field values to select. Default: None
reverse | bool | No | Sort order: True for descending (most frequent first). Default: True
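
To make the dot-separated field_key convention concrete, the lookup can be sketched as a walk through nested mappings. The helper name get_nested_value and the sample layout are hypothetical and chosen only for illustration; how Data-Juicer itself resolves and validates the key is not shown here.

```python
from typing import Any, Mapping


def get_nested_value(sample: Mapping[str, Any], field_key: str) -> Any:
    """Resolve a dot-separated key such as 'meta.lang' against nested mappings."""
    value: Any = sample
    for key in field_key.split("."):
        # A missing level raises KeyError rather than silently returning None.
        value = value[key]
    return value


# Hypothetical sample with two levels of nesting.
sample = {"text": "hello", "meta": {"lang": "en", "source": {"name": "web"}}}
print(get_nested_value(sample, "meta.lang"))         # en
print(get_nested_value(sample, "meta.source.name"))  # web
```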

Outputs

Name | Type | Description
dataset | Dataset | Filtered dataset containing only the samples whose field values fall in the selected frequency groups (the most frequent values by default)

Usage Examples

process:
  - frequency_specified_field_selector:
      field_key: "__dj__stats__.lang"
      topk: 5
      reverse: true
