Implementation:Datajuicer Data juicer RandomSelector

From Leeroopedia
Revision as of 12:22, 16 February 2026 by Admin (Auto-imported from implementations/Datajuicer_Data_juicer_RandomSelector.md)
Knowledge Sources
Domains Data_Processing, Selection
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool provided by Data-Juicer for randomly selecting a subset of samples from a dataset.

Description

RandomSelector extends Selector and randomly selects a subset of samples based on a specified ratio, a fixed number, or both. It derives the target sample count from select_ratio (a fraction of the dataset) or select_num (an absolute count); when both are provided, it uses whichever yields fewer samples. It then delegates the actual random selection to the random_sample utility function. If neither select_ratio nor select_num is set, the dataset is returned unchanged. Selection is also skipped when the dataset contains one sample or fewer.
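The count logic described above can be sketched in plain Python. This is a minimal illustration rather than Data-Juicer's actual code: the helper names resolve_select_count and random_select are hypothetical, the standard library's random.sample stands in for the random_sample utility, and truncating the ratio-based count with int() and capping it at the dataset size are assumptions.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def resolve_select_count(
    total: int,
    select_ratio: Optional[float] = None,
    select_num: Optional[int] = None,
) -> Optional[int]:
    """Return how many samples to keep, or None to leave the dataset unchanged.

    Hypothetical helper mirroring the rules in the description.
    """
    # Datasets with one sample or fewer are passed through untouched.
    if total <= 1:
        return None
    # With neither parameter set there is nothing to select.
    if select_ratio is None and select_num is None:
        return None

    candidates = []
    if select_ratio is not None:
        # Assumption: the fractional count is truncated toward zero.
        candidates.append(int(select_ratio * total))
    if select_num is not None:
        candidates.append(select_num)

    # When both are given, the smaller count wins.
    # Assumption: never request more samples than the dataset holds.
    return min(min(candidates), total)


def random_select(
    samples: Sequence[T],
    select_ratio: Optional[float] = None,
    select_num: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[T]:
    """Randomly keep a subset of samples (stand-in for the real operator)."""
    count = resolve_select_count(len(samples), select_ratio, select_num)
    if count is None:
        return list(samples)
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    return rng.sample(list(samples), count)
```

For a 10-sample dataset, select_ratio=0.5 alone resolves to 5 samples, while adding select_num=3 lowers the result to 3 because the smaller count takes precedence.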

Usage

Use it when you need a straightforward way to randomly downsample a dataset for experimentation, testing, or resource management in data processing pipelines.

Code Reference

Source Location

Signature

@OPERATORS.register_module("random_selector")
class RandomSelector(Selector):
    def __init__(self, select_ratio: Optional[float] = None,
                 select_num: Optional[PositiveInt] = None,
                 *args, **kwargs):

Import

from data_juicer.ops.selector.random_selector import RandomSelector

I/O Contract

Inputs

Name | Type | Required | Description
select_ratio | float | No | Ratio of samples to select (0 to 1). Default: None
select_num | PositiveInt | No | Absolute number of samples to select. Default: None

Outputs

Name | Type | Description
dataset | Dataset | Randomly sampled subset of the input dataset

Usage Examples

process:
  - random_selector:
      select_ratio: 0.1
