Implementation:Datajuicer Data juicer RandomSelector
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Selection |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for randomly selecting a subset of samples from a dataset.
Description
RandomSelector extends Selector and randomly selects a subset of samples based on either a specified ratio or a fixed number. It computes the desired sample count from select_ratio (fraction of dataset) or select_num (absolute count), using whichever yields fewer samples when both are provided, then delegates to the random_sample utility function to perform the actual random selection. If neither select_ratio nor select_num is set, the dataset remains unchanged. Selection is skipped if the dataset has one or fewer samples.
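The counting rule described above can be sketched in plain Python. This is a minimal illustration, not the Data-Juicer source: the helper names compute_select_count and random_subset are hypothetical, random.sample from the standard library stands in for the random_sample utility, and truncating the ratio-based count with int() is an assumption about rounding.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def compute_select_count(total: int,
                         select_ratio: Optional[float] = None,
                         select_num: Optional[int] = None) -> int:
    """Hypothetical helper: number of samples to keep out of `total`."""
    candidates = []
    if select_ratio is not None:
        # Assumption: the fractional count is truncated toward zero.
        candidates.append(int(select_ratio * total))
    if select_num is not None:
        candidates.append(select_num)
    if not candidates:
        # Neither parameter set: keep everything.
        return total
    # When both are set, the smaller count wins; never exceed the dataset size.
    return min(min(candidates), total)


def random_subset(samples: Sequence[T],
                  select_ratio: Optional[float] = None,
                  select_num: Optional[int] = None,
                  seed: Optional[int] = None) -> List[T]:
    """Hypothetical stand-in for the selector's behavior on a plain sequence."""
    # Selection is skipped for datasets with one or fewer samples.
    if len(samples) <= 1:
        return list(samples)
    # With neither parameter set, the dataset is returned unchanged.
    if select_ratio is None and select_num is None:
        return list(samples)
    count = compute_select_count(len(samples), select_ratio, select_num)
    # random.sample draws `count` distinct items without replacement.
    return random.Random(seed).sample(list(samples), count)
```

For a 100-sample dataset under these assumptions, select_ratio=0.5 with select_num=20 keeps 20 samples, while select_ratio=0.5 with select_num=80 keeps 50, because the smaller of the two counts is used.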
Usage
Use when you need a straightforward downsampling mechanism to reduce dataset size randomly for experimentation, testing, or resource management in data processing pipelines.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/selector/random_selector.py
Signature
@OPERATORS.register_module("random_selector")
class RandomSelector(Selector):
    def __init__(self, select_ratio: Optional[float] = None,
                 select_num: Optional[PositiveInt] = None,
                 *args, **kwargs):
Import
from data_juicer.ops.selector.random_selector import RandomSelector
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| select_ratio | float | No | Ratio of samples to select (0 to 1). Default: None |
| select_num | PositiveInt | No | Number of samples to select. If select_ratio is also set, whichever yields fewer samples is used. Default: None |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | Randomly sampled subset of the input dataset |
Usage Examples
process:
  - random_selector:
      select_ratio: 0.1