Principle:Datajuicer Data juicer Data Selection
| Domains | Data_Processing, Data_Curation |
|---|---|
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A global dataset-level operation pattern that selects a subset of samples based on ranking, frequency, tag matching, range criteria, or random sampling, reducing dataset size while preserving desired characteristics.
Pattern
Selector operators extend the Selector base class and operate on the entire dataset to select a subset of samples. Unlike Filters (which make per-sample keep/discard decisions based on computed statistics), Selectors make global decisions that require knowledge of the full dataset distribution. The pattern follows:
1. Field Extraction -- Extract values from a specified field key across all samples, supporting dot-separated multi-level key paths for accessing nested fields (e.g., __dj__stats__.text_len).
2. Ranking/Matching -- Apply a selection strategy:
   * Top-K -- Use heapq.nlargest / heapq.nsmallest to select the highest- or lowest-scoring samples
   * Frequency -- Count value frequencies and keep samples with the most or least common values
   * Range -- Select samples within percentile or rank bounds
   * Tags -- Keep samples matching a set of target tag values
   * Random -- Uniform random sampling by ratio or count
3. Subset Selection -- Use dataset.select(indices) to produce the filtered output dataset.
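The three steps above can be sketched for the Top-K strategy as follows. This is a minimal illustration, not the library's implementation: it uses a plain list of dicts in place of a real dataset, and the helper names get_nested and select_topk_indices are hypothetical.

```python
import heapq


def get_nested(sample, field_key):
    """Resolve a dot-separated key path such as '__dj__stats__.text_len'."""
    value = sample
    for key in field_key.split("."):
        value = value[key]
    return value


def select_topk_indices(samples, field_key, k, reverse=True):
    """Return indices of the k highest (or lowest) values of field_key."""
    # Step 1: field extraction across all samples.
    values = [get_nested(s, field_key) for s in samples]
    # Step 2: heap-based ranking over indices, O(n log k).
    pick = heapq.nlargest if reverse else heapq.nsmallest
    return pick(k, range(len(values)), key=values.__getitem__)


# Hypothetical toy data; real samples would come from the dataset.
samples = [
    {"text": "short", "__dj__stats__": {"text_len": 5}},
    {"text": "twelve chars", "__dj__stats__": {"text_len": 12}},
    {"text": "abc", "__dj__stats__": {"text_len": 3}},
    {"text": "nine char", "__dj__stats__": {"text_len": 9}},
]

indices = select_topk_indices(samples, "__dj__stats__.text_len", k=2)
# Step 3: with a real dataset this would be dataset.select(indices).
subset = [samples[i] for i in indices]
```

Ranking indices rather than values keeps the result directly usable for the subset-selection step.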
Selection size is controlled via dual parameters: top_ratio (fraction) and topk/select_num (absolute count), using whichever yields fewer samples when both are provided.
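The min-of-both rule can be sketched as below. The function name resolve_select_num is hypothetical, and truncating the ratio-based count with int() is an assumption; the actual operators may round differently.

```python
def resolve_select_num(total, top_ratio=None, topk=None):
    """Combine a ratio-based and a count-based limit, keeping the smaller.

    Returns None when neither limit is given, signalling "no selection".
    """
    if top_ratio is None and topk is None:
        return None
    candidates = []
    if top_ratio is not None:
        # Assumption: the fractional count is truncated toward zero.
        candidates.append(int(total * top_ratio))
    if topk is not None:
        candidates.append(topk)
    # Never request more samples than the dataset holds.
    return min(min(candidates), total)
```

For example, with 1000 samples, top_ratio=0.25 and topk=500 resolve to 250, while top_ratio=0.5 and topk=100 resolve to 100.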
Key Characteristics
- Global dataset operation (requires access to all samples for ranking/distribution analysis)
- Dot-separated multi-level field key access for nested statistics
- Dual selection size control: ratio-based and count-based with min-of-both semantics
- Multiple selection strategies: top-k, frequency, range, tags, random
- Efficient algorithms: heap-based selection (O(n log k)), set-based membership testing
- Configurable sort order (ascending/descending) for ranked selections
- Dataset returned unchanged when fewer than 2 samples or no selection criteria provided
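Two of these characteristics, set-based membership testing and the unchanged-dataset fallback, can be illustrated with a short sketch of tag-based and frequency-based selection. The function names and the top_n parameter are hypothetical, and the functions operate on a list of already-extracted field values rather than a real dataset.

```python
from collections import Counter


def select_by_tags(values, target_tags):
    """Keep indices whose value is one of the target tags."""
    targets = set(target_tags)  # O(1) membership checks
    return [i for i, v in enumerate(values) if v in targets]


def select_by_frequency(values, top_n=None, most_common=True):
    """Keep indices whose value is among the top_n most (or least) frequent."""
    # Fallback: nothing to rank, or no criterion given, so keep everything.
    if len(values) < 2 or top_n is None:
        return list(range(len(values)))
    ordered = Counter(values).most_common()  # descending by count
    if not most_common:
        ordered = ordered[::-1]
    keep = {value for value, _ in ordered[:top_n]}
    return [i for i, v in enumerate(values) if v in keep]
```

Both functions return indices, so their output could feed the same subset-selection step as the ranked strategies.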
Implementations
- Implementation:Datajuicer_Data_juicer_TopkSpecifiedFieldSelector
- Implementation:Datajuicer_Data_juicer_FrequencySpecifiedFieldSelector
- Implementation:Datajuicer_Data_juicer_RangeSpecifiedFieldSelector
- Implementation:Datajuicer_Data_juicer_TagsSpecifiedFieldSelector
- Implementation:Datajuicer_Data_juicer_RandomSelector