Principle:Datajuicer Data juicer Data Selection

Domains	Data_Processing, Data_Curation
Last Updated	2026-02-14 17:00 GMT

Overview

A global dataset-level operation pattern that selects a subset of samples based on ranking, frequency, tag matching, range criteria, or random sampling, reducing dataset size while preserving desired characteristics.

Pattern

Selector operators extend the Selector base class and operate on the entire dataset to select a subset of samples. Unlike Filters (which make per-sample keep/discard decisions based on computed statistics), Selectors make global decisions that require knowledge of the full dataset distribution. The pattern follows:

1. Field Extraction -- Extract values from a specified field key across all samples, supporting dot-separated multi-level key paths for accessing nested fields (e.g., __dj__stats__.text_len).

2. Ranking/Matching -- Apply a selection strategy:

  * Top-K -- Use heapq.nlargest/heapq.nsmallest to select highest/lowest scoring samples
  * Frequency -- Count value frequencies and keep samples with the most/least common values
  * Range -- Select samples within percentile or rank bounds
  * Tags -- Keep samples matching a set of target tag values
  * Random -- Uniform random sampling by ratio or count

3. Subset Selection -- Use dataset.select(indices) to produce the filtered output dataset.
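
The three steps above can be sketched on a plain list of dicts. This is a minimal illustration rather than the Data-Juicer implementation: the helper names `get_nested` and `select_topk_indices` are hypothetical, and the final list comprehension stands in for the `dataset.select(indices)` call mentioned above.

```python
import heapq


def get_nested(sample, key_path):
    """Follow a dot-separated key path through nested dicts (hypothetical helper)."""
    value = sample
    for part in key_path.split("."):
        value = value[part]
    return value


def select_topk_indices(samples, field_key, k, reverse=True):
    """Return indices of the k samples with the largest (or smallest) field value."""
    # Step 1: field extraction across all samples.
    values = [get_nested(s, field_key) for s in samples]
    # Step 2: heap-based ranking of indices, keyed by the extracted value.
    pick = heapq.nlargest if reverse else heapq.nsmallest
    return pick(k, range(len(values)), key=values.__getitem__)


samples = [
    {"text": "a", "__dj__stats__": {"text_len": 1}},
    {"text": "bbbb", "__dj__stats__": {"text_len": 4}},
    {"text": "cc", "__dj__stats__": {"text_len": 2}},
    {"text": "ddd", "__dj__stats__": {"text_len": 3}},
]

indices = select_topk_indices(samples, "__dj__stats__.text_len", k=2)
# Step 3: subset selection (a stand-in for dataset.select(indices)).
subset = [samples[i] for i in indices]
```

Ranking indices rather than values keeps the original sample positions available for the final selection step.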

Selection size is controlled via dual parameters: top_ratio (fraction) and topk/select_num (absolute count), using whichever yields fewer samples when both are provided.
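
The min-of-both rule can be illustrated with a small helper. The function name `resolve_select_num` is hypothetical, and truncating the ratio-based count with `int()` is an assumption; the actual operators may round differently.

```python
def resolve_select_num(total, top_ratio=None, topk=None):
    """Return how many samples to keep, or None when no criterion is given."""
    candidates = []
    if top_ratio is not None:
        # Assumption: the ratio-based count is truncated toward zero.
        candidates.append(int(total * top_ratio))
    if topk is not None:
        candidates.append(topk)
    if not candidates:
        return None  # caller would return the dataset unchanged
    # Whichever criterion yields fewer samples wins, capped at the dataset size.
    return min(min(candidates), total)
```

For example, with 1000 samples, `top_ratio=0.5` and `topk=200` resolve to 200, while `top_ratio=0.25` and `topk=500` resolve to 250.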

Key Characteristics

* Global dataset operation (requires access to all samples for ranking/distribution analysis)
* Dot-separated multi-level field key access for nested statistics
* Dual selection size control: ratio-based and count-based with min-of-both semantics
* Multiple selection strategies: top-k, frequency, range, tags, random
* Efficient algorithms: heap-based selection (O(n log k)), set-based membership testing
* Configurable sort order (ascending/descending) for ranked selections
* Dataset returned unchanged when it has fewer than 2 samples or when no selection criteria are provided
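
The remaining strategies (tags, frequency, range, random) can be sketched as index-producing functions over already extracted field values. All function names and parameters here are hypothetical, the range bounds are assumed to be zero-based and half-open, and the early return mirrors the "fewer than 2 samples" behavior listed above; none of this is taken from the Data-Juicer source.

```python
import random
from collections import Counter


def select_by_tags(values, target_tags):
    """Indices whose value is in the target tag set (set-based membership)."""
    targets = set(target_tags)
    return [i for i, v in enumerate(values) if v in targets]


def select_by_frequency(values, num_groups, most_common=True):
    """Indices of samples whose value is among the num_groups most (or least) common."""
    counts = Counter(values).most_common()  # sorted by descending count
    ranked = counts if most_common else counts[::-1]
    keep = {value for value, _ in ranked[:num_groups]}
    return [i for i, v in enumerate(values) if v in keep]


def select_by_rank_range(values, lower_rank, upper_rank):
    """Indices whose ascending rank falls in [lower_rank, upper_rank)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    return sorted(order[lower_rank:upper_rank])


def select_random(n, select_num, seed=0):
    """Uniformly sample select_num indices out of n without replacement."""
    if n < 2 or select_num >= n:
        return list(range(n))  # nothing to do: keep everything
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), select_num))


langs = ["en", "en", "en", "zh", "zh", "fr"]
tagged = select_by_tags(langs, {"zh", "fr"})
frequent = select_by_frequency(langs, num_groups=1)
rare = select_by_frequency(langs, num_groups=1, most_common=False)
middle = select_by_rank_range([5, 1, 4, 2, 3], 1, 4)
sampled = select_random(len(langs), 2)
```

Each function returns indices rather than samples, so any of them can feed the same final subset-selection step.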
