Principle:Datajuicer Data juicer Data Selection

From Leeroopedia
Domains Data_Processing, Data_Curation
Last Updated 2026-02-14 17:00 GMT

Overview

A global dataset-level operation pattern that selects a subset of samples based on ranking, frequency, tag matching, range criteria, or random sampling, reducing dataset size while preserving desired characteristics.

Pattern

Selector operators extend the Selector base class and operate on the entire dataset to select a subset of samples. Unlike Filters (which make per-sample keep/discard decisions based on computed statistics), Selectors make global decisions that require knowledge of the full dataset distribution. The pattern follows:

1. Field Extraction -- Extract values from a specified field key across all samples, supporting dot-separated multi-level key paths for accessing nested fields (e.g., __dj__stats__.text_len).

2. Ranking/Matching -- Apply a selection strategy:

  * Top-K -- Use heapq.nlargest/heapq.nsmallest to select highest/lowest scoring samples
  * Frequency -- Count value frequencies and keep samples with the most/least common values
  * Range -- Select samples within percentile or rank bounds
  * Tags -- Keep samples matching a set of target tag values
  * Random -- Uniform random sampling by ratio or count

3. Subset Selection -- Use dataset.select(indices) to produce the filtered output dataset.
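The three steps above can be sketched for the Top-K strategy as follows. This is a minimal illustration on a plain list of dicts, not Data-Juicer's actual implementation: the helper names `get_nested` and `select_top_k` are hypothetical, and returning the kept samples in their original order is an assumption. With a dataset object, the final step would be `dataset.select(indices)` rather than a list comprehension.

```python
import heapq
from functools import reduce


def get_nested(sample, field_key):
    """Follow a dot-separated key path, e.g. '__dj__stats__.text_len'."""
    return reduce(lambda node, key: node[key], field_key.split("."), sample)


def select_top_k(samples, field_key, k, reverse=True):
    """Keep the k samples with the largest (or smallest) field value.

    Hypothetical helper for illustration only.
    """
    # Step 1: field extraction across all samples.
    values = [get_nested(sample, field_key) for sample in samples]

    # Step 2: ranking with a heap, which avoids sorting the whole dataset.
    pick = heapq.nlargest if reverse else heapq.nsmallest
    indices = pick(k, range(len(values)), key=values.__getitem__)

    # Step 3: subset selection. Sorting the indices keeps the original
    # sample order (an assumption made for this sketch).
    return [samples[i] for i in sorted(indices)]


samples = [
    {"text": "a", "__dj__stats__": {"text_len": 3}},
    {"text": "b", "__dj__stats__": {"text_len": 10}},
    {"text": "c", "__dj__stats__": {"text_len": 7}},
    {"text": "d", "__dj__stats__": {"text_len": 1}},
]
longest = select_top_k(samples, "__dj__stats__.text_len", k=2)
shortest = select_top_k(samples, "__dj__stats__.text_len", k=2, reverse=False)
```

Here `longest` holds the samples with `text_len` 10 and 7, and `shortest` holds those with `text_len` 3 and 1.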

Selection size is controlled via dual parameters: top_ratio (fraction) and topk/select_num (absolute count), using whichever yields fewer samples when both are provided.
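The min-of-both rule can be illustrated with a small helper. The function name `resolve_select_num` is hypothetical, and truncating the ratio-based count with `int()` is an assumption; the real operators may round differently.

```python
def resolve_select_num(total, top_ratio=None, topk=None):
    """Return how many samples to keep, or None if no size is specified.

    Hypothetical helper: when both parameters are given, the smaller
    resulting count wins.
    """
    if top_ratio is None and topk is None:
        return None  # no selection criterion provided

    candidates = []
    if top_ratio is not None:
        # Assumption: the ratio-based count is truncated toward zero.
        candidates.append(int(total * top_ratio))
    if topk is not None:
        candidates.append(int(topk))

    # Never ask for more samples than the dataset holds.
    return min(min(candidates), total)
```

For a dataset of 1000 samples, `top_ratio=0.1` with `topk=50` resolves to 50, while `top_ratio=0.01` with `topk=50` resolves to 10.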

Key Characteristics

  • Global dataset operation (requires access to all samples for ranking/distribution analysis)
  • Dot-separated multi-level field key access for nested statistics
  • Dual selection size control: ratio-based and count-based with min-of-both semantics
  • Multiple selection strategies: top-k, frequency, range, tags, random
  • Efficient algorithms: heap-based selection (O(n log k)), set-based membership testing
  • Configurable sort order (ascending/descending) for ranked selections
  • Dataset returned unchanged when it contains fewer than 2 samples or when no selection criteria are provided
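As a rough sketch of the frequency and tag strategies and of the early-return guard listed above, the functions below work on a list of already extracted field values and return the indices to keep. The names `select_by_frequency`, `select_by_tags`, and `top_n_values` are hypothetical and chosen for illustration; they are not taken from the library.

```python
from collections import Counter


def select_by_frequency(values, top_n_values=None, most_common=True):
    """Return indices of samples whose value is among the most
    (or least) frequent distinct values. Hypothetical helper."""
    # Guard: nothing to rank, or no criterion given, so keep everything.
    if len(values) < 2 or top_n_values is None:
        return list(range(len(values)))

    ranked = Counter(values).most_common()  # most frequent first
    if not most_common:
        ranked = ranked[::-1]  # least frequent first

    keep = {value for value, _ in ranked[:top_n_values]}
    return [i for i, value in enumerate(values) if value in keep]


def select_by_tags(values, target_tags):
    """Return indices of samples whose value matches a target tag."""
    targets = set(target_tags)  # set gives O(1) membership tests
    return [i for i, value in enumerate(values) if value in targets]


langs = ["en", "en", "zh", "en", "fr", "zh"]
most_frequent = select_by_frequency(langs, top_n_values=1)
least_frequent = select_by_frequency(langs, top_n_values=1, most_common=False)
tagged = select_by_tags(langs, {"zh", "fr"})
```

In this example `most_frequent` contains the indices of the three `"en"` samples, `least_frequent` the single `"fr"` sample, and `tagged` every `"zh"` or `"fr"` sample. The resulting index list would then be passed to `dataset.select(indices)`.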

Implementations

Related Pages
