Principle:Datajuicer Data juicer Data Grouping

From Leeroopedia
Domains Data_Processing, Data_Organization
Last Updated 2026-02-14 17:00 GMT

Overview

A batch assembly pattern that transforms individual dataset samples into grouped batches for downstream aggregate processing, and reverses the grouping back to individual samples afterward.

Pattern

Grouper operators extend the Grouper base class and transform datasets between per-sample and per-batch representations. The pattern supports three complementary operations:

1. Naive Grouping -- Combines all samples into a single batch, converting a list of dictionaries into a single dictionary of lists via convert_list_dict_to_dict_list. This is the simplest grouping for global aggregation.

2. Key-Value Grouping -- Groups samples by shared values in one or more specified fields (supporting nested key access via dot notation). Samples with identical key-value combinations are hashed together and placed in the same batch, enabling attribute-based partitioning.

3. Reverse Grouping -- Inverts the grouping: splits each batched dict of lists back into individual dict records via convert_dict_list_to_list_dict, optionally exporting batch-level metadata to a JSON Lines file.
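
The three operations above can be sketched in plain Python. This is a minimal illustration, not Data-Juicer's implementation: list_dict_to_dict_list and dict_list_to_list_dict are stand-ins written here to mimic what the named conversion helpers are described as doing, and get_nested, group_by_keys, and the sample fields are hypothetical names chosen for the example.

```python
from collections import defaultdict


def list_dict_to_dict_list(samples):
    """Stand-in for convert_list_dict_to_dict_list: list of dicts -> dict of lists."""
    keys = []
    for sample in samples:
        for key in sample:
            if key not in keys:
                keys.append(key)
    # Missing fields are filled with None so every list has the same length.
    return {key: [sample.get(key) for sample in samples] for key in keys}


def dict_list_to_list_dict(batch):
    """Stand-in for convert_dict_list_to_list_dict: dict of lists -> list of dicts."""
    keys = list(batch)
    size = len(batch[keys[0]]) if keys else 0
    return [{key: batch[key][i] for key in keys} for i in range(size)]


def get_nested(sample, dotted_key):
    """Hypothetical helper: resolve 'meta.lang' to sample['meta']['lang']."""
    value = sample
    for part in dotted_key.split("."):
        value = value[part]
    return value


def group_by_keys(samples, keys):
    """Key-value grouping: one batch per distinct combination of key values.

    Assumes the grouped values are hashable (strings, numbers, tuples).
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[tuple(get_nested(sample, key) for key in keys)].append(sample)
    return [list_dict_to_dict_list(group) for group in groups.values()]


samples = [
    {"text": "a", "meta": {"lang": "en"}},
    {"text": "b", "meta": {"lang": "fr"}},
    {"text": "c", "meta": {"lang": "en"}},
]

naive_batch = list_dict_to_dict_list(samples)          # naive grouping: one batch
keyed_batches = group_by_keys(samples, ["meta.lang"])  # one batch per language
restored = dict_list_to_list_dict(naive_batch)         # reverse grouping
```

In this sketch the naive batch holds all three texts under one key, the key-value grouping yields one batch for "en" and one for "fr", and reversing the naive batch returns the original list of records.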

Groupers serve as the bridge between per-sample operators (Filters, Mappers) and batch-level operators (Aggregators), forming a Group -> Aggregate -> Ungroup pipeline pattern.
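
The Group -> Aggregate -> Ungroup flow can be illustrated with a small, self-contained sketch. All names here (group, aggregate, ungroup, and the source and num_tokens fields) are hypothetical and chosen for the example; the aggregate step simply attaches a per-group mean and is not one of Data-Juicer's Aggregators.

```python
from collections import defaultdict


def group(samples, key):
    """Group step: build one dict-of-lists batch per distinct value of `key`."""
    buckets = defaultdict(list)
    for sample in samples:
        buckets[sample[key]].append(sample)
    return [
        {field: [s[field] for s in bucket] for field in bucket[0]}
        for bucket in buckets.values()
    ]


def aggregate(batch):
    """Aggregate step (illustrative): attach the group's mean token count."""
    counts = batch["num_tokens"]
    mean = sum(counts) / len(counts)
    out = dict(batch)
    # Repeat the batch-level value so every list keeps the same length.
    out["group_mean_tokens"] = [mean] * len(counts)
    return out


def ungroup(batches):
    """Ungroup step: split each batch back into per-sample records."""
    records = []
    for batch in batches:
        size = len(next(iter(batch.values())))
        for i in range(size):
            records.append({field: values[i] for field, values in batch.items()})
    return records


samples = [
    {"text": "a", "source": "web", "num_tokens": 10},
    {"text": "b", "source": "book", "num_tokens": 40},
    {"text": "c", "source": "web", "num_tokens": 30},
]

processed = ungroup([aggregate(batch) for batch in group(samples, "source")])
```

After the round trip each record carries a value computed from its whole group. Note that in this sketch the output follows group order rather than the original sample order.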

Key Characteristics

  • Transforms between per-sample and per-batch data representations
  • Supports naive grouping (all samples in one batch) and attribute-based partitioning
  • Nested key access via dot notation for grouping by computed statistics or metadata
  • Reversible: NaiveReverseGrouper undoes the batching with optional metadata export
  • Registered via @OPERATORS.register_module() and configured through YAML
  • Essential prerequisite for Aggregator operators that process related samples together
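
The registration and configuration points can be pictured with a toy decorator registry. This is a sketch under stated assumptions, not Data-Juicer's code: the Registry class, the build method, the name argument passed to register_module, the ToyNaiveGrouper class, and the shape of the parsed configuration are all hypothetical.

```python
class Registry:
    """Hypothetical stand-in for an operator registry."""

    def __init__(self):
        self._modules = {}

    def register_module(self, name):
        """Return a class decorator that records the class under `name`."""
        def decorator(cls):
            self._modules[name] = cls
            return cls
        return decorator

    def build(self, name, **params):
        """Instantiate the class registered under `name` (illustrative helper)."""
        return self._modules[name](**params)


OPERATORS = Registry()


class Grouper:
    """Toy base class: subclasses turn a list of samples into a list of batches."""

    def process(self, samples):
        raise NotImplementedError


@OPERATORS.register_module("toy_naive_grouper")
class ToyNaiveGrouper(Grouper):
    def process(self, samples):
        if not samples:
            return []
        # One batch holding every sample, as a dict of lists.
        return [{key: [s[key] for s in samples] for key in samples[0]}]


# Assumed shape of a process list after a YAML file has been parsed.
config = [{"toy_naive_grouper": {}}]
ops = [
    OPERATORS.build(name, **params)
    for entry in config
    for name, params in entry.items()
]
batches = ops[0].process([{"text": "a"}, {"text": "b"}])
```

The point of the pattern is that a configuration file only needs to name an operator; the registry resolves that name to the class that was decorated at import time.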
