Principle:Datajuicer Data juicer Data Grouping

Domains	Data_Processing, Data_Organization
Last Updated	2026-02-14 17:00 GMT

Overview

A batch assembly pattern that transforms individual dataset samples into grouped batches for downstream aggregate processing, and reverses the grouping back to individual samples afterward.

Pattern

Grouper operators extend the Grouper base class and transform datasets between per-sample and per-batch representations. The pattern supports three complementary operations:

1. Naive Grouping -- Combines all samples into a single batch, converting a list of dictionaries into a single dictionary of lists via convert_list_dict_to_dict_list. This is the simplest grouping for global aggregation.

2. Key-Value Grouping -- Groups samples by the values they hold in one or more specified fields (nested fields are addressed with dot notation). Samples whose values match on every specified key hash to the same group and are placed in the same batch, enabling attribute-based partitioning.

3. Reverse Grouping -- The inverse operation that splits batched dict-of-lists back into individual dict records via convert_dict_list_to_list_dict, with optional export of batch-level metadata to JSON lines files.
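
The three operations above can be sketched in plain Python. This is an illustrative sketch, not Data-Juicer's implementation: the helper names (to_dict_of_lists, to_list_of_dicts, get_nested, group_by_keys) are hypothetical stand-ins for the library's own utilities, and the sketch assumes every sample has the same top-level keys and that grouping values are hashable.

```python
from collections import defaultdict
from typing import Any


def to_dict_of_lists(samples: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Naive grouping: merge a list of dicts into one dict of lists."""
    batch: dict[str, list[Any]] = defaultdict(list)
    for sample in samples:
        for key, value in sample.items():
            batch[key].append(value)
    return dict(batch)


def to_list_of_dicts(batch: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Reverse grouping: split a dict of lists back into individual dicts."""
    keys = list(batch)
    return [dict(zip(keys, values)) for values in zip(*batch.values())]


def get_nested(sample: dict[str, Any], dotted_key: str) -> Any:
    """Resolve a dot-notation key such as 'meta.lang' inside a nested dict."""
    value: Any = sample
    for part in dotted_key.split("."):
        value = value[part]
    return value


def group_by_keys(
    samples: list[dict[str, Any]], group_keys: list[str]
) -> list[dict[str, list[Any]]]:
    """Key-value grouping: one batch per distinct combination of key values."""
    groups: dict[tuple, list[dict[str, Any]]] = defaultdict(list)
    for sample in samples:
        # The tuple of looked-up values acts as the hashable group signature.
        signature = tuple(get_nested(sample, key) for key in group_keys)
        groups[signature].append(sample)
    return [to_dict_of_lists(members) for members in groups.values()]


samples = [
    {"text": "a", "meta": {"lang": "en"}},
    {"text": "b", "meta": {"lang": "zh"}},
    {"text": "c", "meta": {"lang": "en"}},
]

single_batch = to_dict_of_lists(samples)        # all samples in one batch
batches = group_by_keys(samples, ["meta.lang"])  # one batch per language
restored = [s for b in batches for s in to_list_of_dicts(b)]
```

Note that ungrouping after key-value grouping returns the samples in group order rather than the original order, so any step that depends on sample order should not assume it survives the round trip.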

Groupers serve as the bridge between per-sample operators (Filters, Mappers) and batch-level operators (Aggregators), forming a Group -> Aggregate -> Ungroup pipeline pattern.
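
As a rough illustration of the Group -> Aggregate -> Ungroup flow, the sketch below chains a per-sample filter, a grouping step, a batch-level aggregation, and an ungrouping step. All function names and the mean_length statistic are hypothetical and chosen for the example; they are not Data-Juicer operators. The sketch also assumes that list-valued fields in a batch are per-sample and that scalar fields are batch-level metadata, which is returned separately so it could be written out (for example as JSON Lines).

```python
from collections import defaultdict
from typing import Any


def group_by(samples: list[dict[str, Any]], key: str) -> list[dict[str, Any]]:
    """Group: one dict-of-lists batch per distinct value of `key`."""
    buckets: dict[Any, list[dict[str, Any]]] = defaultdict(list)
    for sample in samples:
        buckets[sample[key]].append(sample)
    return [
        {field: [m[field] for m in members] for field in members[0]}
        for members in buckets.values()
    ]


def aggregate_mean_length(batch: dict[str, Any]) -> dict[str, Any]:
    """Aggregate: attach a batch-level statistic computed over the whole batch."""
    lengths = [len(text) for text in batch["text"]]
    return {**batch, "mean_length": sum(lengths) / len(lengths)}


def ungroup(
    batches: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Ungroup: rebuild per-sample dicts and collect batch-level metadata."""
    samples: list[dict[str, Any]] = []
    batch_meta: list[dict[str, Any]] = []
    for batch in batches:
        # Assumption: list-valued fields are per-sample, scalars are batch-level.
        per_sample = {k: v for k, v in batch.items() if isinstance(v, list)}
        batch_meta.append({k: v for k, v in batch.items() if not isinstance(v, list)})
        keys = list(per_sample)
        samples.extend(dict(zip(keys, values)) for values in zip(*per_sample.values()))
    return samples, batch_meta


raw = [
    {"text": "hello", "lang": "en"},
    {"text": "", "lang": "en"},
    {"text": "hi", "lang": "en"},
    {"text": "你好", "lang": "zh"},
]

kept = [s for s in raw if s["text"]]                       # per-sample filter
batches = [aggregate_mean_length(b) for b in group_by(kept, "lang")]
restored, meta = ungroup(batches)                          # back to per-sample
```

Keeping the batch-level statistics separate from the restored samples means later per-sample operators see the same record shape they saw before grouping.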

Key Characteristics

Transforms between per-sample and per-batch data representations
Supports identity grouping (all samples in one batch) and attribute-based partitioning
Nested key access via dot notation for grouping by computed statistics or metadata
Reversible: NaiveReverseGrouper undoes the batching with optional metadata export
Registered via @OPERATORS.register_module() and configured through YAML
Essential prerequisite for Aggregator operators that process related samples together
