Principle:Datajuicer Data juicer Data Grouping
| Domains | Data_Processing, Data_Organization |
|---|---|
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A batch assembly pattern that transforms individual dataset samples into grouped batches for downstream aggregate processing, and reverses the grouping back to individual samples afterward.
Pattern
Grouper operators extend the Grouper base class and transform datasets between per-sample and per-batch representations. The pattern supports three complementary operations:
1. Naive Grouping -- Combines all samples into a single batch, converting a list of dictionaries into a single dictionary of lists via convert_list_dict_to_dict_list. This is the simplest grouping for global aggregation.
2. Key-Value Grouping -- Groups samples by shared values in one or more specified fields (supporting nested key access via dot notation). Samples with identical key-value combinations are hashed together and placed in the same batch, enabling attribute-based partitioning.
3. Reverse Grouping -- The inverse operation that splits batched dict-of-lists back into individual dict records via convert_dict_list_to_list_dict, with optional export of batch-level metadata to JSON lines files.
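The three operations above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Data-Juicer's implementation: the helper names (list_of_dicts_to_dict_of_lists, dict_of_lists_to_list_of_dicts, get_nested, group_by_key_values) are hypothetical stand-ins, every sample is assumed to share the same keys, and the grouped field values are assumed to be hashable.

```python
from typing import Any, Dict, List


def list_of_dicts_to_dict_of_lists(samples: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Naive grouping: fold all samples into one column-wise batch."""
    if not samples:
        return {}
    # Assumption: all samples share the keys of the first sample.
    return {key: [sample[key] for sample in samples] for key in samples[0]}


def dict_of_lists_to_list_of_dicts(batch: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Reverse grouping: split one column-wise batch back into samples."""
    if not batch:
        return []
    length = len(next(iter(batch.values())))
    return [{key: values[i] for key, values in batch.items()} for i in range(length)]


def get_nested(sample: Dict[str, Any], dotted_key: str) -> Any:
    """Resolve a dot-notation key such as 'meta.lang' inside a nested dict."""
    value: Any = sample
    for part in dotted_key.split("."):
        value = value[part]
    return value


def group_by_key_values(samples: List[Dict[str, Any]], keys: List[str]) -> List[Dict[str, List[Any]]]:
    """Key-value grouping: samples with equal key-value combinations share a batch."""
    buckets: Dict[tuple, List[Dict[str, Any]]] = {}
    for sample in samples:
        # The tuple of looked-up values is the hash key for the group.
        group_id = tuple(get_nested(sample, key) for key in keys)
        buckets.setdefault(group_id, []).append(sample)
    return [list_of_dicts_to_dict_of_lists(bucket) for bucket in buckets.values()]


samples = [
    {"text": "a", "meta": {"lang": "en"}},
    {"text": "b", "meta": {"lang": "zh"}},
    {"text": "c", "meta": {"lang": "en"}},
]

naive_batch = list_of_dicts_to_dict_of_lists(samples)      # one batch with three rows
grouped = group_by_key_values(samples, ["meta.lang"])      # one batch per language
restored = dict_of_lists_to_list_of_dicts(naive_batch)     # back to three samples
```

In this sketch, converting to a batch and back returns the original samples unchanged, while key-value grouping yields one batch for "en" (texts "a" and "c") and one for "zh" (text "b").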
Groupers serve as the bridge between per-sample operators (Filters, Mappers) and batch-level operators (Aggregators), forming a Group -> Aggregate -> Ungroup pipeline pattern.
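As a rough illustration of the Group -> Aggregate -> Ungroup flow, the following self-contained sketch groups samples by a field, computes a per-batch statistic, and then splits the batches back into samples. The functions group, aggregate, and ungroup, and the fields lang, score, and group_mean_score, are hypothetical names chosen for this example; they do not reflect the signatures of Data-Juicer's Grouper or Aggregator classes.

```python
from statistics import mean
from typing import Any, Dict, List

Sample = Dict[str, Any]
Batch = Dict[str, List[Any]]


def group(samples: List[Sample], key: str) -> List[Batch]:
    """Group step: bucket samples by one field, then store each bucket column-wise."""
    buckets: Dict[Any, List[Sample]] = {}
    for sample in samples:
        buckets.setdefault(sample[key], []).append(sample)
    return [{k: [s[k] for s in bucket] for k in bucket[0]} for bucket in buckets.values()]


def aggregate(batch: Batch) -> Batch:
    """Aggregate step (hypothetical): attach the batch mean score to every row."""
    average = mean(batch["score"])
    result = dict(batch)
    result["group_mean_score"] = [average] * len(batch["score"])
    return result


def ungroup(batches: List[Batch]) -> List[Sample]:
    """Ungroup step: expand each column-wise batch back into individual samples."""
    samples: List[Sample] = []
    for batch in batches:
        length = len(next(iter(batch.values())))
        samples.extend({k: v[i] for k, v in batch.items()} for i in range(length))
    return samples


samples = [
    {"lang": "en", "score": 1.0},
    {"lang": "zh", "score": 4.0},
    {"lang": "en", "score": 3.0},
]

result = ungroup([aggregate(batch) for batch in group(samples, "lang")])
```

Note that in this sketch the output order follows the groups rather than the input: both "en" samples come first, each carrying a group_mean_score of 2.0, followed by the "zh" sample with 4.0.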
Key Characteristics
- Transforms between per-sample and per-batch data representations
- Supports naive grouping (all samples in one batch) and attribute-based partitioning
- Nested key access via dot notation for grouping by computed statistics or metadata
- Reversible: NaiveReverseGrouper undoes the batching with optional metadata export
- Registered via @OPERATORS.register_module() and configured through YAML
- Essential prerequisite for Aggregator operators that process related samples together
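The decorator-based registration mentioned above follows a common registry pattern, which the toy sketch below illustrates. Everything here is an assumption made for illustration: the Registry class, the ToyNaiveGrouper operator, the build_ops helper, and the shape of the parsed configuration (a list of single-key mappings from operator name to parameters) are hypothetical and are not Data-Juicer's actual classes or configuration schema.

```python
from typing import Any, Callable, Dict, List, Optional


class Registry:
    """Toy name-to-class registry (hypothetical, not Data-Juicer's Registry)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._modules: Dict[str, type] = {}

    def register_module(self, module_name: Optional[str] = None) -> Callable[[type], type]:
        def decorator(cls: type) -> type:
            key = module_name or cls.__name__
            if key in self._modules:
                raise KeyError(f"{key} is already registered in {self.name}")
            self._modules[key] = cls
            return cls  # the class itself is returned unchanged

        return decorator

    def get(self, module_name: str) -> type:
        return self._modules[module_name]


OPERATORS = Registry("operators")


@OPERATORS.register_module("toy_naive_grouper")
class ToyNaiveGrouper:
    """Hypothetical grouper that puts every sample into a single batch."""

    def process(self, samples: List[Dict[str, Any]]) -> List[Dict[str, List[Any]]]:
        if not samples:
            return []
        return [{key: [s[key] for s in samples] for key in samples[0]}]


def build_ops(process_config: List[Dict[str, Dict[str, Any]]]) -> List[Any]:
    """Instantiate operators from an already-parsed config (assumed shape)."""
    ops = []
    for entry in process_config:
        for op_name, params in entry.items():
            ops.append(OPERATORS.get(op_name)(**(params or {})))
    return ops


# Stand-in for what a YAML loader might return for a one-operator pipeline.
parsed_config = [{"toy_naive_grouper": {}}]
ops = build_ops(parsed_config)
batches = ops[0].process([{"a": 1}, {"a": 2}])
```

The design point is that the configuration only names operators as strings; the registry resolves each name to a class, so new groupers can be added without changing the code that builds the pipeline.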
Implementations
- Implementation:Datajuicer_Data_juicer_NaiveGrouper
- Implementation:Datajuicer_Data_juicer_NaiveReverseGrouper
- Implementation:Datajuicer_Data_juicer_KeyValueGrouper