Implementation:Datajuicer Data juicer KeyValueGrouper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Grouping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for grouping samples into batches by shared key values.
Description
KeyValueGrouper extends Grouper and groups dataset samples into batches by the values they share in one or more specified keys. Nested keys are supported via dot notation (e.g., "__dj__stats__.text_len"). For each sample, it extracts the values of the specified group_by_keys (using nested_access for dotted paths), hashes the resulting key-value dictionary with dict_to_hash, and places samples with identical hashes in the same group. Each group is then converted from a list of dicts to a batched dict of lists via convert_list_dict_to_dict_list. If no keys are provided, it defaults to grouping by the text key.
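The grouping steps described above can be sketched in plain Python. This is a minimal illustration, not the Data-Juicer source: `nested_get`, `hash_key_values`, and `list_dict_to_dict_list` are hypothetical stand-ins for `nested_access`, `dict_to_hash`, and `convert_list_dict_to_dict_list`, and the MD5-over-JSON hash is an assumption chosen only to make the example self-contained.

```python
import hashlib
import json
from typing import Any, Dict, List


def nested_get(sample: Dict[str, Any], dotted_key: str) -> Any:
    """Hypothetical stand-in for nested_access: walk a dotted key path."""
    value: Any = sample
    for part in dotted_key.split("."):
        value = value[part]
    return value


def hash_key_values(key_values: Dict[str, Any]) -> str:
    """Hypothetical stand-in for dict_to_hash: stable digest of a dict."""
    payload = json.dumps(key_values, sort_keys=True, default=str)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def list_dict_to_dict_list(samples: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Hypothetical stand-in for convert_list_dict_to_dict_list."""
    return {key: [s[key] for s in samples] for key in samples[0].keys()}


def group_by_key_values(
    samples: List[Dict[str, Any]], group_by_keys: List[str]
) -> List[Dict[str, List[Any]]]:
    """Group samples whose values for group_by_keys are identical."""
    groups: Dict[str, List[Dict[str, Any]]] = {}
    for sample in samples:
        # Collect the grouping values, resolving dotted paths as nested keys.
        key_values = {k: nested_get(sample, k) for k in group_by_keys}
        # Samples with the same key-value dict get the same digest.
        groups.setdefault(hash_key_values(key_values), []).append(sample)
    # Turn each group (list of dicts) into one batched dict of lists.
    return [list_dict_to_dict_list(group) for group in groups.values()]


samples = [
    {"text": "a", "category": "news", "__dj__stats__": {"lang": "en"}},
    {"text": "b", "category": "blog", "__dj__stats__": {"lang": "en"}},
    {"text": "c", "category": "news", "__dj__stats__": {"lang": "en"}},
]
batches = group_by_key_values(samples, ["category", "__dj__stats__.lang"])
```

Here the first and third samples share both key values and land in one batch, while the second sample forms a batch of its own.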
Usage
Use when you need to group data samples by specific attributes or features as a prerequisite for aggregation operations that process related samples together, such as grouping by entity, topic, or source before summarization.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/grouper/key_value_grouper.py
Signature
@OPERATORS.register_module("key_value_grouper")
class KeyValueGrouper(Grouper):
    def __init__(self, group_by_keys: Optional[List[str]] = None,
                 *args, **kwargs):
Import
from data_juicer.ops.grouper.key_value_grouper import KeyValueGrouper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| group_by_keys | List[str] | No | Keys to group samples by. Supports nested keys like "__dj__stats__.text_len". Default: [text_key] |
Outputs
| Name | Type | Description |
|---|---|---|
| batched_samples | list of dict | List of batched samples; each batch is a dict of lists holding the samples that share the same values for group_by_keys |
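To make the output shape concrete, the sketch below shows how a downstream step might consume such batches. The `batched_samples` literal is hand-written for illustration rather than produced by the operator, and `summarize` is a hypothetical helper, not part of Data-Juicer.

```python
from typing import Any, Dict, List

# Hand-written stand-in for the grouper's output: one dict of lists per group.
batched_samples: List[Dict[str, List[Any]]] = [
    {"text": ["a", "c"], "category": ["news", "news"]},
    {"text": ["b"], "category": ["blog"]},
]


def summarize(batch: Dict[str, List[Any]]) -> Dict[str, Any]:
    """Hypothetical aggregation: one summary record per batch."""
    return {
        # Every sample in a batch shares the grouping value, so take the first.
        "category": batch["category"][0],
        "size": len(batch["text"]),
        "joined_text": " ".join(batch["text"]),
    }


summaries = [summarize(batch) for batch in batched_samples]
```

Because all lists in a batch have the same length and are index-aligned, position i across the lists corresponds to one original sample.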
Usage Examples
process:
  - key_value_grouper:
      group_by_keys: ["category", "__dj__stats__.lang"]