Implementation:Datajuicer Data juicer KeyValueGrouper

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Processing, Grouping
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for grouping samples into batches based on shared key values.

Description

KeyValueGrouper extends Grouper and groups dataset samples into batches based on shared values in one or more specified keys, supporting nested key access via dot notation (e.g., "__dj__stats__.text_len"). It iterates over all samples, extracts the values of the specified group_by_keys using nested_access for dotted paths, hashes the combined key-value dictionary via dict_to_hash, and groups samples with identical hashes together. Each group is converted from a list of dicts to a batched dict-of-lists via convert_list_dict_to_dict_list. If no keys are provided, it defaults to using the text key.
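The grouping steps described above can be sketched in plain Python. This is a minimal illustration, not the Data-Juicer source: nested_get, hash_key_values, and list_dict_to_dict_list are hypothetical stand-ins for nested_access, dict_to_hash, and convert_list_dict_to_dict_list, and the JSON plus SHA-256 hashing is an assumption chosen only to obtain a stable grouping key.

```python
import hashlib
import json
from typing import Any, Dict, List


def nested_get(sample: Dict[str, Any], dotted_key: str) -> Any:
    """Hypothetical stand-in for nested_access: walk a dotted path."""
    value: Any = sample
    for part in dotted_key.split("."):
        value = value[part]
    return value


def hash_key_values(key_values: Dict[str, Any]) -> str:
    """Hypothetical stand-in for dict_to_hash: stable digest of a dict."""
    payload = json.dumps(key_values, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def list_dict_to_dict_list(samples: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Hypothetical stand-in for convert_list_dict_to_dict_list.

    Assumes every sample in the group has the same top-level keys.
    """
    batch: Dict[str, List[Any]] = {}
    for sample in samples:
        for key, value in sample.items():
            batch.setdefault(key, []).append(value)
    return batch


def group_by_key_values(
    samples: List[Dict[str, Any]], group_by_keys: List[str]
) -> List[Dict[str, List[Any]]]:
    """Group samples whose values for group_by_keys are identical."""
    groups: Dict[str, List[Dict[str, Any]]] = {}
    for sample in samples:
        # Extract the grouping values, following dotted paths where needed.
        key_values = {key: nested_get(sample, key) for key in group_by_keys}
        # Samples with the same digest land in the same group.
        groups.setdefault(hash_key_values(key_values), []).append(sample)
    # Convert each group from a list of dicts to a dict of lists.
    return [list_dict_to_dict_list(group) for group in groups.values()]


samples = [
    {"text": "a", "category": "news", "__dj__stats__": {"lang": "en"}},
    {"text": "b", "category": "blog", "__dj__stats__": {"lang": "en"}},
    {"text": "c", "category": "news", "__dj__stats__": {"lang": "en"}},
]
batches = group_by_key_values(samples, ["category", "__dj__stats__.lang"])
```

With this sample data, the two "news" samples end up in one batch and the "blog" sample in another, each stored as a dict of lists.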

Usage

Use this operator when samples must be grouped by specific attributes or features before an aggregation operation that processes related samples together, for example grouping by entity, topic, or source before summarization.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/grouper/key_value_grouper.py

Signature

@OPERATORS.register_module("key_value_grouper")
class KeyValueGrouper(Grouper):
    def __init__(self, group_by_keys: Optional[List[str]] = None,
                 *args, **kwargs):

Import

from data_juicer.ops.grouper.key_value_grouper import KeyValueGrouper

I/O Contract

Inputs

Name	Type	Required	Description
group_by_keys	List[str]	No	Keys to group samples by. Supports nested keys like "__dj__stats__.text_len". Default: [text_key]

Outputs

Name	Type	Description
batched_samples	list of dict	List of batched samples, where each batch is a dict of lists holding the samples that share the same key values
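Because each output batch is a dict of lists rather than a list of samples, downstream code usually needs to read the batch size or expand a batch back into individual samples. The following is a small sketch of both; batch_size and unbatch are hypothetical helper names, not part of Data-Juicer, and the sample data is made up for illustration.

```python
from typing import Any, Dict, List


def batch_size(batch: Dict[str, List[Any]]) -> int:
    """Number of samples in a dict-of-lists batch (0 for an empty batch)."""
    return len(next(iter(batch.values()), []))


def unbatch(batch: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Expand a dict-of-lists batch back into a list of per-sample dicts."""
    keys = list(batch)
    columns = (batch[key] for key in keys)
    return [dict(zip(keys, values)) for values in zip(*columns)]


# Illustrative output in the shape described by the Outputs table.
batched_samples = [
    {"text": ["a", "c"], "category": ["news", "news"]},
    {"text": ["b"], "category": ["blog"]},
]

sizes = [batch_size(batch) for batch in batched_samples]
first_group = unbatch(batched_samples[0])
```

Here sizes holds one count per group, and first_group is the first batch restored to a list of ordinary sample dicts.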

Usage Examples

process:
  - key_value_grouper:
      group_by_keys: ["category", "__dj__stats__.lang"]

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
