Implementation:Datajuicer Data juicer KeyValueGrouper

From Leeroopedia
Knowledge Sources
Domains Data_Processing, Grouping
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for grouping samples into batches based on shared key values.

Description

KeyValueGrouper extends Grouper and groups dataset samples into batches based on shared values in one or more specified keys, supporting nested key access via dot notation (e.g., "__dj__stats__.text_len"). It iterates over all samples, extracts the values of the specified group_by_keys using nested_access for dotted paths, hashes the combined key-value dictionary via dict_to_hash, and groups samples with identical hashes together. Each group is converted from a list of dicts to a batched dict-of-lists via convert_list_dict_to_dict_list. If no keys are provided, it defaults to using the text key.
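
The grouping steps described above can be sketched in plain Python. This is a minimal illustration, not Data-Juicer's source: `nested_get`, `hash_key_values`, and `list_of_dicts_to_dict_of_lists` are hypothetical stand-ins for `nested_access`, `dict_to_hash`, and `convert_list_dict_to_dict_list`, and the hashing scheme (MD5 over sorted JSON) is an assumption chosen only to make the example self-contained.

```python
import hashlib
import json
from collections import defaultdict


def nested_get(sample, dotted_key):
    """Hypothetical stand-in for nested_access: walk a dotted path."""
    value = sample
    for part in dotted_key.split("."):
        value = value[part]
    return value


def hash_key_values(key_values):
    """Hypothetical stand-in for dict_to_hash: stable digest of a dict."""
    payload = json.dumps(key_values, sort_keys=True, default=str)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def list_of_dicts_to_dict_of_lists(samples):
    """Hypothetical stand-in for convert_list_dict_to_dict_list."""
    batched = defaultdict(list)
    for sample in samples:
        for key, value in sample.items():
            batched[key].append(value)
    return dict(batched)


def group_by_key_values(samples, group_by_keys):
    """Group samples whose values for group_by_keys are identical."""
    groups = defaultdict(list)
    for sample in samples:
        # Collect the values of every grouping key for this sample.
        key_values = {k: nested_get(sample, k) for k in group_by_keys}
        # Samples with the same key-value dict share a hash, hence a group.
        groups[hash_key_values(key_values)].append(sample)
    # Convert each group from a list of dicts to a dict of lists.
    return [list_of_dicts_to_dict_of_lists(g) for g in groups.values()]


samples = [
    {"text": "a", "category": "news", "__dj__stats__": {"lang": "en"}},
    {"text": "b", "category": "blog", "__dj__stats__": {"lang": "en"}},
    {"text": "c", "category": "news", "__dj__stats__": {"lang": "en"}},
]
batches = group_by_key_values(samples, ["category", "__dj__stats__.lang"])
```

Here the two "news" samples land in one batch and the "blog" sample in another, because grouping uses the combined values of all listed keys rather than any single key.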

Usage

Use when you need to group data samples by specific attributes or features as a prerequisite for aggregation operations that process related samples together, such as grouping by entity, topic, or source before summarization.

Code Reference

Source Location

Signature

@OPERATORS.register_module("key_value_grouper")
class KeyValueGrouper(Grouper):
    def __init__(self, group_by_keys: Optional[List[str]] = None,
                 *args, **kwargs):

Import

from data_juicer.ops.grouper.key_value_grouper import KeyValueGrouper

I/O Contract

Inputs

Name | Type | Required | Description
group_by_keys | List[str] | No | Keys to group samples by. Supports nested keys like "__dj__stats__.text_len". Default: [text_key]

Outputs

Name | Type | Description
batched_samples | list of dict | List of batched samples where each batch contains samples sharing the same key values
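
To make the output shape concrete, the sketch below shows what `batched_samples` might look like and how a downstream step could consume it. The sample data and the helpers `batch_sizes` and `join_texts` are hypothetical and exist only for illustration; they are not part of Data-Juicer.

```python
# Illustrative output for three samples grouped by "category":
# each batch is a dict of lists, one list entry per grouped sample.
batched_samples = [
    {"text": ["a", "c"], "category": ["news", "news"]},
    {"text": ["b"], "category": ["blog"]},
]


def batch_sizes(batches, text_key="text"):
    """Hypothetical helper: number of samples in each batch."""
    return [len(batch[text_key]) for batch in batches]


def join_texts(batch, text_key="text", sep="\n"):
    """Hypothetical helper: merge a batch's texts, e.g. before summarization."""
    return sep.join(batch[text_key])


sizes = batch_sizes(batched_samples)
merged = [join_texts(batch) for batch in batched_samples]
```

Because every field in a batch is a list aligned by position, an aggregation step can read related samples together without regrouping them.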

Usage Examples

process:
  - key_value_grouper:
      group_by_keys: ["category", "__dj__stats__.lang"]
