Implementation:Datajuicer Data juicer NestedAggregator

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Processing, Aggregation
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for recursively aggregating multiple document fragments into a single summary.

Description

NestedAggregator reads sub-documents from the sample metadata (default key: event_description), formats each one with a sub-document template, and splits them into groups that fit within the token limit using avg_split_string_list_under_limit. It then calls the LLM (default: gpt-4o) to summarize each group. If the result is still a list of summaries, the operator recurses on that list until a single summary remains, keeping the summary at approximately the average length of the original fragments. The prompts are written in Chinese and include a "Journey to the West" example that demonstrates the summarization style. The operator serves as a core aggregation building block for other aggregators, such as EntityAttributeAggregator, to handle documents that exceed token limits.
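The split-summarize-recurse loop described above can be sketched in plain Python. This is an illustration under stated assumptions, not the operator's code: count_tokens, split_under_limit, nested_summarize, and first_words are hypothetical names, the whitespace tokenizer and the toy summarizer stand in for the model's tokenizer and the LLM call, and the greedy packing is a simplification of whatever avg_split_string_list_under_limit does. Merging neighbouring groups when no packing is possible is this sketch's own way of guaranteeing progress.

```python
from typing import Callable, List, Optional


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def split_under_limit(docs: List[str], max_tokens: Optional[int]) -> List[List[str]]:
    # Greedily pack consecutive documents into groups whose total token count
    # stays within max_tokens. None means "no limit", so everything is one group.
    if max_tokens is None:
        return [list(docs)]
    groups: List[List[str]] = []
    current: List[str] = []
    used = 0
    for doc in docs:
        n = count_tokens(doc)
        if current and used + n > max_tokens:
            groups.append(current)
            current, used = [], 0
        current.append(doc)  # an oversized document still gets its own group
        used += n
    if current:
        groups.append(current)
    return groups


def nested_summarize(docs: List[str],
                     summarize: Callable[[List[str]], str],
                     max_tokens: Optional[int] = None) -> str:
    # Base cases: nothing to summarize, or a single summary remains.
    if not docs:
        return ""
    if len(docs) == 1:
        return docs[0]
    groups = split_under_limit(docs, max_tokens)
    if len(groups) == len(docs):
        # Every document ended up alone, so merge neighbours pairwise;
        # otherwise the recursion would never shrink the list.
        groups = [sum(groups[i:i + 2], []) for i in range(0, len(groups), 2)]
    summaries = [summarize(group) for group in groups]
    return nested_summarize(summaries, summarize, max_tokens)


def first_words(group: List[str]) -> str:
    # Toy summarizer standing in for the LLM: keep the first word of each document.
    return " ".join(doc.split()[0] for doc in group)


fragments = ["alpha one two", "beta three four", "gamma five six", "delta seven eight"]
result = nested_summarize(fragments, first_words, max_tokens=6)
print(result)  # alpha gamma
```

With a limit of 6 tokens, the four three-token fragments form two groups, each group is reduced to one summary, and the two summaries fit into a single group on the second pass.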

Usage

Use this operator when you need to aggregate multiple document fragments or text samples into a single coherent summary with a recursive, map-reduce-style summarization strategy, especially when the combined text exceeds the LLM's token limit.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/aggregator/nested_aggregator.py

Signature

@OPERATORS.register_module("nested_aggregator")
class NestedAggregator(Aggregator):
    def __init__(self, api_model: str = "gpt-4o",
                 input_key: str = MetaKeys.event_description,
                 output_key: str = None,
                 max_token_num: Optional[PositiveInt] = None,
                 *, api_endpoint: Optional[str] = None,
                 response_path: Optional[str] = None,
                 system_prompt: Optional[str] = None,
                 sub_doc_template: Optional[str] = None,
                 input_template: Optional[str] = None,
                 try_num: PositiveInt = 3,
                 model_params: Dict = {},
                 sampling_params: Dict = {},
                 **kwargs):

Import

from data_juicer.ops.aggregator.nested_aggregator import NestedAggregator

I/O Contract

Inputs

Name	Type	Required	Description
api_model	str	No	API model name. Default: "gpt-4o"
input_key	str	No	Input key in the meta field. Default: "event_description"
output_key	str	No	Output key in the aggregation field. Default: same as input_key
max_token_num	PositiveInt	No	Max total tokens for sub-documents. Default: None (unlimited)
api_endpoint	str	No	URL endpoint for the API
system_prompt	str	No	Custom system prompt for summarization
sub_doc_template	str	No	Template for input text in each sample
input_template	str	No	The input template for the LLM call
try_num	PositiveInt	No	Number of retry attempts. Default: 3
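The role of try_num can be pictured as a bounded retry loop around each LLM call. The following is a minimal sketch under assumptions, not the operator's actual code: call_with_retries and flaky_call are hypothetical names, and returning an empty string after the last failed attempt is this sketch's choice.

```python
from typing import Callable


def call_with_retries(call: Callable[[], str], try_num: int = 3) -> str:
    # Invoke call() up to try_num times and return the first non-empty result.
    for _ in range(try_num):
        try:
            result = call()
        except Exception:
            continue  # a real operator would likely log the failure here
        if result:
            return result
    return ""  # all attempts failed (sketch's choice of fallback)


# Hypothetical flaky call: fails twice, then succeeds on the third attempt.
attempts = []


def flaky_call() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")
    return "summary"


outcome = call_with_retries(flaky_call, try_num=3)
```

A larger try_num, such as the value 5 in the configuration under Usage Examples, trades extra API calls for tolerance of transient failures.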

Outputs

Name	Type	Description
sample[Fields.batch_meta][output_key]	str	Recursively summarized single document from all input fragments

Usage Examples

process:
  - nested_aggregator:
      api_model: "gpt-4o"
      max_token_num: 4096
      try_num: 5

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
