
Implementation:Datajuicer Data juicer NestedAggregator

From Leeroopedia
Knowledge Sources
Domains: Data_Processing, Aggregation
Last Updated: 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for recursively aggregating multiple document fragments into a single summary.

Description

NestedAggregator reads sub-documents from the sample metadata (default key: event_description), formats each one with a sub-document template, and uses avg_split_string_list_under_limit to split them into groups that fit within the token limit. It then calls the LLM (default: gpt-4o) to summarize each group. If this yields more than one summary, the operator recurses on the summaries until a single one remains, keeping the result at approximately the average length of the original fragments. The prompts are written in Chinese and include a "Journey to the West" example that demonstrates the summarization style. The operator is a core aggregation building block: other aggregators, such as EntityAttributeAggregator, use it to handle documents that exceed token limits.
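The recursion described above can be sketched in a few lines. This is a minimal illustration, not Data-Juicer's implementation: split_under_limit is a hypothetical greedy stand-in for avg_split_string_list_under_limit (the real helper may balance groups differently), and summarize and count_tokens are caller-supplied callables standing in for the LLM call and the tokenizer. The pairwise-merge fallback is an assumption added so the sketch always terminates.

```python
from typing import Callable, List


def split_under_limit(docs: List[str], token_nums: List[int],
                      max_tokens: int) -> List[List[str]]:
    """Greedily pack consecutive docs into groups within max_tokens.

    Hypothetical stand-in for avg_split_string_list_under_limit.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for doc, n in zip(docs, token_nums):
        # Start a new group when adding this doc would exceed the limit.
        if current and current_tokens + n > max_tokens:
            groups.append(current)
            current, current_tokens = [], 0
        current.append(doc)
        current_tokens += n
    if current:
        groups.append(current)
    return groups


def nested_summarize(docs: List[str],
                     summarize: Callable[[List[str]], str],
                     count_tokens: Callable[[str], int],
                     max_tokens: int) -> str:
    """Summarize groups of docs, then recurse on the summaries."""
    if not docs:
        return ""
    if len(docs) == 1:
        return docs[0]
    token_nums = [count_tokens(d) for d in docs]
    groups = split_under_limit(docs, token_nums, max_tokens)
    # Assumption: if every doc landed in its own group, merge pairs so
    # that each level of recursion strictly reduces the number of docs.
    if len(groups) == len(docs):
        groups = [docs[i:i + 2] for i in range(0, len(docs), 2)]
    summaries = [summarize(group) for group in groups]
    return nested_summarize(summaries, summarize, count_tokens, max_tokens)
```

In practice summarize would wrap the LLM call and count_tokens the model's tokenizer; any callable with the same shape works for testing the control flow.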

Usage

Use when you need to aggregate multiple document fragments or text samples into a single coherent summary using a recursive map-reduce style summarization strategy, especially when the combined text exceeds LLM token limits.

Code Reference

Source Location

Signature

@OPERATORS.register_module("nested_aggregator")
class NestedAggregator(Aggregator):
    def __init__(self, api_model: str = "gpt-4o",
                 input_key: str = MetaKeys.event_description,
                 output_key: str = None,
                 max_token_num: Optional[PositiveInt] = None,
                 *, api_endpoint: Optional[str] = None,
                 response_path: Optional[str] = None,
                 system_prompt: Optional[str] = None,
                 sub_doc_template: Optional[str] = None,
                 input_template: Optional[str] = None,
                 try_num: PositiveInt = 3,
                 model_params: Dict = {},
                 sampling_params: Dict = {},
                 **kwargs):

Import

from data_juicer.ops.aggregator.nested_aggregator import NestedAggregator

I/O Contract

Inputs

Name | Type | Required | Description
api_model | str | No | API model name. Default: "gpt-4o"
input_key | str | No | Input key in the meta field. Default: "event_description"
output_key | str | No | Output key in the aggregation field. Default: same as input_key
max_token_num | PositiveInt | No | Max total tokens for sub-documents. Default: None (unlimited)
api_endpoint | str | No | URL endpoint for the API
system_prompt | str | No | Custom system prompt for summarization
sub_doc_template | str | No | Template for input text in each sample
input_template | str | No | The input template for the LLM call
try_num | PositiveInt | No | Number of retry attempts. Default: 3
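The try_num parameter implies a bounded retry loop around the LLM call. The sketch below shows one plausible shape for such a loop; call_with_retries is a hypothetical helper, not part of Data-Juicer, and it assumes try_num counts total attempts (the source does not say whether the first call is included).

```python
import logging
from typing import Callable, Optional


def call_with_retries(call: Callable[[], str],
                      try_num: int = 3) -> Optional[str]:
    """Invoke call up to try_num times.

    Returns the first non-empty result, or None if every attempt
    raises or returns an empty value. Hypothetical helper.
    """
    for attempt in range(1, try_num + 1):
        try:
            result = call()
        except Exception as exc:
            # Log and move on to the next attempt.
            logging.warning("attempt %d failed: %s", attempt, exc)
            continue
        if result:
            return result
    return None
```

Returning None rather than raising lets the caller decide how to handle a fragment group that could not be summarized.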

Outputs

Name | Type | Description
sample[Fields.batch_meta][output_key] | str | Recursively summarized single document from all input fragments
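To make the contract concrete, the following sketch shows where the input is read and where the output is written. It rests on assumptions: the string keys "meta" and "batch_meta" are hypothetical stand-ins for Fields.meta and Fields.batch_meta, the batched sample is assumed to hold one metadata dict per fragment, and aggregate_sample is an illustrative function rather than the operator's own method.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-ins for Data-Juicer's Fields.meta / Fields.batch_meta.
META_FIELD = "meta"
BATCH_META_FIELD = "batch_meta"


def aggregate_sample(sample: Dict,
                     summarize: Callable[[List[str]], str],
                     input_key: str = "event_description",
                     output_key: Optional[str] = None) -> Dict:
    """Collect sub-documents from the metadata and store one summary."""
    # output_key falls back to input_key, as in the Inputs table.
    output_key = output_key or input_key
    sub_docs = [meta[input_key]
                for meta in sample.get(META_FIELD, [])
                if input_key in meta]
    sample.setdefault(BATCH_META_FIELD, {})[output_key] = summarize(sub_docs)
    return sample
```

Here summarize stands in for the recursive summarization step; the point of the sketch is only the key layout of the sample.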

Usage Examples

process:
  - nested_aggregator:
      api_model: "gpt-4o"
      max_token_num: 4096
      try_num: 5
