Implementation:Datajuicer Data juicer NestedAggregator
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Aggregation |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for recursively aggregating multiple document fragments into a single summary.
Description
NestedAggregator reads sub-documents from the sample metadata (default key: event_description), formats each one with a sub-document template, and uses avg_split_string_list_under_limit to split them into groups that fit within the token limit. It then calls the LLM (default: gpt-4o) to summarize each group. If this step yields more than one summary, the operator recurses on the summaries until a single one remains. The final summary is meant to stay at approximately the average length of the original fragments. The default prompts are written in Chinese and include a "Journey to the West" example that demonstrates the summarization style. Other aggregators, such as EntityAttributeAggregator, use this operator as a core building block for handling documents that exceed token limits.
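The split-then-recurse flow described above can be sketched in a few lines. This is a minimal illustration, not the operator's actual code: `count_tokens`, `split_under_limit`, `nested_aggregate`, and `stub_summarize` are hypothetical names, the tokenizer is a whitespace word count, the greedy packing stands in for avg_split_string_list_under_limit (which may balance groups differently), and the summarizer is a stub in place of the LLM call.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real model tokens.
    return len(text.split())


def split_under_limit(docs: List[str], max_tokens: int) -> List[List[str]]:
    """Greedily pack consecutive docs into groups whose token total fits max_tokens."""
    groups: List[List[str]] = []
    current: List[str] = []
    used = 0
    for doc in docs:
        n = count_tokens(doc)
        if current and used + n > max_tokens:
            groups.append(current)
            current, used = [], 0
        current.append(doc)
        used += n
    if current:
        groups.append(current)
    return groups


def nested_aggregate(
    docs: List[str],
    summarize: Callable[[List[str]], str],
    max_tokens: int,
) -> str:
    """Summarize each group, then recurse on the summaries until one remains."""
    if not docs:
        return ""
    if len(docs) == 1:
        return docs[0]
    groups = split_under_limit(docs, max_tokens)
    if len(groups) == 1:
        return summarize(groups[0])
    if len(groups) == len(docs):
        # Grouping reduced nothing (every doc sits alone), so summarize
        # everything in one call to guarantee termination.
        return summarize(docs)
    summaries = [summarize(group) for group in groups]
    return nested_aggregate(summaries, summarize, max_tokens)


def stub_summarize(group: List[str]) -> str:
    # Stub in place of the LLM: keep the first word of each fragment.
    return " ".join(doc.split()[0] for doc in group)
```

With four three-word fragments and a limit of six tokens, the first pass produces two group summaries and the second pass merges them into one.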
Usage
Use when you need to aggregate multiple document fragments or text samples into a single coherent summary using a recursive map-reduce style summarization strategy, especially when the combined text exceeds LLM token limits.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/aggregator/nested_aggregator.py
Signature
@OPERATORS.register_module("nested_aggregator")
class NestedAggregator(Aggregator):
    def __init__(self, api_model: str = "gpt-4o",
                 input_key: str = MetaKeys.event_description,
                 output_key: str = None,
                 max_token_num: Optional[PositiveInt] = None,
                 *, api_endpoint: Optional[str] = None,
                 response_path: Optional[str] = None,
                 system_prompt: Optional[str] = None,
                 sub_doc_template: Optional[str] = None,
                 input_template: Optional[str] = None,
                 try_num: PositiveInt = 3,
                 model_params: Dict = {},
                 sampling_params: Dict = {},
                 **kwargs):
Import
from data_juicer.ops.aggregator.nested_aggregator import NestedAggregator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| api_model | str | No | API model name. Default: "gpt-4o" |
| input_key | str | No | Input key in the meta field. Default: "event_description" |
| output_key | str | No | Output key in the aggregation field. Default: same as input_key |
| max_token_num | PositiveInt | No | Max total tokens for sub-documents. Default: None (unlimited) |
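| api_endpoint | str | No | URL endpoint for the API |
| response_path | str | No | Path used to extract content from the API response |
| system_prompt | str | No | Custom system prompt for summarization |
| sub_doc_template | str | No | Template for input text in each sample |
| input_template | str | No | The input template for the LLM call |
| try_num | PositiveInt | No | Number of retry attempts. Default: 3 |
| model_params | Dict | No | Parameters for initializing the API model. Default: {} |
| sampling_params | Dict | No | Extra parameters passed to the API call. Default: {} |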
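
The try_num parameter bounds how often a summarization call is attempted. The following is a rough sketch of such a retry loop, under the assumption that an attempt counts as failed when it raises or returns an empty string; `call_with_retries` is a hypothetical helper, not part of Data-Juicer, and the operator's real error handling may differ.

```python
from typing import Callable, Optional


def call_with_retries(call: Callable[[], str], try_num: int = 3) -> Optional[str]:
    """Invoke `call` up to `try_num` times and return the first non-empty result."""
    for _ in range(try_num):
        try:
            result = call()
        except Exception:
            # Assumption: any exception is treated as a transient failure.
            continue
        if result:
            return result
    # Every attempt failed or came back empty.
    return None
```

Returning None on exhaustion is a choice made for this sketch; a caller could equally fall back to an empty string or re-raise the last error.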
Outputs
| Name | Type | Description |
|---|---|---|
| sample[Fields.batch_meta][output_key] | str | Recursively summarized single document from all input fragments |
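
The contract above can be illustrated with plain dictionaries. This is a sketch under stated assumptions: `META` and `BATCH_META` are placeholder strings standing in for Fields.meta and Fields.batch_meta (the real constants are not shown here), a batched sample is assumed to carry a list of per-document metadata dicts, and `aggregate_sample` is a hypothetical helper that takes any summarizer callable in place of the LLM.

```python
from typing import Callable, Dict, List, Optional

META = "meta"              # placeholder for Fields.meta (assumed value)
BATCH_META = "batch_meta"  # placeholder for Fields.batch_meta (assumed value)


def aggregate_sample(
    sample: Dict,
    summarize: Callable[[List[str]], str],
    input_key: str = "event_description",
    output_key: Optional[str] = None,
) -> Dict:
    """Collect sub-documents from the metadata and store one summary in batch meta."""
    # output_key falls back to input_key, as in the Inputs table.
    output_key = output_key or input_key
    # Skip metadata entries that lack the input key or hold an empty value.
    sub_docs = [m[input_key] for m in sample.get(META, []) if m.get(input_key)]
    sample.setdefault(BATCH_META, {})[output_key] = summarize(sub_docs)
    return sample
```

For example, calling `aggregate_sample(sample, summarize, output_key="summary")` would leave the result under `sample[BATCH_META]["summary"]` rather than under the input key.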
Usage Examples
process:
  - nested_aggregator:
      api_model: "gpt-4o"
      max_token_num: 4096
      try_num: 5