Principle:Datajuicer Data juicer Data Aggregation

Domains	Data_Processing, LLM
Last Updated	2026-02-14 17:00 GMT

Overview

An LLM-driven batch processing pattern that summarizes, consolidates, or extracts structured information from groups of related samples, operating on batched data produced by Grouper operators.

Pattern

Aggregator operators extend the Aggregator base class and process batched samples (groups of related records) rather than individual samples. The pattern follows five steps:

1. Input Collection -- Read sub-documents from sample metadata (typically from a configurable input_key in the meta field), collecting all text fragments from the batch.

2. Token-Aware Splitting -- Split collected documents into groups that fit within LLM token limits using avg_split_string_list_under_limit, enabling processing of arbitrarily large document collections.

3. LLM-Based Processing -- Send formatted prompts to an LLM API (default: gpt-4o) with task-specific system prompts. The LLM performs summarization, entity extraction, tag consolidation, or relationship ranking depending on the specific aggregator.

4. Output Parsing -- Parse LLM responses using regex patterns to extract structured output, with configurable retry logic for handling malformed responses.

5. Result Storage -- Store results in batch metadata under a configurable output_key.
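The five steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not Data-Juicer's implementation: the batch layout (a "meta" list of per-record dicts and a "batch_meta" dict), the helper names split_under_limit, parse_with_retry, and aggregate, the prompt text, and the <summary> output pattern are all hypothetical. In particular, split_under_limit is a greedy stand-in for avg_split_string_list_under_limit, whose real signature and balancing behaviour may differ, and call_llm stands in for whatever API client the operator is configured with.

```python
import re
from typing import Callable, Dict, List


def split_under_limit(docs: List[str], limit: int,
                      count_tokens: Callable[[str], int]) -> List[List[str]]:
    """Greedily pack docs into groups whose token total stays within limit.

    Hypothetical stand-in for avg_split_string_list_under_limit.
    A single doc larger than the limit still gets its own group.
    """
    groups, current, used = [], [], 0
    for doc in docs:
        n = count_tokens(doc)
        if current and used + n > limit:
            groups.append(current)
            current, used = [], 0
        current.append(doc)
        used += n
    if current:
        groups.append(current)
    return groups


def parse_with_retry(call_llm: Callable[[str], str], prompt: str,
                     pattern: str, max_tries: int = 3) -> str:
    """Call the model until its response matches pattern, else return ''."""
    for _ in range(max_tries):
        match = re.search(pattern, call_llm(prompt), re.DOTALL)
        if match:
            return match.group(1).strip()
    return ""


def aggregate(batch: Dict, call_llm: Callable[[str], str],
              input_key: str, output_key: str, limit: int) -> Dict:
    # 1. Input collection: gather fragments from each record's metadata.
    docs = [m[input_key] for m in batch["meta"] if input_key in m]
    # 2. Token-aware splitting (whitespace word count as a crude token proxy).
    groups = split_under_limit(docs, limit, lambda s: len(s.split()))
    results = []
    for group in groups:
        # 3. LLM-based processing with a task-specific prompt.
        prompt = "Summarize:\n" + "\n".join(group)
        # 4. Output parsing with retries on malformed responses.
        results.append(
            parse_with_retry(call_llm, prompt, r"<summary>(.*?)</summary>"))
    # 5. Result storage under the configurable output key.
    batch.setdefault("batch_meta", {})[output_key] = results
    return batch
```

Passing the model call in as a plain callable keeps the sketch testable without network access; a real aggregator would construct its client from the configured API model and endpoint.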

All aggregators use the @OPERATORS.register_module() decorator, support configurable API models/endpoints, and use Chinese-language prompt templates as defaults.
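To illustrate what decorator-based registration of this kind typically looks like, here is a toy sketch. The Registry class, the simplified Aggregator base, the operator name toy_concat_aggregator, and the "frag", "joined", and "batch_meta" keys are all assumptions made for the example; Data-Juicer's own registry and base class have more responsibilities and may differ in detail. Only the gpt-4o default and the configurable model/endpoint idea come from the description above.

```python
from typing import Dict, Optional


class Registry:
    """Toy name-to-class registry (an assumption, not Data-Juicer's code)."""

    def __init__(self, name: str):
        self.name = name
        self._modules: Dict[str, type] = {}

    def register_module(self, module_name: Optional[str] = None):
        def decorator(cls):
            key = module_name or cls.__name__
            if key in self._modules:
                raise KeyError(f"{key} is already registered in {self.name}")
            self._modules[key] = cls
            return cls  # the class itself is returned unchanged
        return decorator

    def get(self, key: str):
        return self._modules.get(key)


OPERATORS = Registry("Operators")


class Aggregator:
    """Simplified base class holding the API configuration."""

    def __init__(self, api_model: str = "gpt-4o",
                 api_endpoint: Optional[str] = None):
        self.api_model = api_model
        self.api_endpoint = api_endpoint

    def process_single(self, sample: Dict) -> Dict:
        raise NotImplementedError


@OPERATORS.register_module("toy_concat_aggregator")
class ToyConcatAggregator(Aggregator):
    """Joins fragments without calling any model, to keep the sketch offline."""

    def process_single(self, sample: Dict) -> Dict:
        frags = [m["frag"] for m in sample["meta"] if "frag" in m]
        sample.setdefault("batch_meta", {})["joined"] = " ".join(frags)
        return sample
```

With a registry like this, a pipeline configuration can refer to an operator by its string name and look up the class at run time.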

Key Characteristics

Operates on batched samples (output of Grouper operators), not individual records
LLM-powered analysis with configurable API model and endpoint
Token-aware document splitting for handling large input collections
Recursive map-reduce summarization when content exceeds token limits
Regex-based output parsing with retry logic
Configurable prompt templates (system prompt, input template, output pattern)
Results stored in batch metadata for downstream processing
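The recursive map-reduce characteristic can be sketched as below. This is an illustrative outline only: the function names pack_groups and recursive_summarize, the max_depth guard, and the use of a caller-supplied summarize callable in place of a real LLM call are assumptions, not the library's actual code.

```python
from typing import Callable, List


def pack_groups(docs: List[str], limit: int,
                count_tokens: Callable[[str], int]) -> List[List[str]]:
    """Greedily pack docs into groups that fit within the token limit."""
    groups, current, used = [], [], 0
    for doc in docs:
        n = count_tokens(doc)
        if current and used + n > limit:
            groups.append(current)
            current, used = [], 0
        current.append(doc)
        used += n
    if current:
        groups.append(current)
    return groups


def recursive_summarize(docs: List[str],
                        summarize: Callable[[List[str]], str],
                        count_tokens: Callable[[str], int],
                        limit: int, max_depth: int = 5) -> str:
    """Summarize docs, recursing while the combined input exceeds the limit."""
    total = sum(count_tokens(d) for d in docs)
    # Base case: everything fits in one call, nothing is left to split,
    # or the depth guard has been reached.
    if total <= limit or len(docs) <= 1 or max_depth == 0:
        return summarize(docs)
    # Map: summarize each group that fits within the limit.
    partials = [summarize(g) for g in pack_groups(docs, limit, count_tokens)]
    # Reduce: treat the partial summaries as the new input and repeat.
    return recursive_summarize(partials, summarize, count_tokens,
                               limit, max_depth - 1)
```

The depth guard is there because nothing forces a model's summary to be shorter than its input; without it, a summarizer that does not compress could recurse indefinitely.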

Implementations

Related Pages
