Principle:Datajuicer Data juicer Data Aggregation
| Domains | Data_Processing, LLM |
|---|---|
| Last Updated | 2026-02-14 17:00 GMT |
Overview
An LLM-driven batch-processing pattern that summarizes, consolidates, or extracts structured information from groups of related samples. It operates on the batched data produced by Grouper operators.
Pattern
Aggregator operators extend the Aggregator base class and process batched samples (groups of related records) rather than individual samples. The pattern follows five steps:
1. Input Collection -- Read sub-documents from sample metadata (typically from a configurable input_key in the meta field), collecting all text fragments from the batch.
2. Token-Aware Splitting -- Split collected documents into groups that fit within LLM token limits using avg_split_string_list_under_limit, enabling processing of arbitrarily large document collections.
3. LLM-Based Processing -- Send formatted prompts to an LLM API (default: gpt-4o) with task-specific system prompts. The LLM performs summarization, entity extraction, tag consolidation, or relationship ranking depending on the specific aggregator.
4. Output Parsing -- Parse LLM responses using regex patterns to extract structured output, with configurable retry logic for handling malformed responses.
5. Result Storage -- Store results in batch metadata under a configurable output_key.
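As an illustration of step 2, the following is a minimal sketch of token-aware splitting. It is not the library's avg_split_string_list_under_limit: the helper names `count_tokens` and `split_under_limit` are hypothetical, the whitespace token count stands in for a real tokenizer, and the greedy packing used here may differ from the averaging strategy that the library helper's name suggests.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Hypothetical tokenizer stand-in: one token per whitespace-separated word."""
    return len(text.split())


def split_under_limit(
    docs: List[str],
    max_tokens: int,
    token_counter: Callable[[str], int] = count_tokens,
) -> List[List[str]]:
    """Greedily pack consecutive documents into groups of at most max_tokens.

    A single document that exceeds the limit on its own is kept in a group
    by itself rather than being dropped or truncated.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for doc in docs:
        n = token_counter(doc)
        # Close the current group if adding this document would overflow it.
        if current and current_tokens + n > max_tokens:
            groups.append(current)
            current, current_tokens = [], 0
        current.append(doc)
        current_tokens += n
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    docs = ["a b c", "d e", "f g h i", "j"]
    for group in split_under_limit(docs, max_tokens=5):
        print(group)
```

Each resulting group can then be formatted into a single prompt, so the number of LLM calls grows with the size of the collection instead of any one call exceeding the model's limit.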
All aggregators are registered with the @OPERATORS.register_module() decorator, support configurable API models and endpoints, and default to Chinese-language prompt templates.
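Steps 3 and 4 can be sketched as a call-parse-retry loop. This is an assumption-laden illustration rather than the aggregators' actual code: `OUTPUT_PATTERN`, `parse_output`, `call_with_retry`, and the `try_num` parameter are hypothetical names, and `call_llm` is any caller-supplied function that takes a prompt and returns the model's raw text.

```python
import re
from typing import Callable, Optional

# Hypothetical output pattern: the model is asked to answer under a
# "# Summary" heading; everything up to the next heading is captured.
OUTPUT_PATTERN = r"#\s*Summary\s*\n(.*?)(?=\n#|\Z)"


def parse_output(raw: str, pattern: str = OUTPUT_PATTERN) -> Optional[str]:
    """Return the captured text, or None if the response is malformed or empty."""
    match = re.search(pattern, raw, flags=re.DOTALL)
    if match is None:
        return None
    text = match.group(1).strip()
    return text or None


def call_with_retry(
    call_llm: Callable[[str], str],
    prompt: str,
    try_num: int = 3,
) -> str:
    """Call the model up to try_num times until a response parses.

    Returns an empty string if every attempt fails or is malformed.
    """
    for _ in range(try_num):
        try:
            raw = call_llm(prompt)
        except Exception:
            # Treat API errors like malformed output and try again.
            continue
        parsed = parse_output(raw)
        if parsed is not None:
            return parsed
    return ""


if __name__ == "__main__":
    # Fake model: the first response is malformed, the second is well formed.
    responses = iter(["garbage", "# Summary\nAlice met Bob.\n# Notes\nignored"])
    print(call_with_retry(lambda prompt: next(responses), "Summarize the batch."))
```

Returning an empty string on failure keeps the batch flowing through the pipeline; a stricter variant could raise instead, depending on how downstream operators treat missing results.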
Key Characteristics
- Operates on batched samples (output of Grouper operators), not individual records
- LLM-powered analysis with configurable API model and endpoint
- Token-aware document splitting for handling large input collections
- Recursive map-reduce summarization when content exceeds token limits
- Regex-based output parsing with retry logic
- Configurable prompt templates (system prompt, input template, output pattern)
- Results stored in batch metadata for downstream processing
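The recursive map-reduce characteristic listed above can be sketched as follows. This is a hedged illustration under stated assumptions, not the library's implementation: `pack`, `recursive_summarize`, and `max_depth` are hypothetical names, tokens are counted by whitespace, and `summarize` is a caller-supplied function that would wrap the LLM call in practice.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Hypothetical tokenizer stand-in: one token per whitespace-separated word."""
    return len(text.split())


def pack(docs: List[str], max_tokens: int) -> List[List[str]]:
    """Greedily pack consecutive documents into groups of at most max_tokens."""
    groups: List[List[str]] = []
    current: List[str] = []
    used = 0
    for doc in docs:
        n = count_tokens(doc)
        if current and used + n > max_tokens:
            groups.append(current)
            current, used = [], 0
        current.append(doc)
        used += n
    if current:
        groups.append(current)
    return groups


def recursive_summarize(
    docs: List[str],
    summarize: Callable[[List[str]], str],
    max_tokens: int,
    max_depth: int = 10,
) -> str:
    """Map: summarize each group. Reduce: recurse on the partial summaries."""
    if not docs:
        return ""
    groups = pack(docs, max_tokens)
    if len(groups) == 1:
        # Everything fits in one prompt: a single call finishes the job.
        return summarize(groups[0])
    partial = [summarize(group) for group in groups]
    if max_depth <= 0:
        # Depth guard: stop recursing if the summaries are not shrinking.
        return "\n".join(partial)
    return recursive_summarize(partial, summarize, max_tokens, max_depth - 1)


if __name__ == "__main__":
    # Fake summarizer: keep only the first word of each document.
    def fake(group: List[str]) -> str:
        return " ".join(d.split()[0] for d in group)

    docs = ["a1 a2 a3", "b1 b2 b3", "c1 c2 c3", "d1 d2 d3"]
    print(recursive_summarize(docs, fake, max_tokens=6))
```

The depth guard matters because nothing forces an LLM summary to be shorter than its input; without it, a verbose model could keep the recursion from converging.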
Implementations
- Implementation:Datajuicer_Data_juicer_NestedAggregator
- Implementation:Datajuicer_Data_juicer_EntityAttributeAggregator
- Implementation:Datajuicer_Data_juicer_MetaTagsAggregator
- Implementation:Datajuicer_Data_juicer_MostRelevantEntitiesAggregator