Principle:Datajuicer Data juicer Data Aggregation

From Leeroopedia
Domains Data_Processing, LLM
Last Updated 2026-02-14 17:00 GMT

Overview

An LLM-driven batch processing pattern that summarizes, consolidates, or extracts structured information from groups of related samples, operating on batched data produced by Grouper operators.

Pattern

Aggregator operators extend the Aggregator base class and process batched samples (groups of related records) rather than individual samples. The pattern has five steps:

1. Input Collection -- Read sub-documents from sample metadata (typically from a configurable input_key in the meta field), collecting all text fragments from the batch.

2. Token-Aware Splitting -- Split collected documents into groups that fit within LLM token limits using avg_split_string_list_under_limit, enabling processing of arbitrarily large document collections.

3. LLM-Based Processing -- Send formatted prompts to an LLM API (default: gpt-4o) with task-specific system prompts. The LLM performs summarization, entity extraction, tag consolidation, or relationship ranking depending on the specific aggregator.

4. Output Parsing -- Parse LLM responses using regex patterns to extract structured output, with configurable retry logic for handling malformed responses.

5. Result Storage -- Store results in batch metadata under a configurable output_key.
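The five steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not Data-Juicer's implementation: the batch layout (a "meta" list plus a "batch_meta" dict), the whitespace token counter, the greedy split_under_limit helper (a simplified stand-in for avg_split_string_list_under_limit, whose real signature is not shown here), the SUMMARY: output pattern, and the fake_llm callable are all hypothetical.

```python
import re
from typing import Callable, Dict, List


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count stands in for a real LLM tokenizer.
    return len(text.split())


def split_under_limit(docs: List[str], max_tokens: int) -> List[List[str]]:
    # Step 2: greedy, order-preserving grouping (hypothetical stand-in helper).
    groups: List[List[str]] = []
    current: List[str] = []
    used = 0
    for doc in docs:
        n = count_tokens(doc)
        if current and used + n > max_tokens:
            groups.append(current)
            current, used = [], 0
        current.append(doc)  # an oversized document still gets its own group
        used += n
    if current:
        groups.append(current)
    return groups


def call_and_parse(call_llm: Callable[[str], str], prompt: str,
                   pattern: str, try_num: int = 3) -> str:
    # Steps 3 and 4: call the model, parse with a regex, retry on malformed output.
    for _ in range(try_num):
        match = re.search(pattern, call_llm(prompt), re.DOTALL)
        if match:
            return match.group(1).strip()
    return ""  # every attempt was malformed


def aggregate(batch: Dict, call_llm: Callable[[str], str], input_key: str,
              output_key: str, max_tokens: int) -> Dict:
    # Step 1: collect sub-documents from each record's metadata.
    docs = [m[input_key] for m in batch["meta"] if m.get(input_key)]
    results = []
    for group in split_under_limit(docs, max_tokens):
        prompt = "\n".join(group)
        results.append(call_and_parse(call_llm, prompt, r"SUMMARY:\s*(.*)"))
    # Step 5: store the results under the configurable output key.
    batch["batch_meta"][output_key] = results
    return batch


def fake_llm(prompt: str) -> str:
    # Deterministic stand-in for an LLM API call.
    return "SUMMARY: " + prompt.upper().replace("\n", " | ")


batch = {
    "meta": [{"notes": "a b c"}, {"notes": "d e f"}, {"notes": "g h"}],
    "batch_meta": {},
}
aggregate(batch, fake_llm, input_key="notes", output_key="summaries", max_tokens=6)
print(batch["batch_meta"]["summaries"])
```

In this sketch the splitter, the model call, and the parser are separate functions, so each can be swapped independently (for example, replacing fake_llm with a real API client).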

All aggregators are registered with the @OPERATORS.register_module() decorator, accept a configurable API model and endpoint, and default to Chinese-language prompt templates.
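To illustrate how decorator-based registration works in general, here is a minimal sketch. The Registry class, the Aggregator base class, and ToyAggregator below are simplified stand-ins written for this example; they are not Data-Juicer's own classes, and the constructor parameters api_model and api_endpoint are assumed names. Only the gpt-4o default comes from the description above.

```python
from typing import Callable, Dict, Optional


class Registry:
    """Hypothetical stand-in for an operator registry."""

    def __init__(self, name: str):
        self.name = name
        self.modules: Dict[str, type] = {}

    def register_module(self, module_name: Optional[str] = None) -> Callable[[type], type]:
        def decorator(cls: type) -> type:
            key = module_name or cls.__name__
            if key in self.modules:
                raise KeyError(f"{key} is already registered in {self.name}")
            self.modules[key] = cls
            return cls  # the class itself is returned unchanged

        return decorator


OPERATORS = Registry("operators")


class Aggregator:
    """Simplified base class: holds the API configuration shared by aggregators."""

    def __init__(self, api_model: str = "gpt-4o", api_endpoint: Optional[str] = None):
        self.api_model = api_model
        self.api_endpoint = api_endpoint

    def process_single(self, sample: dict) -> dict:
        raise NotImplementedError


@OPERATORS.register_module("toy_aggregator")
class ToyAggregator(Aggregator):
    def process_single(self, sample: dict) -> dict:
        # Placeholder: a real aggregator would call the LLM here.
        sample.setdefault("batch_meta", {})["toy"] = len(sample.get("meta", []))
        return sample


# Look the operator up by name, as a config-driven pipeline might.
op = OPERATORS.modules["toy_aggregator"]()
print(op.api_model, op.process_single({"meta": [{}, {}]})["batch_meta"])
```

Registering by name lets a pipeline configuration refer to operators as strings and instantiate them without importing each class explicitly.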

Key Characteristics

  • Operates on batched samples (output of Grouper operators), not individual records
  • LLM-powered analysis with configurable API model and endpoint
  • Token-aware document splitting for handling large input collections
  • Recursive map-reduce summarization when content exceeds token limits
  • Regex-based output parsing with retry logic
  • Configurable prompt templates (system prompt, input template, output pattern)
  • Results stored in batch metadata for downstream processing
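The recursive map-reduce characteristic listed above can be sketched as follows. This is an illustrative sketch under assumptions rather than the library's code: the whitespace token counter, the greedy group_under_limit helper, and the first_words summarizer (which stands in for an LLM call) are all hypothetical.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count stands in for a real tokenizer.
    return len(text.split())


def group_under_limit(docs: List[str], max_tokens: int) -> List[List[str]]:
    # Greedy, order-preserving grouping (hypothetical helper).
    groups: List[List[str]] = []
    current: List[str] = []
    used = 0
    for doc in docs:
        n = count_tokens(doc)
        if current and used + n > max_tokens:
            groups.append(current)
            current, used = [], 0
        current.append(doc)
        used += n
    if current:
        groups.append(current)
    return groups


def recursive_summarize(docs: List[str],
                        summarize: Callable[[List[str]], str],
                        max_tokens: int) -> str:
    if not docs:
        return ""
    # Base case: everything fits in one prompt.
    if sum(count_tokens(d) for d in docs) <= max_tokens:
        return summarize(docs)
    groups = group_under_limit(docs, max_tokens)
    if len(groups) >= len(docs):
        # Grouping made no progress (every document is large on its own);
        # fall back to a single call so the recursion always terminates.
        return summarize(docs)
    # Map: summarize each group. Reduce: recurse on the partial summaries.
    partial = [summarize(group) for group in groups]
    return recursive_summarize(partial, summarize, max_tokens)


calls: List[List[str]] = []


def first_words(docs: List[str]) -> str:
    # Toy summarizer standing in for an LLM: keep the first word of each document.
    calls.append(list(docs))
    return " ".join(d.split()[0] for d in docs)


docs = ["alpha one two", "beta three four", "gamma five six", "delta seven eight"]
result = recursive_summarize(docs, first_words, max_tokens=6)
print(result, len(calls))
```

Each recursion level works on strictly fewer documents than the one before, which is what guarantees termination even if the summarizer fails to shorten its input.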
