Implementation:Datajuicer Data juicer EntityAttributeAggregator

From Leeroopedia
Knowledge Sources
Domains Data_Processing, Aggregation
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for summarizing a specific attribute of an entity from a collection of documents.

Description

EntityAttributeAggregator aggregates and summarizes a specific attribute of a given entity from a collection of related documents using an LLM API. It collects sub-documents from the input_key (default: event_description) in the sample metadata and splits them with avg_split_string_list_under_limit so that each group fits within the token limit, then calls NestedAggregator to summarize recursively if needed. The final summary is generated by an LLM (default: gpt-4o) guided by a system prompt built from Chinese-language templates that name the entity and the attribute. The output is parsed with a regex pattern and stored in the batch metadata under output_key. The operator supports a configurable word limit, retry logic, and customizable prompts.
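The splitting step described above can be illustrated with a minimal sketch. This is not the Data-Juicer implementation of avg_split_string_list_under_limit; the function name split_under_limit, its signature, and the whitespace-based token count are all assumptions chosen for illustration. It shows one plausible way to cut an ordered list of sub-documents into roughly balanced groups that each stay under a token budget.

```python
import math
from typing import List


def split_under_limit(docs: List[str], token_counts: List[int],
                      max_tokens: int) -> List[List[str]]:
    """Hypothetical helper: split docs into consecutive, roughly balanced groups.

    Each group stays at or under max_tokens, except that a single document
    larger than the budget is placed alone in its own group.
    """
    total = sum(token_counts)
    if total <= max_tokens:
        return [list(docs)]

    # Aim for the smallest number of groups that could fit the budget,
    # then fill each group up to the average size.
    n_groups = math.ceil(total / max_tokens)
    target = total / n_groups

    groups: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for doc, n in zip(docs, token_counts):
        over_limit = current_tokens + n > max_tokens
        reached_target = current_tokens >= target
        if current and (over_limit or reached_target):
            groups.append(current)
            current, current_tokens = [], 0
        current.append(doc)
        current_tokens += n
    if current:
        groups.append(current)
    return groups


# Whitespace word counts stand in for a real tokenizer (assumption).
docs = ["a b c", "d e", "f g h i", "j"]
counts = [len(d.split()) for d in docs]
groups = split_under_limit(docs, counts, max_tokens=5)
```

Each resulting group could then be summarized on its own, and the partial summaries merged in a further pass, which is the kind of recursive aggregation the description attributes to NestedAggregator.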

Usage

Use when you need to extract and summarize a specific attribute (such as background, personality, or relationships) of a particular entity from a collection of related documents for building structured knowledge bases or character profiles.

Code Reference

Source Location

Signature

@OPERATORS.register_module("entity_attribute_aggregator")
class EntityAttributeAggregator(Aggregator):
    def __init__(self, api_model: str = "gpt-4o",
                 entity: str = None,
                 attribute: str = None,
                 input_key: str = MetaKeys.event_description,
                 output_key: str = BatchMetaKeys.entity_attribute,
                 word_limit: PositiveInt = 100,
                 max_token_num: Optional[PositiveInt] = None,
                 *, api_endpoint: Optional[str] = None,
                 response_path: Optional[str] = None,
                 system_prompt_template: Optional[str] = None,
                 example_prompt: Optional[str] = None,
                 input_template: Optional[str] = None,
                 output_pattern_template: Optional[str] = None,
                 try_num: PositiveInt = 3,
                 model_params: Dict = {},
                 sampling_params: Dict = {},
                 **kwargs):

Import

from data_juicer.ops.aggregator.entity_attribute_aggregator import EntityAttributeAggregator

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| api_model | str | No | API model name. Default: "gpt-4o" |
| entity | str | Yes | The entity whose attribute is to be summarized |
| attribute | str | Yes | The attribute of the entity to summarize |
| input_key | str | No | Input key in the meta field. Default: "event_description" |
| output_key | str | No | Output key in the aggregation field. Default: "entity_attribute" |
| word_limit | PositiveInt | No | Prompted output length limit. Default: 100 |
| max_token_num | PositiveInt | No | Max total tokens for sub-documents. Default: None (unlimited) |
| api_endpoint | str | No | URL endpoint for the API |
| try_num | PositiveInt | No | Number of retry attempts. Default: 3 |
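To make the role of try_num concrete, here is a minimal retry sketch. The helper call_with_retries and its generate and parse callables are hypothetical and not part of Data-Juicer. The fallback of returning an empty string once all attempts fail is also an assumption, since the page does not say what the operator does in that case.

```python
from typing import Callable, Optional


def call_with_retries(generate: Callable[[], str],
                      parse: Callable[[str], Optional[str]],
                      try_num: int = 3) -> str:
    """Hypothetical helper: call generate() up to try_num times.

    Returns the first non-empty parsed result, or "" if every attempt
    raises or yields nothing parseable (assumed fallback).
    """
    for _ in range(try_num):
        try:
            raw = generate()
        except Exception:
            # Treat any API failure as a failed attempt and try again.
            continue
        result = parse(raw)
        if result:
            return result
    return ""
```

A larger try_num trades extra API calls for a better chance of obtaining a response that matches the expected output pattern.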

Outputs

| Name | Type | Description |
|---|---|---|
| sample[Fields.batch_meta][output_key] | str | Markdown-formatted summary of the entity's attribute |
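The output side of this contract can be sketched as follows. The response layout (a Markdown heading for the entity followed by a sub-heading for the attribute), the regex, the BATCH_META key, and both helper functions are assumptions for illustration. The operator's actual default pattern comes from output_pattern_template and is not reproduced on this page, and whether the stored value keeps the headings is not stated here. This sketch keeps only the body text.

```python
import re
from typing import Dict, Optional

# Hypothetical stand-in for Fields.batch_meta; the real key name may differ.
BATCH_META = "batch_meta"


def parse_summary(raw: str, entity: str, attribute: str) -> Optional[str]:
    """Extract the text following '# <entity>' and '## <attribute>' headings."""
    pattern = r"#\s*{}\s*##\s*{}\s*(.*?)\Z".format(
        re.escape(entity), re.escape(attribute))
    match = re.search(pattern, raw, re.DOTALL)
    return match.group(1).strip() if match else None


def store_summary(sample: Dict, output_key: str, summary: str) -> Dict:
    """Write the summary into the sample's batch metadata under output_key."""
    sample.setdefault(BATCH_META, {})[output_key] = summary
    return sample


raw_response = "# Sun Wukong\n## background\nBorn from a stone egg."
summary = parse_summary(raw_response, "Sun Wukong", "background")
sample = store_summary({}, "entity_attribute", summary or "")
```

A response that does not match the pattern yields None from parse_summary, which is the situation the retry setting is meant to cover.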

Usage Examples

process:
  - entity_attribute_aggregator:
      api_model: "gpt-4o"
      entity: "Sun Wukong"
      attribute: "background"
      word_limit: 200
