Implementation:Datajuicer Data juicer EntityAttributeAggregator

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Processing, Aggregation
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for summarizing a specific attribute of an entity from a collection of related documents.

Description

EntityAttributeAggregator uses an LLM API to summarize one attribute of a given entity from a collection of related documents. It collects the sub-documents stored under input_key (default: event_description) in the sample metadata and splits them with avg_split_string_list_under_limit so that each group fits within the token limit. It then calls NestedAggregator, which summarizes recursively if needed. The final summary is generated by the LLM (default: gpt-4o), guided by a system prompt built from Chinese-language templates that name the entity and the attribute. The response is parsed with a regex pattern and stored in the batch metadata under output_key. The word limit, the number of retries, and the prompts are all configurable.
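
The split-then-summarize flow described above can be sketched roughly as follows. This is an illustration only, not the Data-Juicer implementation: count_tokens, split_under_limit, nested_summarize, and first_words are hypothetical names, the greedy packing is a simple stand-in for avg_split_string_list_under_limit (whose exact strategy is not shown on this page), and the summarize callable stands in for the LLM call made through NestedAggregator.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Hypothetical token counter: one token per whitespace-separated word."""
    return len(text.split())


def split_under_limit(docs: List[str], max_tokens: int) -> List[List[str]]:
    """Greedily pack documents into groups whose token total stays within max_tokens.

    A single document longer than the limit ends up in a group of its own.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    used = 0
    for doc in docs:
        n = count_tokens(doc)
        # Close the current group when adding this document would exceed the limit.
        if current and used + n > max_tokens:
            groups.append(current)
            current, used = [], 0
        current.append(doc)
        used += n
    if current:
        groups.append(current)
    return groups


def nested_summarize(
    docs: List[str],
    max_tokens: int,
    summarize: Callable[[List[str]], str],
) -> str:
    """Summarize group by group, recursing on the partial summaries."""
    groups = split_under_limit(docs, max_tokens)
    # One group fits in a single call. If every document sits alone in its
    # group, packing made no progress, so summarize everything directly.
    if len(groups) <= 1 or len(groups) == len(docs):
        return summarize(docs)
    partial = [summarize(group) for group in groups]
    return nested_summarize(partial, max_tokens, summarize)


def first_words(group: List[str]) -> str:
    """Stand-in for the LLM call: keeps the first word of each document."""
    return " ".join(doc.split()[0] for doc in group if doc.split())


docs = ["a b c", "d e f", "g h i", "j k l"]
print(split_under_limit(docs, 6))
print(nested_summarize(docs, 6, first_words))
```

In this sketch the recursion always terminates because each level produces strictly fewer documents than it received; a real summarizer would be expected to shrink its input as well.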

Usage

Use when you need to extract and summarize a specific attribute (such as background, personality, or relationships) of a particular entity from a collection of related documents for building structured knowledge bases or character profiles.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/aggregator/entity_attribute_aggregator.py

Signature

@OPERATORS.register_module("entity_attribute_aggregator")
class EntityAttributeAggregator(Aggregator):
    def __init__(self, api_model: str = "gpt-4o",
                 entity: str = None,
                 attribute: str = None,
                 input_key: str = MetaKeys.event_description,
                 output_key: str = BatchMetaKeys.entity_attribute,
                 word_limit: PositiveInt = 100,
                 max_token_num: Optional[PositiveInt] = None,
                 *, api_endpoint: Optional[str] = None,
                 response_path: Optional[str] = None,
                 system_prompt_template: Optional[str] = None,
                 example_prompt: Optional[str] = None,
                 input_template: Optional[str] = None,
                 output_pattern_template: Optional[str] = None,
                 try_num: PositiveInt = 3,
                 model_params: Dict = {},
                 sampling_params: Dict = {},
                 **kwargs):

Import

from data_juicer.ops.aggregator.entity_attribute_aggregator import EntityAttributeAggregator

I/O Contract

Inputs

Name	Type	Required	Description
api_model	str	No	API model name. Default: "gpt-4o"
entity	str	Yes	The entity whose attribute is to be summarized
attribute	str	Yes	The attribute of the entity to summarize
input_key	str	No	Input key in the meta field. Default: "event_description"
output_key	str	No	Output key in the aggregation field. Default: "entity_attribute"
word_limit	PositiveInt	No	Word limit for the summary, as stated in the prompt. Default: 100
max_token_num	Optional[PositiveInt]	No	Max total tokens for sub-documents. Default: None (unlimited)
api_endpoint	Optional[str]	No	URL endpoint for the API. Default: None
try_num	PositiveInt	No	Number of retry attempts. Default: 3

Outputs

Name	Type	Description
sample[Fields.batch_meta][output_key]	str	Markdown-formatted summary of the entity's attribute
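
The description mentions that the model response is parsed with a regex pattern, retried on failure, and stored under output_key. A minimal sketch of that loop is shown here under stated assumptions: build_pattern, summarize_with_retry, and the BATCH_META constant are hypothetical, and the pattern assumes a markdown reply of the form "# entity", "## attribute", then the summary text. The operator's actual default pattern may differ.

```python
import re
from typing import Callable

# Hypothetical stand-in for Fields.batch_meta; the real constant lives in Data-Juicer.
BATCH_META = "batch_meta"


def build_pattern(entity: str, attribute: str) -> re.Pattern:
    """Assumed output format: '# <entity>', '## <attribute>', then the summary."""
    return re.compile(
        r"#\s*" + re.escape(entity) + r"\s*\n##\s*" + re.escape(attribute) + r"\s*\n(.*)",
        re.DOTALL,
    )


def summarize_with_retry(
    call_model: Callable[[], str],
    pattern: re.Pattern,
    try_num: int = 3,
) -> str:
    """Call the model up to try_num times and return the first parsable summary."""
    for _ in range(try_num):
        try:
            raw = call_model()
        except Exception:
            # Treat API errors like parsing failures: try again.
            continue
        match = pattern.search(raw)
        if match and match.group(1).strip():
            return match.group(1).strip()
    # Every attempt failed; return an empty summary.
    return ""


# Fake model: the first reply is unparsable, the second follows the assumed format.
replies = iter(["garbage", "# Sun Wukong\n## background\nBorn from a stone.\n"])
summary = summarize_with_retry(
    lambda: next(replies), build_pattern("Sun Wukong", "background")
)

sample = {BATCH_META: {}}
sample[BATCH_META]["entity_attribute"] = summary
print(sample)
```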

Usage Examples

process:
  - entity_attribute_aggregator:
      api_model: "gpt-4o"
      entity: "Sun Wukong"
      attribute: "background"
      word_limit: 200
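
The same configuration can presumably be expressed in Python by passing the recipe values as keyword arguments to the constructor shown in the Signature section. The sketch below is hedged accordingly: AGGREGATOR_KWARGS and build_aggregator are hypothetical names, the import is deferred so the snippet loads without data_juicer installed, and actually constructing the operator is assumed to require a working Data-Juicer installation and credentials for the chosen API model.

```python
# Keyword arguments mirroring the YAML recipe above.
AGGREGATOR_KWARGS = {
    "api_model": "gpt-4o",
    "entity": "Sun Wukong",
    "attribute": "background",
    "word_limit": 200,
}


def build_aggregator(**overrides):
    """Instantiate the operator with the recipe values plus any overrides.

    Requires data_juicer to be installed; the import is done lazily so that
    this module can be loaded without it.
    """
    from data_juicer.ops.aggregator.entity_attribute_aggregator import (
        EntityAttributeAggregator,
    )

    return EntityAttributeAggregator(**{**AGGREGATOR_KWARGS, **overrides})
```

Calling build_aggregator(try_num=5), for example, would keep the recipe values and override only the retry count.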

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
