Implementation:Datajuicer Data juicer EntityAttributeAggregator
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Aggregation |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for summarizing a specific attribute of an entity from a collection of related documents.
Description
EntityAttributeAggregator uses an LLM API to aggregate and summarize a specific attribute of a given entity from a collection of related documents. It collects sub-documents from the input_key (default: event_description) in the sample metadata, splits them into groups that fit within the token limit using avg_split_string_list_under_limit, and then calls NestedAggregator to summarize recursively if needed. The final summary is generated by an LLM (default: gpt-4o) guided by a system prompt; the prompt templates are written in Chinese and specify the entity and attribute. The model output is parsed with a regex pattern and stored in the batch metadata under output_key. The operator supports a configurable word limit, retry logic, and customizable prompts.
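The split-then-summarize flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Data-Juicer's implementation: `count_tokens`, `split_under_limit`, `aggregate`, and the `summarize` callable are hypothetical names, whitespace word counts stand in for a real tokenizer, and a simple greedy packer replaces the even split performed by avg_split_string_list_under_limit.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Hypothetical token counter: whitespace words stand in for real tokens."""
    return len(text.split())


def split_under_limit(docs: List[str], limit: int) -> List[List[str]]:
    """Greedily pack consecutive documents into groups of at most `limit` tokens.

    A single document longer than `limit` ends up in a group of its own.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    used = 0
    for doc in docs:
        n = count_tokens(doc)
        if current and used + n > limit:
            # The current group is full, so close it and start a new one
            groups.append(current)
            current, used = [], 0
        current.append(doc)
        used += n
    if current:
        groups.append(current)
    return groups


def aggregate(
    docs: List[str], limit: int, summarize: Callable[[List[str]], str]
) -> str:
    """Summarize each group, then recurse on the partial summaries."""
    if not docs:
        return ""
    groups = split_under_limit(docs, limit)
    if len(groups) == 1:
        return summarize(groups[0])
    partial = [summarize(group) for group in groups]
    if len(partial) >= len(docs):
        # Every document was its own group, so another round cannot shrink the list
        return summarize(partial)
    return aggregate(partial, limit, summarize)
```

In a real pipeline the `summarize` callable would wrap the LLM request; here it is left as a parameter so the control flow can be exercised with any stub.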
Usage
Use when you need to extract and summarize a specific attribute (such as background, personality, or relationships) of a particular entity from a collection of related documents for building structured knowledge bases or character profiles.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/aggregator/entity_attribute_aggregator.py
Signature
@OPERATORS.register_module("entity_attribute_aggregator")
class EntityAttributeAggregator(Aggregator):
    def __init__(self, api_model: str = "gpt-4o",
                 entity: str = None,
                 attribute: str = None,
                 input_key: str = MetaKeys.event_description,
                 output_key: str = BatchMetaKeys.entity_attribute,
                 word_limit: PositiveInt = 100,
                 max_token_num: Optional[PositiveInt] = None,
                 *,
                 api_endpoint: Optional[str] = None,
                 response_path: Optional[str] = None,
                 system_prompt_template: Optional[str] = None,
                 example_prompt: Optional[str] = None,
                 input_template: Optional[str] = None,
                 output_pattern_template: Optional[str] = None,
                 try_num: PositiveInt = 3,
                 model_params: Dict = {},
                 sampling_params: Dict = {},
                 **kwargs):
Import
from data_juicer.ops.aggregator.entity_attribute_aggregator import EntityAttributeAggregator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| api_model | str | No | API model name. Default: "gpt-4o" |
| entity | str | Yes | The entity whose attribute is to be summarized |
| attribute | str | Yes | The attribute of the entity to summarize |
| input_key | str | No | Input key in the meta field. Default: "event_description" |
| output_key | str | No | Output key in the aggregation field. Default: "entity_attribute" |
| word_limit | PositiveInt | No | Output length limit in words, requested through the prompt. Default: 100 |
| max_token_num | PositiveInt | No | Max total tokens for sub-documents. Default: None (unlimited) |
| api_endpoint | str | No | URL endpoint for the API |
| try_num | PositiveInt | No | Number of retry attempts. Default: 3 |
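The try_num parameter bounds how many attempts are made to obtain a usable answer. A minimal sketch of such a bounded retry loop is shown here; `call_with_retries`, `call_model`, and `parse` are hypothetical names rather than Data-Juicer API, and the sketch assumes that both a failed request and an empty parsed result count as one consumed attempt.

```python
from typing import Callable, Optional


def call_with_retries(
    call_model: Callable[[], str],
    parse: Callable[[str], Optional[str]],
    try_num: int = 3,
) -> str:
    """Call the model up to `try_num` times and return the first usable answer.

    Returns an empty string if every attempt fails or yields nothing usable.
    """
    for _ in range(try_num):
        try:
            raw = call_model()
        except Exception:
            # Assumption: a failed request simply uses up one attempt
            continue
        result = parse(raw)
        if result:
            return result
    return ""
```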
Outputs
| Name | Type | Description |
|---|---|---|
| sample[Fields.batch_meta][output_key] | str | Markdown-formatted summary of the entity's attribute |
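How a regex-parsed summary might end up under the output key can be sketched as follows. The pattern template, the helper names `parse_summary` and `store_summary`, and the `"batch_meta"` key string used in the usage note are all assumptions made for illustration; the operator's actual default pattern and field constants are not reproduced here.

```python
import re
from typing import Any, Dict

# Hypothetical pattern template with entity and attribute placeholders
OUTPUT_PATTERN_TEMPLATE = r"#\s*{entity}\s*##\s*{attribute}\s*(.*?)\Z"


def parse_summary(raw: str, entity: str, attribute: str) -> str:
    """Extract the text that follows the entity and attribute headings."""
    pattern = OUTPUT_PATTERN_TEMPLATE.format(
        entity=re.escape(entity), attribute=re.escape(attribute)
    )
    match = re.search(pattern, raw, re.DOTALL)
    return match.group(1).strip() if match else ""


def store_summary(
    sample: Dict[str, Any], batch_meta_key: str, output_key: str, summary: str
) -> Dict[str, Any]:
    """Write the summary into the sample's batch metadata under `output_key`."""
    sample.setdefault(batch_meta_key, {})[output_key] = summary
    return sample
```

For example, calling `parse_summary` on a response made of a `# Sun Wukong` heading, a `## background` heading, and a body paragraph would return only the body text, which `store_summary(sample, "batch_meta", "entity_attribute", text)` then places in the sample.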
Usage Examples
process:
  - entity_attribute_aggregator:
      api_model: "gpt-4o"
      entity: "Sun Wukong"
      attribute: "background"
      word_limit: 200