Implementation:Datajuicer Data juicer MostRelevantEntitiesAggregator
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Aggregation |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for extracting and ranking the entities most closely related to a given entity.
Description
MostRelevantEntitiesAggregator collects sub-documents from sample metadata (default key: event_description) and sends them to an LLM (default: gpt-4o). A Chinese-language system prompt instructs the model to identify entities of the specified type that are related to the given entity, to leave out any that are the same entity as the given one, and to rank the rest by importance in descending order. The model output is matched against a regex pattern to extract a comma-separated list, which is then split on punctuation with split_text_by_punctuation. The results are stored in batch metadata under 'most_relevant_entities'. The operator also supports an optional cap on the total tokens of the sub-documents (max_token_num) and a configurable number of retry attempts (try_num, default: 3).
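The parsing step described above can be sketched as follows. This is a minimal illustration, not the operator's code: the response layout, the pattern `ASSUMED_OUTPUT_PATTERN`, and the helper `split_by_punctuation` (a stand-in for `split_text_by_punctuation`, whose implementation is not shown on this page) are all assumptions made for the example.

```python
import re
from typing import List

# Assumed response layout: a line such as "Entities: A, B, C".
# The operator's real default output_pattern may differ.
ASSUMED_OUTPUT_PATTERN = r"entities:\s*(.*?)(?=\n|$)"


def split_by_punctuation(text: str) -> List[str]:
    """Hypothetical stand-in for split_text_by_punctuation.

    Splits on common ASCII and CJK separators and drops empty pieces.
    """
    parts = re.split(r"[,，、;；。]", text)
    return [p.strip() for p in parts if p.strip()]


def parse_entities(raw_output: str,
                   pattern: str = ASSUMED_OUTPUT_PATTERN) -> List[str]:
    """Extract the ranked entity list from a raw model response."""
    match = re.search(pattern, raw_output, flags=re.IGNORECASE)
    if match is None:
        # No list found: return an empty result instead of raising.
        return []
    return split_by_punctuation(match.group(1))


if __name__ == "__main__":
    response = "Some preamble\nEntities: Tang Sanzang, Zhu Bajie、Sha Wujing\nDone"
    print(parse_entities(response))
```

Because the list order comes straight from the model response, the ranking by importance is preserved only as long as the split keeps the pieces in their original order, which `re.split` does.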
Usage
Use when you need to identify and rank entities most closely related to a given entity from a document collection, for applications such as building knowledge graphs, character relationship maps, or entity-centric data organization.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/aggregator/most_relevant_entities_aggregator.py
Signature
@OPERATORS.register_module("most_relevant_entities_aggregator")
class MostRelevantEntitiesAggregator(Aggregator):
    def __init__(self, api_model: str = "gpt-4o",
                 entity: str = None,
                 query_entity_type: str = None,
                 input_key: str = MetaKeys.event_description,
                 output_key: str = BatchMetaKeys.most_relevant_entities,
                 max_token_num: Optional[PositiveInt] = None,
                 *, api_endpoint: Optional[str] = None,
                 response_path: Optional[str] = None,
                 system_prompt_template: Optional[str] = None,
                 input_template: Optional[str] = None,
                 output_pattern: Optional[str] = None,
                 try_num: PositiveInt = 3,
                 model_params: Dict = {},
                 sampling_params: Dict = {},
                 **kwargs):
Import
from data_juicer.ops.aggregator.most_relevant_entities_aggregator import MostRelevantEntitiesAggregator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| api_model | str | No | API model name. Default: "gpt-4o" |
| entity | str | Yes | The given entity to find related entities for |
| query_entity_type | str | Yes | The type of entities to query (e.g., "person", "location") |
| input_key | str | No | Input key in the meta field. Default: "event_description" |
| output_key | str | No | Output key in the aggregation field. Default: "most_relevant_entities" |
| max_token_num | Optional[PositiveInt] | No | Max total tokens for sub-documents. Default: None (unlimited) |
| api_endpoint | Optional[str] | No | URL endpoint for the API. Default: None |
| try_num | PositiveInt | No | Number of retry attempts. Default: 3 |
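To make the roles of max_token_num and try_num concrete, here is a rough sketch of how a token budget and a retry loop could behave. It does not reproduce the operator's internals: the truncation strategy, the whitespace-based `count_tokens`, and the choice to retry on any exception are assumptions, and all function names are hypothetical.

```python
from typing import Callable, List, Optional


def count_tokens(text: str) -> int:
    """Hypothetical token counter: whitespace-separated words stand in
    for a real tokenizer."""
    return len(text.split())


def select_within_budget(sub_docs: List[str],
                         max_token_num: Optional[int]) -> List[str]:
    """Keep sub-documents in order until the next one would exceed the budget.

    max_token_num=None means no limit, mirroring the documented default.
    """
    if max_token_num is None:
        return list(sub_docs)
    kept: List[str] = []
    used = 0
    for doc in sub_docs:
        cost = count_tokens(doc)
        if used + cost > max_token_num:
            break
        kept.append(doc)
        used += cost
    return kept


def call_with_retries(call: Callable[[], str], try_num: int = 3) -> str:
    """Invoke `call` up to try_num times; return '' if every attempt raises.

    Assumption: a failed attempt is signalled by an exception.
    """
    for _ in range(try_num):
        try:
            return call()
        except Exception:
            continue
    return ""


if __name__ == "__main__":
    docs = ["a b c", "d e", "f g h i"]
    print(select_within_budget(docs, 5))  # the third document no longer fits
```

A budget of this kind trades recall for cost: sub-documents past the cutoff never reach the model, so entities mentioned only there cannot appear in the ranking.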
Outputs
| Name | Type | Description |
|---|---|---|
| sample[Fields.batch_meta][output_key] | list | Ranked list of most relevant entities sorted by importance in descending order |
Usage Examples
process:
  - most_relevant_entities_aggregator:
      api_model: "gpt-4o"
      entity: "Sun Wukong"
      query_entity_type: "person"