Implementation:Datajuicer Data juicer MostRelevantEntitiesAggregator

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Processing, Aggregation
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for extracting and ranking the entities most closely related to a given entity.

Description

MostRelevantEntitiesAggregator collects sub-documents from a field of the sample metadata (default: event_description) and sends them to an LLM (default: gpt-4o). A Chinese-language system prompt instructs the model to identify entities of the specified type that are related to the given entity, to leave out any entity of that type that is the same as the given entity, and to rank the results by importance in descending order. The response is parsed with a regex pattern to extract a comma-separated list, which split_text_by_punctuation then splits into individual entities. The resulting list is stored in batch metadata under 'most_relevant_entities' by default. An optional token limit (max_token_num) and a retry count (try_num) are supported.
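The parsing step described above can be sketched as follows. This is a minimal illustration, not the operator's code: the `Entities:` response format, the `OUTPUT_PATTERN` regex, and the `split_by_punctuation` helper are all assumptions standing in for the operator's actual output pattern and for split_text_by_punctuation.

```python
import re

# Hypothetical pattern: assumes the model answers with a line such as
# "Entities: A, B, C". The operator's real default pattern may differ.
OUTPUT_PATTERN = r"Entities:\s*(.+)"

# Hypothetical stand-in for split_text_by_punctuation: splits on common
# ASCII and CJK list separators.
SEPARATORS = r"[,，、;；]"


def split_by_punctuation(text):
    """Split a delimited string and drop empty or whitespace-only items."""
    return [part.strip() for part in re.split(SEPARATORS, text) if part.strip()]


def parse_ranked_entities(raw_output):
    """Return the ranked entity list, or [] when the pattern does not match."""
    match = re.search(OUTPUT_PATTERN, raw_output)
    if match is None:
        return []
    return split_by_punctuation(match.group(1))


raw = "Here is the ranking.\nEntities: Tang Sanzang, Zhu Bajie、Sha Wujing\n"
print(parse_ranked_entities(raw))
```

Returning an empty list on a failed match gives the caller a simple signal that the response was unusable, which is one way a retry loop could decide to ask again.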

Usage

Use when you need to identify and rank entities most closely related to a given entity from a document collection, for applications such as building knowledge graphs, character relationship maps, or entity-centric data organization.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/aggregator/most_relevant_entities_aggregator.py

Signature

@OPERATORS.register_module("most_relevant_entities_aggregator")
class MostRelevantEntitiesAggregator(Aggregator):
    def __init__(self, api_model: str = "gpt-4o",
                 entity: str = None,
                 query_entity_type: str = None,
                 input_key: str = MetaKeys.event_description,
                 output_key: str = BatchMetaKeys.most_relevant_entities,
                 max_token_num: Optional[PositiveInt] = None,
                 *, api_endpoint: Optional[str] = None,
                 response_path: Optional[str] = None,
                 system_prompt_template: Optional[str] = None,
                 input_template: Optional[str] = None,
                 output_pattern: Optional[str] = None,
                 try_num: PositiveInt = 3,
                 model_params: Dict = {},
                 sampling_params: Dict = {},
                 **kwargs):

Import

from data_juicer.ops.aggregator.most_relevant_entities_aggregator import MostRelevantEntitiesAggregator

I/O Contract

Inputs

Name	Type	Required	Description
api_model	str	No	API model name. Default: "gpt-4o"
entity	str	Yes	The given entity to find related entities for
query_entity_type	str	Yes	The type of entities to query (e.g., "person", "location")
input_key	str	No	Input key in the meta field. Default: "event_description"
output_key	str	No	Output key in the aggregation field. Default: "most_relevant_entities"
max_token_num	PositiveInt	No	Max total tokens for sub-documents. Default: None (unlimited)
api_endpoint	str	No	URL endpoint for the API
try_num	PositiveInt	No	Number of retry attempts. Default: 3
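To make the roles of max_token_num and try_num more concrete, here is a rough sketch of how a token budget and a retry count could shape a single query. Everything in it is an assumption for illustration: `select_under_budget` and `call_with_retries` are hypothetical helpers, whitespace splitting stands in for a real tokenizer, and keeping the leading sub-documents is only one possible truncation strategy. The operator itself may select documents and handle failures differently.

```python
def select_under_budget(sub_docs, max_token_num=None, count_tokens=None):
    """Keep leading sub-documents while the running token total fits the budget."""
    if max_token_num is None:
        return list(sub_docs)  # no limit configured: keep everything
    # Assumption: whitespace word count as a crude token estimate.
    count_tokens = count_tokens or (lambda text: len(text.split()))
    kept, total = [], 0
    for doc in sub_docs:
        cost = count_tokens(doc)
        if total + cost > max_token_num:
            break  # adding this document would exceed the budget
        kept.append(doc)
        total += cost
    return kept


def call_with_retries(call_model, parse, prompt, try_num=3):
    """Call the model up to try_num times; return the first non-empty parse."""
    for _ in range(try_num):
        try:
            result = parse(call_model(prompt))
        except Exception:
            continue  # a failed call or parse uses up one attempt
        if result:
            return result
    return []  # every attempt failed or produced nothing usable


docs = ["a b c", "d e", "f g h i"]
print(select_under_budget(docs, max_token_num=5))
```

Here `call_model` and `parse` are passed in as plain callables so the sketch stays independent of any particular API client.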

Outputs

Name	Type	Description
sample[Fields.batch_meta][output_key]	list	Ranked list of most relevant entities sorted by importance in descending order
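As a small illustration of the output contract, the ranked list can be read back from an aggregated sample and truncated to the top few entries. The string keys below are hypothetical stand-ins for Fields.batch_meta and the default output_key; in real code the constants should be imported from Data-Juicer rather than hard-coded, and `top_k_entities` is an illustrative helper, not part of the library.

```python
# Hypothetical stand-ins for Fields.batch_meta and the default output_key.
BATCH_META = "batch_meta"
OUTPUT_KEY = "most_relevant_entities"


def top_k_entities(sample, k, batch_meta_key=BATCH_META, output_key=OUTPUT_KEY):
    """Return the k highest-ranked entities, or [] if the key is missing."""
    ranked = sample.get(batch_meta_key, {}).get(output_key, [])
    # The list is already sorted by descending importance, so a prefix
    # slice yields the most relevant entries.
    return ranked[:k]


sample = {BATCH_META: {OUTPUT_KEY: ["Tang Sanzang", "Zhu Bajie", "Sha Wujing"]}}
print(top_k_entities(sample, 2))
```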

Usage Examples

process:
  - most_relevant_entities_aggregator:
      api_model: "gpt-4o"
      entity: "Sun Wukong"
      query_entity_type: "person"

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
