Implementation:Datajuicer Data juicer MostRelevantEntitiesAggregator

From Leeroopedia
Knowledge Sources
Domains: Data_Processing, Aggregation
Last Updated: 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for extracting and ranking the entities most closely related to a given entity.

Description

MostRelevantEntitiesAggregator collects sub-documents from sample metadata (default key: event_description) and sends them to an LLM (default: gpt-4o). A Chinese-language system prompt instructs the model to identify related entities of the specified type, to exclude entities of that type that are the same as the given entity, and to rank the results by importance in descending order. The response is parsed with a regex pattern to extract a comma-separated list, which is then split on punctuation using split_text_by_punctuation. The results are stored in batch metadata under 'most_relevant_entities'. The operator also supports an optional token limit (max_token_num) and retries (try_num).
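The parsing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the operator's actual code: the response layout (a "## 列表" heading followed by the list), the OUTPUT_PATTERN regex, and the split_by_punctuation helper are all assumptions standing in for the operator's default output_pattern and for split_text_by_punctuation.

```python
import re
from typing import List

# Hypothetical stand-in for the operator's default output_pattern: it captures
# everything that follows a "## 列表" ("list") heading in the model response.
OUTPUT_PATTERN = r"##\s*列表\s*(.*?)\Z"


def split_by_punctuation(text: str) -> List[str]:
    """Split on common Chinese and English separators and drop empty pieces.

    Hypothetical helper; it only approximates split_text_by_punctuation.
    """
    parts = re.split(r"[,，、;；。\n]+", text)
    return [p.strip() for p in parts if p.strip()]


def parse_entities(raw_output: str) -> List[str]:
    """Extract the ranked entity list from a raw model response."""
    match = re.search(OUTPUT_PATTERN, raw_output, re.DOTALL)
    if match is None:
        # No list section found: treat the response as unparseable.
        return []
    return split_by_punctuation(match.group(1))


# Made-up response text, used only to exercise the parser.
response = "## 分析\n(model reasoning here)\n## 列表\n唐僧, 猪八戒，沙僧"
entities = parse_entities(response)
print(entities)
```

Returning an empty list for an unparseable response lets a caller decide whether to retry the request, which fits the retry behaviour mentioned above.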

Usage

Use this operator when you need to identify and rank the entities most closely related to a given entity across a document collection, for example when building knowledge graphs, character relationship maps, or entity-centric data organization.

Code Reference

Source Location

Signature

@OPERATORS.register_module("most_relevant_entities_aggregator")
class MostRelevantEntitiesAggregator(Aggregator):
    def __init__(self, api_model: str = "gpt-4o",
                 entity: str = None,
                 query_entity_type: str = None,
                 input_key: str = MetaKeys.event_description,
                 output_key: str = BatchMetaKeys.most_relevant_entities,
                 max_token_num: Optional[PositiveInt] = None,
                 *, api_endpoint: Optional[str] = None,
                 response_path: Optional[str] = None,
                 system_prompt_template: Optional[str] = None,
                 input_template: Optional[str] = None,
                 output_pattern: Optional[str] = None,
                 try_num: PositiveInt = 3,
                 model_params: Dict = {},
                 sampling_params: Dict = {},
                 **kwargs):

Import

from data_juicer.ops.aggregator.most_relevant_entities_aggregator import MostRelevantEntitiesAggregator

I/O Contract

Inputs

Name | Type | Required | Description
api_model | str | No | API model name. Default: "gpt-4o"
entity | str | Yes | The given entity to find related entities for
query_entity_type | str | Yes | The type of entities to query (e.g., "person", "location")
input_key | str | No | Input key in the meta field. Default: "event_description"
output_key | str | No | Output key in the aggregation field. Default: "most_relevant_entities"
max_token_num | PositiveInt | No | Max total tokens for sub-documents. Default: None (unlimited)
api_endpoint | str | No | URL endpoint for the API
try_num | PositiveInt | No | Number of retry attempts. Default: 3
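The roles of max_token_num and try_num can be illustrated with a small, self-contained sketch. It is written under stated assumptions and is not the operator's implementation: trim_to_budget, call_with_retries, the whitespace token counter, and the flaky_model stub are all hypothetical, and the library may choose which sub-documents to keep in a different way.

```python
from typing import Callable, List, Optional


def trim_to_budget(docs: List[str],
                   max_token_num: Optional[int],
                   count_tokens: Callable[[str], int]) -> List[str]:
    """Keep leading documents while the running token total fits the budget."""
    if max_token_num is None:
        # None means no limit, matching the documented default.
        return list(docs)
    kept, used = [], 0
    for doc in docs:
        n = count_tokens(doc)
        if used + n > max_token_num:
            break
        kept.append(doc)
        used += n
    return kept


def call_with_retries(call_model: Callable[[str], str],
                      parse: Callable[[str], List[str]],
                      prompt: str,
                      try_num: int = 3) -> List[str]:
    """Call the model up to try_num times until parsing yields a non-empty list."""
    for _ in range(try_num):
        try:
            result = parse(call_model(prompt))
        except Exception:
            # Treat any API or parsing error as a failed attempt and try again.
            continue
        if result:
            return result
    return []


# Whitespace word count as a crude stand-in for a real tokenizer.
docs = ["a b c", "d e", "f g h i"]
trimmed = trim_to_budget(docs, 5, lambda s: len(s.split()))

attempts = {"n": 0}


def flaky_model(prompt: str) -> str:
    """Stub model that fails once, then returns a comma-separated list."""
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise RuntimeError("transient API error")
    return "x, y"


entities = call_with_retries(
    flaky_model, lambda s: [p.strip() for p in s.split(",")], "prompt", try_num=3)
print(trimmed, entities, attempts["n"])
```

Retrying on an empty parse as well as on exceptions covers the case where the model responds but does not follow the expected output format.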

Outputs

Name | Type | Description
sample[Fields.batch_meta][output_key] | list | Ranked list of most relevant entities sorted by importance in descending order
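Because the output is a list ordered from most to least important, downstream code can take a prefix of it to get the top-k entities. The sketch below assumes plain string keys: BATCH_META is a hypothetical placeholder for the value of Fields.batch_meta, OUTPUT_KEY uses the documented default output key, and the sample content is made up for illustration.

```python
from typing import List

# Hypothetical placeholder; in Data-Juicer the key comes from Fields.batch_meta.
BATCH_META = "batch_meta"
# Documented default output key.
OUTPUT_KEY = "most_relevant_entities"


def top_k_entities(sample: dict, k: int) -> List[str]:
    """Return the k highest-ranked entities, or fewer if the list is shorter."""
    ranked = sample.get(BATCH_META, {}).get(OUTPUT_KEY, [])
    return ranked[:k]


# Made-up aggregated sample, shaped like the output described above.
sample = {BATCH_META: {OUTPUT_KEY: ["Tang Sanzang", "Zhu Bajie", "Sha Wujing"]}}
print(top_k_entities(sample, 2))
```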

Usage Examples

process:
  - most_relevant_entities_aggregator:
      api_model: "gpt-4o"
      entity: "Sun Wukong"
      query_entity_type: "person"
