Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Datajuicer Data juicer ExtractEntityRelationMapper

From Leeroopedia
Knowledge Sources
Domains NLP, Knowledge Graph, Entity Extraction
Last Updated 2026-02-14 16:00 GMT

Overview

Extracts entities and their relationships from text to build knowledge graph structures, using an API-based language model with a structured prompt template adapted from the LightRAG project.

Description

ExtractEntityRelationMapper is a core knowledge graph construction operator that transforms unstructured text into structured entity-relation triples. It uses a detailed prompt template (adapted from LightRAG) that guides an API model through a structured extraction process:

  1. Entity Identification -- Identifies entities with configurable types (default: organization, person, geo, event) and extracts their names, types, and descriptions
  2. Relationship Extraction -- Identifies related entity pairs with relationship descriptions, keywords, and numeric strength scores
  3. Gleaning -- Supports multiple extraction rounds (configurable via max_gleaning) to ensure comprehensive extraction, with an if-loop prompt to determine when to stop
  4. Structured Parsing -- Uses configurable delimiters (tuple, record, completion) and regex patterns to parse the structured output

The prompt template includes detailed instructions with multi-language examples (English and Chinese). Output entities and relations are cached in the sample's metadata under configurable keys (default: entity and relation).

The operator uses an API model (default: gpt-4o) and supports configurable endpoints, response paths, and sampling parameters.

Usage

Use this operator when building knowledge graphs from text data, or when structured entity-relation information is needed for downstream tasks such as retrieval-augmented generation, graph-based analysis, or knowledge-intensive data processing.

Code Reference

Source Location

  • Repository: Datajuicer_Data_juicer
  • File: data_juicer/ops/mapper/extract_entity_relation_mapper.py
  • Lines: 1-359

Signature

class ExtractEntityRelationMapper(Mapper):
    def __init__(
        self,
        api_model: str = "gpt-4o",
        entity_types: List[str] = None,
        *,
        entity_key: str = MetaKeys.entity,
        relation_key: str = MetaKeys.relation,
        api_endpoint: Optional[str] = None,
        response_path: Optional[str] = None,
        prompt_template: Optional[str] = None,
        tuple_delimiter: Optional[str] = None,
        record_delimiter: Optional[str] = None,
        completion_delimiter: Optional[str] = None,
        max_gleaning: NonNegativeInt = 1,
        continue_prompt: Optional[str] = None,
        if_loop_prompt: Optional[str] = None,
        entity_pattern: Optional[str] = None,
        relation_pattern: Optional[str] = None,
        try_num: PositiveInt = 3,
        drop_text: bool = False,
        model_params: Dict = {},
        sampling_params: Dict = {},
        **kwargs,
    ):

Import

from data_juicer.ops.mapper.extract_entity_relation_mapper import ExtractEntityRelationMapper

I/O Contract

Inputs

Name Type Required Description
api_model str No API model name. Default: "gpt-4o"
entity_types List[str] No Pre-defined entity types. Default: ["organization", "person", "geo", "event"]
entity_key str No Metadata key for storing entities. Default: MetaKeys.entity
relation_key str No Metadata key for storing relations. Default: MetaKeys.relation
api_endpoint str No URL endpoint for the API
response_path str No Path to extract content from API response
prompt_template str No Custom input prompt template
max_gleaning int No Extra max iterations to glean entities and relations. Default: 1
try_num int No Number of retry attempts on error. Default: 3
drop_text bool No Whether to drop the text in output. Default: False
model_params Dict No Parameters for initializing the API model
sampling_params Dict No Extra parameters for API calls (e.g., temperature, top_p)

Outputs

Name Type Description
sample[Fields.meta][entity_key] list[dict] List of entity dicts with keys: entity_name, entity_type, entity_description
sample[Fields.meta][relation_key] list[dict] List of relation dicts with keys: source_entity, target_entity, relation_description, relation_keywords, relation_strength

Usage Examples

# Basic usage with default model
mapper = ExtractEntityRelationMapper(
    api_model="gpt-4o",
    entity_types=["organization", "person", "geo", "event"],
)

# With custom entity types and more gleaning rounds
mapper = ExtractEntityRelationMapper(
    api_model="gpt-4o",
    entity_types=["person", "technology", "organization", "location", "concept"],
    max_gleaning=3,
    try_num=5,
    sampling_params={"temperature": 0.7},
)

# Process a sample
sample = {"text": "Alex works at OpenAI in San Francisco.", Fields.meta: {}}
result = mapper.process_single(sample)
# result[Fields.meta]["entity"] contains extracted entities
# result[Fields.meta]["relation"] contains extracted relations

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment