Implementation:Datajuicer Data juicer ExtractEntityRelationMapper
| Knowledge Sources | |
|---|---|
| Domains | NLP, Knowledge Graph, Entity Extraction |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Extracts entities and their relationships from text to build knowledge graph structures, using an API-based language model with a structured prompt template adapted from the LightRAG project.
Description
ExtractEntityRelationMapper is a core knowledge graph construction operator that transforms unstructured text into structured entity-relation triples. It uses a detailed prompt template (adapted from LightRAG) that guides an API model through a structured extraction process:
- Entity Identification -- Identifies entities with configurable types (default: organization, person, geo, event) and extracts their names, types, and descriptions
- Relationship Extraction -- Identifies related entity pairs with relationship descriptions, keywords, and numeric strength scores
- Gleaning -- Supports multiple extraction rounds (configurable via max_gleaning) to ensure comprehensive extraction, with an if-loop prompt to determine when to stop
- Structured Parsing -- Uses configurable delimiters (tuple, record, completion) and regex patterns to parse the structured output
The prompt template includes detailed instructions with multi-language examples (English and Chinese). Output entities and relations are cached in the sample's metadata under configurable keys (default: entity and relation).
The operator uses an API model (default: gpt-4o) and supports configurable endpoints, response paths, and sampling parameters.
Usage
Use this operator when building knowledge graphs from text data, or when structured entity-relation information is needed for downstream tasks such as retrieval-augmented generation, graph-based analysis, or knowledge-intensive data processing.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/extract_entity_relation_mapper.py
- Lines: 1-359
Signature
class ExtractEntityRelationMapper(Mapper):
def __init__(
self,
api_model: str = "gpt-4o",
entity_types: List[str] = None,
*,
entity_key: str = MetaKeys.entity,
relation_key: str = MetaKeys.relation,
api_endpoint: Optional[str] = None,
response_path: Optional[str] = None,
prompt_template: Optional[str] = None,
tuple_delimiter: Optional[str] = None,
record_delimiter: Optional[str] = None,
completion_delimiter: Optional[str] = None,
max_gleaning: NonNegativeInt = 1,
continue_prompt: Optional[str] = None,
if_loop_prompt: Optional[str] = None,
entity_pattern: Optional[str] = None,
relation_pattern: Optional[str] = None,
try_num: PositiveInt = 3,
drop_text: bool = False,
model_params: Dict = {},
sampling_params: Dict = {},
**kwargs,
):
Import
from data_juicer.ops.mapper.extract_entity_relation_mapper import ExtractEntityRelationMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| api_model | str | No | API model name. Default: "gpt-4o" |
| entity_types | List[str] | No | Pre-defined entity types. Default: ["organization", "person", "geo", "event"] |
| entity_key | str | No | Metadata key for storing entities. Default: MetaKeys.entity |
| relation_key | str | No | Metadata key for storing relations. Default: MetaKeys.relation |
| api_endpoint | str | No | URL endpoint for the API |
| response_path | str | No | Path to extract content from API response |
| prompt_template | str | No | Custom input prompt template |
| max_gleaning | int | No | Extra max iterations to glean entities and relations. Default: 1 |
| try_num | int | No | Number of retry attempts on error. Default: 3 |
| drop_text | bool | No | Whether to drop the text in output. Default: False |
| model_params | Dict | No | Parameters for initializing the API model |
| sampling_params | Dict | No | Extra parameters for API calls (e.g., temperature, top_p) |
Outputs
| Name | Type | Description |
|---|---|---|
| sample[Fields.meta][entity_key] | list[dict] | List of entity dicts with keys: entity_name, entity_type, entity_description |
| sample[Fields.meta][relation_key] | list[dict] | List of relation dicts with keys: source_entity, target_entity, relation_description, relation_keywords, relation_strength |
Usage Examples
# Basic usage with default model
mapper = ExtractEntityRelationMapper(
api_model="gpt-4o",
entity_types=["organization", "person", "geo", "event"],
)
# With custom entity types and more gleaning rounds
mapper = ExtractEntityRelationMapper(
api_model="gpt-4o",
entity_types=["person", "technology", "organization", "location", "concept"],
max_gleaning=3,
try_num=5,
sampling_params={"temperature": 0.7},
)
# Process a sample
sample = {"text": "Alex works at OpenAI in San Francisco.", Fields.meta: {}}
result = mapper.process_single(sample)
# result[Fields.meta]["entity"] contains extracted entities
# result[Fields.meta]["relation"] contains extracted relations