Implementation:Datajuicer Data juicer ExtractSupportTextMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for extracting the supporting sub-text from an original text based on a given summary.
Description
ExtractSupportTextMapper is a mapper operator that uses an API-based language model to identify and extract the segment of original text that best matches a provided summary. It sends the original text and a summary (from the event_description metadata key by default) to the model with a Chinese system prompt demonstrating the excerpt extraction task. If extraction fails or returns empty, the original summary is used as a fallback. Results are stored under the support_text metadata key.
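The retry-and-fallback behavior described above can be sketched as follows. This is a minimal illustration, not the operator's actual code: `call_model` is a hypothetical stand-in for the API client, the `"meta"` field name and the literal key strings `"event_description"` and `"support_text"` are assumptions standing in for Data-Juicer's own constants, and the system prompt and input template are omitted.

```python
from typing import Callable, Dict


def extract_support_text(
    sample: Dict,
    call_model: Callable[[str, str], str],  # hypothetical: (text, summary) -> excerpt
    summary_key: str = "event_description",  # assumed value of MetaKeys.event_description
    support_text_key: str = "support_text",  # assumed value of MetaKeys.support_text
    try_num: int = 3,
) -> Dict:
    """Store the excerpt of sample["text"] that supports the summary."""
    meta = sample.setdefault("meta", {})  # "meta" is an assumed field name
    summary = meta[summary_key]

    support_text = ""
    for _ in range(try_num):
        try:
            support_text = call_model(sample["text"], summary).strip()
        except Exception:
            # Treat any API error as a failed attempt and retry.
            continue
        if support_text:
            break

    # Fall back to the summary when every attempt failed or returned nothing.
    meta[support_text_key] = support_text or summary
    return sample
```

With a model stub that always returns an empty string, the sample ends up with the summary itself stored under the support-text key, which is the fallback case described above.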
Usage
Use when you need to link summaries or event descriptions back to their source text, providing evidence-based support for extracted information and enabling traceability in text analysis pipelines.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/extract_support_text_mapper.py
Signature
@OPERATORS.register_module("extract_support_text_mapper")
class ExtractSupportTextMapper(Mapper):
    def __init__(self,
                 api_model: str = "gpt-4o",
                 *,
                 summary_key: str = MetaKeys.event_description,
                 support_text_key: str = MetaKeys.support_text,
                 api_endpoint: Optional[str] = None,
                 response_path: Optional[str] = None,
                 system_prompt: Optional[str] = None,
                 input_template: Optional[str] = None,
                 try_num: PositiveInt = 3,
                 drop_text: bool = False,
                 model_params: Dict = {},
                 sampling_params: Dict = {},
                 **kwargs):
Import
from data_juicer.ops.mapper.extract_support_text_mapper import ExtractSupportTextMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| api_model | str | No | API model name, defaults to "gpt-4o" |
| summary_key | str | No | Key name for input summary in meta field, defaults to MetaKeys.event_description |
| support_text_key | str | No | Key name to store output support text in meta field, defaults to MetaKeys.support_text |
| api_endpoint | Optional[str] | No | URL endpoint for the API |
| response_path | Optional[str] | No | Path to extract content from API response |
| system_prompt | Optional[str] | No | System prompt for the task |
| input_template | Optional[str] | No | Template for building the model input |
| try_num | PositiveInt | No | Number of retry attempts on error, defaults to 3 |
| drop_text | bool | No | Whether to drop text from output, defaults to False |
| model_params | Dict | No | Parameters for initializing the API model |
| sampling_params | Dict | No | Extra parameters passed to API call |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with support text stored in meta field |
Usage Examples
process:
  - extract_support_text_mapper:
      api_model: "gpt-4o"
      try_num: 3
      drop_text: false