Implementation: Data-Juicer OptimizeQAMapper Process
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Quality, LLM |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Concrete tool from the Data-Juicer framework for optimizing question-answer pairs via LLM-based rewriting.
Description
OptimizeQAMapper is a Mapper operator that rewrites QA pairs to improve their quality, complexity, and educational value. It supports both API-based and HuggingFace model backends, vLLM acceleration, configurable system prompts, and retry logic. The optimization prompt instructs the LLM to enhance questions and answers while preserving factual accuracy.
Usage
Use this operator in a pipeline after generation and/or calibration steps. It can run against local HuggingFace models or API endpoints.
Code Reference
Source Location
- Repository: data-juicer
- File: data_juicer/ops/mapper/optimize_qa_mapper.py
- Lines: L23-178
Signature
@OPERATORS.register_module('optimize_qa_mapper')
class OptimizeQAMapper(Mapper):
def __init__(
self,
api_or_hf_model: str = None,
is_hf_model: bool = True,
*,
system_prompt: str = None,
input_template: str = None,
output_pattern: str = None,
enable_vllm: bool = False,
model_params: dict = None,
sampling_params: dict = None,
try_num: PositiveInt = 3,
**kwargs
):
"""
Args:
api_or_hf_model: Model name/path.
is_hf_model: True for HuggingFace, False for API.
system_prompt: Optimization instructions.
input_template: Template for formatting input QA.
output_pattern: Regex for parsing optimized output.
enable_vllm: Use vLLM engine.
model_params: Model loading parameters.
sampling_params: Generation parameters.
try_num: Retry count.
"""
def process_single(self, sample):
"""
Optimize a single QA pair.
Args:
sample: Dict with query_key and response_key.
Returns:
sample with optimized query and response.
"""
Import
from data_juicer.ops.mapper.optimize_qa_mapper import OptimizeQAMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| api_or_hf_model | str | Yes | Model name/path for optimization |
| sample[query_key] | str | Yes | Question to optimize |
| sample[response_key] | str | Yes | Answer to optimize |
| enable_vllm | bool | No | Use vLLM engine (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| sample[query_key] | str | Optimized question |
| sample[response_key] | str | Optimized answer |
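The in-place contract in the tables above (same keys in, same keys out, other fields untouched) can be illustrated with a stub that mimics `process_single`. The stub's "optimization" is a placeholder, and the literal keys `'query'`/`'response'` stand in for the configured `query_key`/`response_key`:

```python
def process_single_stub(sample):
    # Mimics the operator's I/O contract: read query_key/response_key,
    # write the optimized text back to the same keys, return the sample.
    sample['query'] = sample['query'].strip() + ' (optimized)'
    sample['response'] = sample['response'].strip() + ' (optimized)'
    return sample

sample = {'query': ' What is DNA? ', 'response': ' A molecule. ', 'meta': {}}
out = process_single_stub(sample)
```

Because the sample dict is mutated and returned, downstream operators (e.g. a length filter) see the optimized text under the same keys.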
Usage Examples
Full Generation Pipeline
process:
# Step 1: Generate QA from text
- generate_qa_from_text_mapper:
hf_model: Qwen/Qwen2.5-7B-Instruct
max_num: 3
# Step 2: Calibrate quality
- calibrate_qa_mapper:
api_model: gpt-4o
# Step 3: Optimize complexity
- optimize_qa_mapper:
api_or_hf_model: Qwen/Qwen2.5-7B-Instruct
is_hf_model: true
enable_vllm: true
sampling_params:
temperature: 0.8
# Step 4: Filter low-quality results
- text_length_filter:
min_len: 50
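As a sanity check, the stage order of a config like the one above can be extracted with a small stdlib-only script. This is a sketch for illustrating the generate → calibrate → optimize → filter ordering; real configs should be parsed with a proper YAML library:

```python
import re

# The pipeline config from above, inlined for the sketch.
PIPELINE_YAML = """\
process:
  - generate_qa_from_text_mapper:
      hf_model: Qwen/Qwen2.5-7B-Instruct
      max_num: 3
  - calibrate_qa_mapper:
      api_model: gpt-4o
  - optimize_qa_mapper:
      api_or_hf_model: Qwen/Qwen2.5-7B-Instruct
      is_hf_model: true
      enable_vllm: true
      sampling_params:
        temperature: 0.8
  - text_length_filter:
      min_len: 50
"""

# Operator names are the list items directly under `process:`; nested
# parameters (e.g. sampling_params) have no leading "-" and are skipped.
ops = re.findall(r"^\s*-\s*(\w+):", PIPELINE_YAML, re.MULTILINE)
```

Placing optimize_qa_mapper before the length filter means low-quality rewrites that collapse to short text are still caught downstream.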