Implementation: Data-Juicer OptimizeQAMapper Process
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Quality, LLM |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Concrete tool from the Data-Juicer framework for optimizing question-answer pairs via LLM-based rewriting.
Description
OptimizeQAMapper is a Mapper operator that rewrites QA pairs to improve their quality, complexity, and educational value. It supports both API-based and HuggingFace model backends, vLLM acceleration, configurable system prompts, and retry logic. The optimization prompt instructs the LLM to enhance questions and answers while preserving factual accuracy.
Usage
Use this operator in a pipeline after generation and/or calibration steps. It can run against local HuggingFace models or API endpoints.
Code Reference
Source Location
- Repository: data-juicer
- File: data_juicer/ops/mapper/optimize_qa_mapper.py
- Lines: L23-178
Signature
@OPERATORS.register_module('optimize_qa_mapper')
class OptimizeQAMapper(Mapper):
def __init__(
self,
api_or_hf_model: str = None,
is_hf_model: bool = True,
*,
system_prompt: str = None,
input_template: str = None,
output_pattern: str = None,
enable_vllm: bool = False,
model_params: dict = None,
sampling_params: dict = None,
try_num: PositiveInt = 3,
**kwargs
):
"""
Args:
api_or_hf_model: Model name/path.
is_hf_model: True for HuggingFace, False for API.
system_prompt: Optimization instructions.
input_template: Template for formatting input QA.
output_pattern: Regex for parsing optimized output.
enable_vllm: Use vLLM engine.
model_params: Model loading parameters.
sampling_params: Generation parameters.
try_num: Retry count.
"""
def process_single(self, sample):
"""
Optimize a single QA pair.
Args:
sample: Dict with query_key and response_key.
Returns:
sample with optimized query and response.
"""
Import
from data_juicer.ops.mapper.optimize_qa_mapper import OptimizeQAMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| api_or_hf_model | str | Yes | Model name/path for optimization |
| sample[query_key] | str | Yes | Question to optimize |
| sample[response_key] | str | Yes | Answer to optimize |
| enable_vllm | bool | No | Use vLLM engine (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| sample[query_key] | str | Optimized question |
| sample[response_key] | str | Optimized answer |
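The in-place contract in the tables above (same keys in, same keys out, other fields untouched) can be illustrated with a stub that mimics `process_single`. The stub's "optimization" is a placeholder, and the literal keys `'query'`/`'response'` stand in for the configured `query_key`/`response_key`:

```python
def process_single_stub(sample):
    # Mimics the operator's I/O contract: read query_key/response_key,
    # write the optimized text back to the same keys, return the sample.
    sample['query'] = sample['query'].strip() + ' (optimized)'
    sample['response'] = sample['response'].strip() + ' (optimized)'
    return sample

sample = {'query': ' What is DNA? ', 'response': ' A molecule. ', 'meta': {}}
out = process_single_stub(sample)
```

Because the sample dict is mutated and returned, downstream operators (e.g. a length filter) see the optimized text under the same keys.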
Usage Examples
Full Generation Pipeline
process:
# Step 1: Generate QA from text
- generate_qa_from_text_mapper:
hf_model: Qwen/Qwen2.5-7B-Instruct
max_num: 3
# Step 2: Calibrate quality
- calibrate_qa_mapper:
api_model: gpt-4o
# Step 3: Optimize complexity
- optimize_qa_mapper:
api_or_hf_model: Qwen/Qwen2.5-7B-Instruct
is_hf_model: true
enable_vllm: true
sampling_params:
temperature: 0.8
# Step 4: Filter low-quality results
- text_length_filter:
min_len: 50
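As a sanity check, the stage order of a config like the one above can be extracted with a small stdlib-only script. This is a sketch for illustrating the generate → calibrate → optimize → filter ordering; real configs should be parsed with a proper YAML library:

```python
import re

# The pipeline config from above, inlined for the sketch.
PIPELINE_YAML = """\
process:
  - generate_qa_from_text_mapper:
      hf_model: Qwen/Qwen2.5-7B-Instruct
      max_num: 3
  - calibrate_qa_mapper:
      api_model: gpt-4o
  - optimize_qa_mapper:
      api_or_hf_model: Qwen/Qwen2.5-7B-Instruct
      is_hf_model: true
      enable_vllm: true
      sampling_params:
        temperature: 0.8
  - text_length_filter:
      min_len: 50
"""

# Operator names are the list items directly under `process:`; nested
# parameters (e.g. sampling_params) have no leading "-" and are skipped.
ops = re.findall(r"^\s*-\s*(\w+):", PIPELINE_YAML, re.MULTILINE)
```

Placing optimize_qa_mapper before the length filter means low-quality rewrites that collapse to short text are still caught downstream.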