
Implementation: Data-Juicer OptimizeQAMapper Process

From Leeroopedia
Knowledge Sources
Domains NLP, Data_Quality, LLM
Last Updated 2026-02-14 17:00 GMT

Overview

A concrete tool in the Data-Juicer framework for optimizing question-answer pairs via LLM-based rewriting.

Description

OptimizeQAMapper is a Mapper operator that rewrites QA pairs to improve their quality, complexity, and educational value. It supports both API-based and HuggingFace model backends, vLLM acceleration, configurable system prompts, and retry logic. The optimization prompt instructs the LLM to enhance questions and answers while preserving factual accuracy.
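The rewrite-with-retry behavior described above can be pictured with a minimal sketch. This is illustrative only, not the library's actual implementation: the output pattern, prompt format, and model-call interface here are placeholder assumptions.

```python
import re

# Hypothetical output pattern: the LLM is asked to emit the optimized pair
# in a tagged format that a regex can recover. The operator's real default
# pattern may differ.
OUTPUT_PATTERN = r"Question:\s*(.+?)\s*Answer:\s*(.+)"


def parse_output(raw_output: str):
    """Extract the optimized question and answer from raw model text."""
    match = re.search(OUTPUT_PATTERN, raw_output, re.DOTALL)
    if match is None:
        return None, None
    return match.group(1).strip(), match.group(2).strip()


def optimize_qa(query: str, response: str, call_model, try_num: int = 3):
    """Call the model until the output parses, up to try_num attempts.

    call_model is a stand-in for the API/HuggingFace backend: it takes
    the original query and response and returns raw model text.
    """
    for _ in range(try_num):
        raw = call_model(query, response)
        new_q, new_a = parse_output(raw)
        if new_q and new_a:
            return new_q, new_a
    # Fall back to the original pair if every attempt fails to parse.
    return query, response
```

The fallback on parse failure mirrors the retry logic implied by try_num: a malformed model response never destroys the original QA pair.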

Usage

Use OptimizeQAMapper as a pipeline operator after generation and/or calibration steps. It can run against either a local HuggingFace model or an API endpoint.

Code Reference

Source Location

  • Repository: data-juicer
  • File: data_juicer/ops/mapper/optimize_qa_mapper.py
  • Lines: L23-178

Signature

@OPERATORS.register_module('optimize_qa_mapper')
class OptimizeQAMapper(Mapper):
    def __init__(
        self,
        api_or_hf_model: str = None,
        is_hf_model: bool = True,
        *,
        system_prompt: str = None,
        input_template: str = None,
        output_pattern: str = None,
        enable_vllm: bool = False,
        model_params: dict = None,
        sampling_params: dict = None,
        try_num: PositiveInt = 3,
        **kwargs
    ):
        """
        Args:
            api_or_hf_model: Model name/path.
            is_hf_model: True for HuggingFace, False for API.
            system_prompt: Optimization instructions.
            input_template: Template for formatting input QA.
            output_pattern: Regex for parsing optimized output.
            enable_vllm: Use vLLM engine.
            model_params: Model loading parameters.
            sampling_params: Generation parameters.
            try_num: Retry count.
        """

    def process_single(self, sample):
        """
        Optimize a single QA pair.

        Args:
            sample: Dict with query_key and response_key.

        Returns:
            sample with optimized query and response.
        """

Import

from data_juicer.ops.mapper.optimize_qa_mapper import OptimizeQAMapper
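The input_template parameter is an ordinary format string. A minimal sketch of how a custom template might be filled follows; the placeholder names and wording here are assumptions, not the operator's documented defaults.

```python
# Hypothetical template; the operator's real default placeholders may differ.
input_template = (
    "Optimize the following question-answer pair for clarity and depth.\n"
    "Question: {query}\n"
    "Answer: {response}"
)

# The operator would render the prompt from a sample's QA fields like so:
prompt = input_template.format(
    query="What is gradient descent?",
    response="An optimization algorithm.",
)
```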

I/O Contract

Inputs

Name Type Required Description
api_or_hf_model str Yes Model name/path for optimization
sample[query_key] str Yes Question to optimize
sample[response_key] str Yes Answer to optimize
enable_vllm bool No Use vLLM engine (default: False)

Outputs

Name Type Description
sample[query_key] str Optimized question
sample[response_key] str Optimized answer
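The contract above can be illustrated with a stand-in optimizer: the operator rewrites the query/response fields of the sample and returns the same sample. The "query"/"response" key names below follow common Data-Juicer defaults, which is an assumption.

```python
def process_single(sample: dict, optimize) -> dict:
    """Apply an optimizer to the QA fields of one sample, in place."""
    sample["query"], sample["response"] = optimize(
        sample["query"], sample["response"]
    )
    return sample


# Stand-in optimizer: a real run would call the configured LLM instead.
sample = {"query": "What is an LLM?", "response": "A model."}
out = process_single(
    sample,
    lambda q, a: (q + " Explain briefly.", a + " (a large language model)"),
)
```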

Usage Examples

Full Generation Pipeline

process:
  # Step 1: Generate QA from text
  - generate_qa_from_text_mapper:
      hf_model: Qwen/Qwen2.5-7B-Instruct
      max_num: 3

  # Step 2: Calibrate quality
  - calibrate_qa_mapper:
      api_model: gpt-4o

  # Step 3: Optimize complexity
  - optimize_qa_mapper:
      api_or_hf_model: Qwen/Qwen2.5-7B-Instruct
      is_hf_model: true
      enable_vllm: true
      sampling_params:
        temperature: 0.8

  # Step 4: Filter low-quality results
  - text_length_filter:
      min_len: 50

Related Pages

Implements Principle

Requires Environment
