Implementation: Data-Juicer GenerateQAFromTextMapper Process
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Generation, LLM |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A concrete operator provided by the Data-Juicer framework for generating question-answer pairs from source text using LLMs.
Description
GenerateQAFromTextMapper is a batched Mapper operator that uses an LLM (via HuggingFace transformers or vLLM) to generate QA pairs from input text. It formats source text into prompts using a configurable template, calls the LLM for generation, parses the output using regex patterns, and writes the results to query/response fields. It supports multiple QA pairs per input text, retry logic, and both API-based and local model backends.
Usage
Use it as an operator in a Data-Juicer pipeline: configure it with a model name, and optionally customize the prompt template and output pattern.
Code Reference
Source Location
- Repository: data-juicer
- File: data_juicer/ops/mapper/generate_qa_from_text_mapper.py
- Lines: L24-146
Signature
@OPERATORS.register_module('generate_qa_from_text_mapper')
class GenerateQAFromTextMapper(Mapper):
    _batched_op = True

    def __init__(
        self,
        hf_model: str = None,
        max_num: PositiveInt = None,
        *,
        output_pattern: str = None,
        enable_vllm: bool = False,
        model_params: dict = None,
        sampling_params: dict = None,
        **kwargs
    ):
        """
        Args:
            hf_model: HuggingFace model name/path for generation.
            max_num: Maximum number of QA pairs to generate per text.
            output_pattern: Regex pattern for parsing LLM output.
            enable_vllm: Use the vLLM engine for inference.
            model_params: Model loading parameters.
            sampling_params: Generation parameters (temperature, top_p, etc.).
        """

    def process_batched(self, samples):
        """
        Generate QA pairs for a batch of samples.

        Args:
            samples: Dict of lists (batched format) with text_key.

        Returns:
            samples with query_key and response_key populated.
        """
Import
from data_juicer.ops.mapper.generate_qa_from_text_mapper import GenerateQAFromTextMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hf_model | str | Yes | Model name/path for generation |
| samples[text_key] | List[str] | Yes | Source texts to generate QA from |
| max_num | PositiveInt | No | Max QA pairs per text |
| enable_vllm | bool | No | Use vLLM engine (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples[query_key] | List[str] | Generated questions |
| samples[response_key] | List[str] | Generated answers |
Usage Examples
YAML Configuration
process:
  - generate_qa_from_text_mapper:
      hf_model: Qwen/Qwen2.5-7B-Instruct
      max_num: 3
      enable_vllm: true
      sampling_params:
        temperature: 0.7
        max_new_tokens: 512
Programmatic Usage
from data_juicer.ops.mapper.generate_qa_from_text_mapper import GenerateQAFromTextMapper

mapper = GenerateQAFromTextMapper(
    hf_model='Qwen/Qwen2.5-7B-Instruct',
    max_num=3,
    enable_vllm=True,
    sampling_params={'temperature': 0.7},
)

# Apply to dataset
result = dataset.process([mapper])
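The retry logic mentioned in the Description can be sketched as below. This is an assumed behavior, not the operator's actual code: regenerate when the LLM output parses to no QA pairs, up to a bounded number of attempts. All names here (`generate_with_retry`, `flaky_llm`) are hypothetical.

```python
import re

# Hedged sketch of retry-on-empty-parse (assumption, not Data-Juicer's code).
def generate_with_retry(prompt, call_llm, parse_fn, max_retries=3):
    for _ in range(max_retries):
        raw = call_llm(prompt)
        pairs = parse_fn(raw)
        if pairs:
            return pairs
    return []  # give up after max_retries empty parses

# Stub LLM that fails on the first call, then succeeds:
calls = {"n": 0}
def flaky_llm(prompt):
    calls["n"] += 1
    return "" if calls["n"] == 1 else "Human: Q?\nAssistant: A."

parse = lambda raw: re.findall(r"Human:(.*?)Assistant:(.*?)$", raw, re.DOTALL)
result = generate_with_retry("prompt", flaky_llm, parse)
print(result)
```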