Implementation: Data-Juicer GenerateQAFromTextMapper Process
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Generation, LLM |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A concrete operator provided by the Data-Juicer framework for generating question-answer pairs from source text using LLMs.
Description
GenerateQAFromTextMapper is a batched Mapper operator that uses an LLM (via HuggingFace transformers or vLLM) to generate QA pairs from input text. It formats source text into prompts using a configurable template, calls the LLM for generation, parses the output using regex patterns, and writes the results to query/response fields. It supports multiple QA pairs per input text, retry logic, and both API-based and local model backends.
Usage
Use it as an operator in a Data-Juicer pipeline: configure it with a model name, and optionally customize the prompt template and output pattern.
Code Reference
Source Location
- Repository: data-juicer
- File: data_juicer/ops/mapper/generate_qa_from_text_mapper.py
- Lines: L24-146
Signature
@OPERATORS.register_module('generate_qa_from_text_mapper')
class GenerateQAFromTextMapper(Mapper):
    _batched_op = True

    def __init__(
        self,
        hf_model: str = None,
        max_num: PositiveInt = None,
        *,
        output_pattern: str = None,
        enable_vllm: bool = False,
        model_params: dict = None,
        sampling_params: dict = None,
        **kwargs
    ):
        """
        Args:
            hf_model: HuggingFace model name/path for generation.
            max_num: Maximum number of QA pairs to generate per text.
            output_pattern: Regex pattern for parsing LLM output.
            enable_vllm: Use the vLLM engine for inference.
            model_params: Model loading parameters.
            sampling_params: Generation parameters (temperature, top_p, etc.).
        """

    def process_batched(self, samples):
        """
        Generate QA pairs for a batch of samples.

        Args:
            samples: Dict of lists (batched format) with text_key.

        Returns:
            samples with query_key and response_key populated.
        """
Import
from data_juicer.ops.mapper.generate_qa_from_text_mapper import GenerateQAFromTextMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hf_model | str | Yes | Model name/path for generation |
| samples[text_key] | List[str] | Yes | Source texts to generate QA from |
| max_num | PositiveInt | No | Max QA pairs per text |
| enable_vllm | bool | No | Use vLLM engine (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples[query_key] | List[str] | Generated questions |
| samples[response_key] | List[str] | Generated answers |
Usage Examples
YAML Configuration
process:
  - generate_qa_from_text_mapper:
      hf_model: Qwen/Qwen2.5-7B-Instruct
      max_num: 3
      enable_vllm: true
      sampling_params:
        temperature: 0.7
        max_new_tokens: 512
Programmatic Usage
from data_juicer.ops.mapper.generate_qa_from_text_mapper import GenerateQAFromTextMapper

mapper = GenerateQAFromTextMapper(
    hf_model='Qwen/Qwen2.5-7B-Instruct',
    max_num=3,
    enable_vllm=True,
    sampling_params={'temperature': 0.7},
)

# Apply to dataset
result = dataset.process([mapper])
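The retry logic mentioned in the Description can be sketched as below. This is an assumed behavior, not the operator's actual code: regenerate when the LLM output parses to no QA pairs, up to a bounded number of attempts. All names here (`generate_with_retry`, `flaky_llm`) are hypothetical.

```python
import re

# Hedged sketch of retry-on-empty-parse (assumption, not Data-Juicer's code).
def generate_with_retry(prompt, call_llm, parse_fn, max_retries=3):
    for _ in range(max_retries):
        raw = call_llm(prompt)
        pairs = parse_fn(raw)
        if pairs:
            return pairs
    return []  # give up after max_retries empty parses

# Stub LLM that fails on the first call, then succeeds:
calls = {"n": 0}
def flaky_llm(prompt):
    calls["n"] += 1
    return "" if calls["n"] == 1 else "Human: Q?\nAssistant: A."

parse = lambda raw: re.findall(r"Human:(.*?)Assistant:(.*?)$", raw, re.DOTALL)
result = generate_with_retry("prompt", flaky_llm, parse)
print(result)
```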