Implementation:Marker Inc Korea AutoRAG Generate QA LlamaIndex
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, QA_Generation, NLP |
| Last Updated | 2026-02-08 06:00 GMT |
Overview
Concrete tool for generating QA pairs from content strings using LlamaIndex LLM models with single-prompt and ratio-based multi-prompt strategies.
Description
⚠️ LEGACY/DEPRECATED: This module is in the legacy/ directory and is superseded by the modern QA schema pipeline. See Heuristic:Marker_Inc_Korea_AutoRAG_Warning_Deprecated_Legacy_QA_Creation.
This module provides LlamaIndex-backed QA generation for the legacy data creation pipeline. generate_qa_llama_index fills a single prompt template, which contains {{text}} and {{num_questions}} placeholders, and generates QA pairs asynchronously. generate_qa_llama_index_by_ratio distributes the contents across multiple prompts according to the specified ratios, which enables diverse question types. generate_answers creates answers for existing queries using a system prompt. Internally, async_qa_gen_llama_index calls llm.acomplete with retry logic and parses the [Q]:/[A]: formatted output via parse_output.
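To make the [Q]:/[A]: output format concrete, the following is a minimal sketch of how such text could be parsed into query/generation_gt dicts. The function name parse_qa_text and the regular expression are illustrative assumptions; the actual parse_output in the repository may handle the format differently.

```python
import re
from typing import Dict, List

# Hypothetical stand-in for parse_output; the real implementation may differ.
QA_PATTERN = re.compile(
    r"\[Q\]:\s*(?P<query>.*?)\s*\[A\]:\s*(?P<answer>.*?)(?=\s*\[Q\]:|\Z)",
    re.DOTALL,
)


def parse_qa_text(raw: str) -> List[Dict[str, str]]:
    """Extract every [Q]:/[A]: pair found in raw LLM output."""
    return [
        {
            "query": match.group("query").strip(),
            "generation_gt": match.group("answer").strip(),
        }
        for match in QA_PATTERN.finditer(raw)
    ]


sample = (
    "[Q]: Where is the Eiffel Tower?\n"
    "[A]: In Paris, France.\n\n"
    "[Q]: When was it built?\n"
    "[A]: In 1889."
)
pairs = parse_qa_text(sample)
```

Text without any [Q]:/[A]: markers yields an empty list in this sketch, which is the kind of length mismatch the retry logic is meant to catch.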
Usage
Import these functions when using the legacy QA creation pipeline with a LlamaIndex LLM backend. Pass generate_qa_llama_index as the qa_creation_func to make_single_content_qa, or use generate_qa_llama_index_by_ratio when you want to vary question styles across different prompt templates.
Code Reference
Source Location
- Repository: Marker_Inc_Korea_AutoRAG
- File: autorag/data/legacy/qacreation/llama_index.py
- Lines: 1-253
Signature
def generate_qa_llama_index(
    llm: LLM,
    contents: List[str],
    prompt: Optional[str] = None,
    question_num_per_content: int = 1,
    max_retries: int = 3,
    batch: int = 4,
) -> List[List[Dict]]:
    """
    Generate a qa set from the list of contents using a single prompt.
    :param llm: Llama index model.
    :param contents: List of content strings.
    :param prompt: Prompt with {{text}} and {{num_questions}} placeholders.
    :param question_num_per_content: Questions per content (default 1).
    :param max_retries: Retry limit for incorrect output length (default 3).
    :param batch: Async batch size (default 4).
    :return: 2-d list of dicts with 'query' and 'generation_gt'.
    """

def generate_qa_llama_index_by_ratio(
    llm: LLM,
    contents: List[str],
    prompts_ratio: Dict,
    question_num_per_content: int = 1,
    max_retries: int = 3,
    random_state: int = 42,
    batch: int = 4,
) -> List[List[Dict]]:
    """
    Generate QA set with multiple prompts distributed by ratio.
    :param llm: Llama index model.
    :param contents: List of content strings.
    :param prompts_ratio: Dict of {prompt_path: ratio}.
    :param question_num_per_content: Questions per content (default 1).
    :param max_retries: Retry limit (default 3).
    :param random_state: Random seed (default 42).
    :param batch: Async batch size (default 4).
    :return: 2-d list of dicts with 'query' and 'generation_gt'.
    """

def generate_answers(
    llm: LLM,
    contents: List[str],
    queries: List[str],
    batch: int = 4,
) -> List[List[Dict]]:
    """
    Generate answers for existing queries given content strings.
    """
Import
from autorag.data.legacy.qacreation.llama_index import (
    generate_qa_llama_index,
    generate_qa_llama_index_by_ratio,
    generate_answers,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| llm | llama_index.core.llms.LLM | Yes | LlamaIndex LLM instance for text generation |
| contents | List[str] | Yes | List of content strings to generate QA from |
| prompt | str | No | Prompt template with {{text}} and {{num_questions}} placeholders |
| question_num_per_content | int | No | Number of QA pairs per content (default 1) |
| max_retries | int | No | Maximum retries if output length mismatches (default 3) |
| batch | int | No | Async processing batch size (default 4) |
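| prompts_ratio | Dict | For ratio variant | Dict mapping prompt file paths to ratios |
| random_state | int | No (ratio variant only) | Random seed (default 42) |
| queries | List[str] | For generate_answers | Existing queries to generate answers for |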
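The max_retries parameter guards against the LLM returning the wrong number of QA pairs. The sketch below shows one way such a retry loop could look. The helper generate_with_retry, the choice to treat max_retries as the total number of attempts, and the RuntimeError raised on exhaustion are all assumptions made for illustration, not the behavior confirmed for async_qa_gen_llama_index.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def generate_with_retry(
    generate: Callable[[], Awaitable[List[Dict[str, str]]]],
    expected_count: int,
    max_retries: int = 3,
) -> List[Dict[str, str]]:
    """Call `generate` until it returns exactly `expected_count` pairs."""
    # Assumption: max_retries is the total number of attempts.
    for _ in range(max_retries):
        result = await generate()
        if len(result) == expected_count:
            return result
    # Assumption: exhausting all attempts is reported as an error.
    raise RuntimeError(
        f"No output of length {expected_count} after {max_retries} attempts."
    )


attempts = {"count": 0}


async def flaky_generate() -> List[Dict[str, str]]:
    """Fake LLM call: returns one pair first, then the two pairs requested."""
    attempts["count"] += 1
    pair = {"query": "q", "generation_gt": "a"}
    return [pair] if attempts["count"] < 2 else [pair, pair]


result = asyncio.run(generate_with_retry(flaky_generate, expected_count=2))
```

In the real module the generate step would wrap llm.acomplete and the output parser; here a fake coroutine stands in so the loop can run without an API key.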
Outputs
| Name | Type | Description |
|---|---|---|
| result | List[List[Dict]] | 2-d list; each inner list contains dicts with query (str) and generation_gt (str) keys |
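Because the result is a 2-d list aligned with contents, downstream code often needs one flat row per QA pair. The following sketch shows one possible flattening step. The helper flatten_qa and the "content" key are hypothetical names introduced here, and the sample data is made up for illustration.

```python
from typing import Dict, List


def flatten_qa(
    contents: List[str], results: List[List[Dict[str, str]]]
) -> List[Dict[str, str]]:
    """Pair each content string with every QA dict generated from it."""
    rows = []
    for content, qa_list in zip(contents, results):
        for qa in qa_list:
            rows.append(
                {
                    "content": content,
                    "query": qa["query"],
                    "generation_gt": qa["generation_gt"],
                }
            )
    return rows


# Made-up data shaped like the documented output contract.
example_contents = ["doc A", "doc B"]
example_results = [
    [{"query": "q1", "generation_gt": "a1"}, {"query": "q2", "generation_gt": "a2"}],
    [{"query": "q3", "generation_gt": "a3"}],
]
rows = flatten_qa(example_contents, example_results)
```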
Usage Examples
Single Prompt QA Generation
from llama_index.llms.openai import OpenAI
from autorag.data.legacy.qacreation.llama_index import generate_qa_llama_index

llm = OpenAI(model="gpt-3.5-turbo")
contents = [
    "The Eiffel Tower is located in Paris, France. It was built in 1889.",
    "Python is a high-level programming language created by Guido van Rossum.",
]

results = generate_qa_llama_index(
    llm=llm,
    contents=contents,
    question_num_per_content=2,
    batch=2,
)

# results[0] = [{"query": "...", "generation_gt": "..."}, {"query": "...", "generation_gt": "..."}]
for content_qas in results:
    for qa in content_qas:
        print(f"Q: {qa['query']}")
        print(f"A: {qa['generation_gt']}")
Multi-Prompt Ratio-Based Generation
from autorag.data.legacy.qacreation.llama_index import generate_qa_llama_index_by_ratio

# Reuses llm and contents from the single-prompt example.
# Use different prompts for different question styles
prompts_ratio = {
    "/path/to/factoid_prompt.txt": 0.6,
    "/path/to/reasoning_prompt.txt": 0.4,
}

results = generate_qa_llama_index_by_ratio(
    llm=llm,
    contents=contents,
    prompts_ratio=prompts_ratio,
    question_num_per_content=1,
)
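To illustrate what distributing contents by ratio means, here is a minimal sketch that shuffles the contents with a fixed seed and gives each prompt a share proportional to its ratio. The helper split_by_ratio and its rounding rule are assumptions for illustration; the module's own distribution logic may differ in how it shuffles, rounds, or restores the original order.

```python
import random
from typing import Dict, List


def split_by_ratio(
    contents: List[str], ratios: Dict[str, float], random_state: int = 42
) -> Dict[str, List[str]]:
    """Shuffle contents and assign each key a share proportional to its ratio."""
    shuffled = contents[:]
    random.Random(random_state).shuffle(shuffled)

    total = sum(ratios.values())
    keys = list(ratios)
    assignment: Dict[str, List[str]] = {}
    start = 0
    cumulative = 0.0
    for i, key in enumerate(keys):
        cumulative += ratios[key] / total
        # The last key takes the remainder so that no content is dropped.
        if i == len(keys) - 1:
            end = len(shuffled)
        else:
            end = round(cumulative * len(shuffled))
        assignment[key] = shuffled[start:end]
        start = end
    return assignment


docs = [f"doc {i}" for i in range(10)]
shares = split_by_ratio(docs, {"factoid": 0.6, "reasoning": 0.4})
```

With ten contents and a 0.6/0.4 split, this sketch assigns six contents to the first prompt and four to the second, and the same random_state always produces the same assignment.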