Heuristic: Marker Inc Korea AutoRAG Deterministic Evaluation Generation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, RAG, Data_Quality |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Use temperature=0 and minimal token budgets for ground truth generation and metric evaluation to ensure reproducible, cost-efficient evaluation datasets.
Description
AutoRAG enforces deterministic LLM outputs (temperature=0) in two critical contexts: (1) when generating ground-truth answers for evaluation datasets, and (2) when computing LLM-based evaluation metrics (e.g., faithfulness scoring). Additionally, metric evaluation calls use minimal token budgets (max_tokens=2) when only a classification label is needed. This combination ensures that evaluation results are reproducible across runs and that API costs for evaluation are minimized.
Usage
This heuristic is automatically applied in AutoRAG's data creation and evaluation pipelines. When creating custom QA datasets or evaluation metrics, follow the same pattern: set `temperature=0` for any LLM call whose output serves as ground truth or evaluation signal.
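A minimal sketch of this pattern for ground-truth generation, assuming an OpenAI-style chat-completions request; the helper name and prompt are illustrative, not AutoRAG's actual API:

```python
def deterministic_gt_params(model: str, question: str) -> dict:
    """Build OpenAI-style chat-completion kwargs whose output will serve
    as ground truth, so decoding must be pinned to temperature=0."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.0,  # greedy decoding -> the same answer on every run
    }

params = deterministic_gt_params("gpt-4o-mini", "What does the passage say about X?")
```

Passing these kwargs to the chat-completions endpoint yields one canonical answer per question, which is exactly what a reproducible evaluation dataset requires.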
The Insight (Rule of Thumb)
- Action: Set `temperature=0` for all LLM calls that produce ground truth answers or evaluation scores.
- Value: `temperature=0.0` for all such calls; additionally `max_tokens=2` for metric scoring calls.
- Trade-off: Sacrifices output diversity for reproducibility. Ground truth generation produces only one canonical answer per question rather than exploring multiple valid responses.
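The metric-scoring side of the rule can be sketched as follows; the function names and the faithfulness prompt are hypothetical stand-ins for AutoRAG's internals, but the `temperature=0` / `max_tokens=2` combination mirrors the heuristic:

```python
def faithfulness_params(model: str, passage: str, answer: str) -> dict:
    """Build judge-LLM kwargs for a binary faithfulness check."""
    prompt = (
        "Does the answer follow from the passage? Reply True or False.\n"
        f"Passage: {passage}\nAnswer: {answer}"
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # deterministic judgment across runs
        "max_tokens": 2,   # only a one-word label is needed
    }

def parse_label(completion_text: str) -> bool:
    # Map the short completion back to a boolean metric value.
    return completion_text.strip().lower().startswith("true")
```

Capping generation at 2 tokens means each metric call costs a handful of output tokens regardless of how verbose the model would otherwise be.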
Reasoning
Evaluation requires determinism: if the same input produces different outputs on different runs, metric scores become noisy and unreliable. `temperature=0` makes LLM output (near-)deterministic by always selecting the highest-probability token. For metric evaluation specifically, the LLM only needs to output a short classification label (e.g., "True"/"False" or a score), so `max_tokens=2` avoids paying for unnecessary generation. Separately, AutoRAG subtracts a 7-token buffer from each model's maximum token limit to account for chat-message formatting overhead in the OpenAI API.
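The overhead subtraction can be sketched as below; the limits in this `MAX_TOKEN_DICT` are illustrative values, not AutoRAG's actual table:

```python
# Illustrative per-model token limits (stand-in for AutoRAG's MAX_TOKEN_DICT).
MAX_TOKEN_DICT = {
    "gpt-4o-mini": 128_000,
    "gpt-3.5-turbo": 16_385,
}

def usable_max_tokens(model: str) -> int:
    """Subtract a 7-token buffer for chat-message formatting overhead
    (role markers and separators) before setting a request's token budget."""
    return MAX_TOKEN_DICT[model] - 7
```

Without this buffer, a request sized to the exact model limit can be rejected because the chat framing tokens push it over.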
Code Evidence
Ground truth answer generation from `autorag/data/qa/generation_gt/openai_gen_gt.py:34`:
temperature=0.0,
LlamaIndex ground truth generation from `autorag/data/qa/generation_gt/llama_index_gen_gt.py:35`:
temperature=0.0,
Metric evaluation minimal tokens from `autorag/evaluation/metric/generation.py:442-443`:
temperature=0,
max_tokens=2,
Chat token overhead accounting from `autorag/nodes/generator/openai_llm.py:84-86`:
self.max_token_size = (
    MAX_TOKEN_DICT.get(self.llm) - 7
)  # because of chat token usage