Heuristic: Marker Inc Korea AutoRAG Deterministic Evaluation Generation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, RAG, Data_Quality |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Use temperature=0 and minimal token budgets for ground truth generation and metric evaluation to ensure reproducible, cost-efficient evaluation datasets.
Description
AutoRAG enforces deterministic LLM outputs (temperature=0) in two critical contexts: (1) when generating ground-truth answers for evaluation datasets, and (2) when computing LLM-based evaluation metrics (e.g., faithfulness scoring). Additionally, metric evaluation calls use minimal token budgets (max_tokens=2) when only a classification label is needed. This combination ensures that evaluation results are reproducible across runs and that API costs for evaluation are minimized.
Usage
This heuristic is automatically applied in AutoRAG's data creation and evaluation pipelines. When creating custom QA datasets or evaluation metrics, follow the same pattern: set `temperature=0` for any LLM call whose output serves as ground truth or evaluation signal.
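A minimal sketch of this pattern for ground-truth generation, assuming an OpenAI-style chat-completions request; the helper name and prompt are illustrative, not AutoRAG's actual API:

```python
def deterministic_gt_params(model: str, question: str) -> dict:
    """Build OpenAI-style chat-completion kwargs whose output will serve
    as ground truth, so decoding must be pinned to temperature=0."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.0,  # greedy decoding -> the same answer on every run
    }

params = deterministic_gt_params("gpt-4o-mini", "What does the passage say about X?")
```

Passing these kwargs to the chat-completions endpoint yields one canonical answer per question, which is exactly what a reproducible evaluation dataset requires.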
The Insight (Rule of Thumb)
- Action: Set `temperature=0` for all LLM calls that produce ground truth answers or evaluation scores.
- Value: `temperature=0.0` for all such calls; additionally `max_tokens=2` for metric scoring calls.
- Trade-off: Sacrifices output diversity for reproducibility. Ground truth generation produces only one canonical answer per question rather than exploring multiple valid responses.
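The metric-scoring side of the rule can be sketched as follows; the function names and the faithfulness prompt are hypothetical stand-ins for AutoRAG's internals, but the `temperature=0` / `max_tokens=2` combination mirrors the heuristic:

```python
def faithfulness_params(model: str, passage: str, answer: str) -> dict:
    """Build judge-LLM kwargs for a binary faithfulness check."""
    prompt = (
        "Does the answer follow from the passage? Reply True or False.\n"
        f"Passage: {passage}\nAnswer: {answer}"
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # deterministic judgment across runs
        "max_tokens": 2,   # only a one-word label is needed
    }

def parse_label(completion_text: str) -> bool:
    # Map the short completion back to a boolean metric value.
    return completion_text.strip().lower().startswith("true")
```

Capping generation at 2 tokens means each metric call costs a handful of output tokens regardless of how verbose the model would otherwise be.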
Reasoning
Evaluation requires determinism: if the same input produces different outputs on different runs, metric scores become noisy and unreliable. `temperature=0` makes LLM output (near-)deterministic by always selecting the highest-probability token. For metric evaluation specifically, the LLM only needs to output a short classification label (e.g., "True"/"False" or a score), so `max_tokens=2` avoids paying for unnecessary generation. Separately, AutoRAG subtracts a 7-token buffer from each model's maximum token limit to account for chat-message formatting overhead in the OpenAI API.
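The overhead subtraction can be sketched as below; the limits in this `MAX_TOKEN_DICT` are illustrative values, not AutoRAG's actual table:

```python
# Illustrative per-model token limits (stand-in for AutoRAG's MAX_TOKEN_DICT).
MAX_TOKEN_DICT = {
    "gpt-4o-mini": 128_000,
    "gpt-3.5-turbo": 16_385,
}

def usable_max_tokens(model: str) -> int:
    """Subtract a 7-token buffer for chat-message formatting overhead
    (role markers and separators) before setting a request's token budget."""
    return MAX_TOKEN_DICT[model] - 7
```

Without this buffer, a request sized to the exact model limit can be rejected because the chat framing tokens push it over.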
Code Evidence
Ground truth answer generation from `autorag/data/qa/generation_gt/openai_gen_gt.py:34`:
temperature=0.0,
LlamaIndex ground truth generation from `autorag/data/qa/generation_gt/llama_index_gen_gt.py:35`:
temperature=0.0,
Metric evaluation minimal tokens from `autorag/evaluation/metric/generation.py:442-443`:
temperature=0,
max_tokens=2,
Chat token overhead accounting from `autorag/nodes/generator/openai_llm.py:84-86`:
self.max_token_size = (
    MAX_TOKEN_DICT.get(self.llm) - 7
)  # because of chat token usage