Heuristic: explodinggradients/ragas LLM Temperature Defaults
| Knowledge Sources | |
|---|---|
| Domains | LLM_Evaluation, Optimization |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
Ragas uses a near-deterministic temperature (0.01) for single completions and 0.3 for multiple completions, with special parameter handling for OpenAI reasoning models (the o-series and GPT-5+).
Description
The `BaseRagasLLM.get_temperature()` method returns 0.01 for single completions (ensuring near-deterministic, reproducible results) and 0.3 when multiple completions are requested (to get diverse outputs). The value 0.01 is used instead of 0.0 because some APIs reject exactly zero. Additionally, OpenAI reasoning models (o1, o3, GPT-5, etc.) have strict parameter constraints: temperature must be fixed at 1.0, `top_p` must be removed, and `max_tokens` must be renamed to `max_completion_tokens`. The default `max_tokens=1024` may be insufficient for structured output from reasoning models.
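The parameter rewrite for reasoning models can be sketched as a standalone helper. The function name `adjust_for_reasoning_model` and the plain-dict interface are illustrative assumptions, not the actual Ragas API; only the three constraints themselves come from the description above:

```python
def adjust_for_reasoning_model(params: dict) -> dict:
    """Rewrite OpenAI chat parameters to satisfy reasoning-model constraints."""
    adjusted = dict(params)
    # Reasoning models only accept the fixed temperature of 1.0.
    adjusted["temperature"] = 1.0
    # top_p is not supported and must be dropped entirely.
    adjusted.pop("top_p", None)
    # max_tokens must be renamed to max_completion_tokens.
    if "max_tokens" in adjusted:
        adjusted["max_completion_tokens"] = adjusted.pop("max_tokens")
    return adjusted

print(adjust_for_reasoning_model({"temperature": 0.01, "top_p": 0.1, "max_tokens": 1024}))
# {'temperature': 1.0, 'max_completion_tokens': 1024}
```

Sending the unadjusted parameters (e.g. `max_tokens` or any `top_p`) to an o-series model results in an API error, which is why the rewrite happens before the request is issued.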
Usage
Apply this heuristic when debugging inconsistent metric results (check the temperature), when using OpenAI reasoning models (o1, o3, GPT-5), or when structured output is being truncated (increase `max_tokens`). The reasoning-model detection is pattern-based and covers o1-o9 and gpt-5 through gpt-19 -- future models beyond these ranges will need a code update.
The Insight (Rule of Thumb)
- Action: For reproducible evaluation results, use the default temperature (0.01). For reasoning models, increase `max_tokens` to 4096+.
- Values:
- Single completion: temperature = 0.01
- Multiple completions (n>1): temperature = 0.3
- Reasoning models (o-series): temperature forced to 1.0 by the API
- Default max_tokens: 1024 (may need 4096+ for reasoning models with structured output)
- Trade-off: Higher temperature = more diverse but less reproducible results. Lower max_tokens = cheaper but risks truncation.
- Future-proofing limitation: Reasoning model detection covers o1-o9 and gpt-5 to gpt-19. Models like o10 or gpt-20 will not be auto-detected and may use incorrect API parameters.
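The boundary behavior of that limitation can be demonstrated with a simplified reimplementation of the detection pattern described above (a sketch mirroring the logic, not the exact library code):

```python
def is_reasoning_model(model_str: str) -> bool:
    # o-series: "o" followed by a single digit 1-9, bare or with a "-"/"_" suffix.
    if len(model_str) >= 2 and model_str[0] == "o" and model_str[1] in "123456789":
        if len(model_str) == 2 or model_str[2] in ("-", "_"):
            return True
    # GPT versions 5 through 19 (gpt-5, gpt-5-*, ..., gpt-19).
    if model_str.startswith("gpt-"):
        version_str = model_str[4:].split("-")[0].split("_")[0]
        try:
            if 5 <= int(version_str) <= 19:
                return True
        except ValueError:
            pass
    return False

print(is_reasoning_model("o3-mini"))  # True
print(is_reasoning_model("o10"))      # False: beyond the o1-o9 pattern
print(is_reasoning_model("gpt-19"))   # True
print(is_reasoning_model("gpt-20"))   # False: beyond the supported range
```

Note that `o10` fails detection because the pattern requires the character after the digit to be a separator, so a two-digit o-series name never matches.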
Reasoning
Near-deterministic temperature ensures that evaluation metrics are reproducible across runs. The 0.01 value (instead of 0.0) avoids edge cases with APIs that reject exactly zero temperature. For reasoning models, OpenAI requires `temperature=1.0` and replaces `max_tokens` with `max_completion_tokens` -- using the wrong parameter name causes API errors. The `InstructorModelArgs` docstring explicitly warns that reasoning models may need 4096+ tokens for structured JSON output to avoid truncation, which would cause parsing failures and trigger expensive retry logic.
Code Evidence
Temperature selection from `src/ragas/llms/base.py:71-73`:
```python
def get_temperature(self, n: int) -> float:
    """Return the temperature to use for completion based on n."""
    return 0.3 if n > 1 else 0.01
```
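The selection rule can be exercised directly; since the method above takes `self`, this free-function version is an illustrative stand-in:

```python
def get_temperature(n: int) -> float:
    # n > 1 means multiple completions were requested: use 0.3 for
    # diversity; otherwise stay near-deterministic at 0.01.
    return 0.3 if n > 1 else 0.01

print(get_temperature(1))  # 0.01
print(get_temperature(4))  # 0.3
```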
Reasoning model detection from `src/ragas/llms/base.py:872-898`:
```python
def is_reasoning_model(model_str: str) -> bool:
    # O-series reasoning models (o1, o1-mini, o1-2024-12-17, o2, o3, ...)
    # TODO: Update to support o10+ when OpenAI releases models beyond o9
    if (
        len(model_str) >= 2
        and model_str[0] == "o"
        and model_str[1] in "123456789"
    ):
        if len(model_str) == 2 or model_str[2] in ("-", "_"):
            return True

    # GPT-5 and newer (gpt-5, gpt-5-*, gpt-6, ..., gpt-19)
    # TODO: Update to support gpt-20+ when OpenAI releases models beyond gpt-19
    if model_str.startswith("gpt-"):
        version_str = model_str[4:].split("-")[0].split("_")[0]
        try:
            version = int(version_str)
            if 5 <= version <= 19:
                return True
        except ValueError:
            pass

    return False
```
InstructorModelArgs max_tokens warning from `src/ragas/llms/base.py:754-764`:
```python
class InstructorModelArgs(BaseModel):
    """Note: For GPT-5 and o-series models, you may need to increase max_tokens
    to 4096+ for structured output to work properly."""

    temperature: float = 0.01
    top_p: float = 0.1
    max_tokens: int = 1024
```
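In practice this means raising `max_tokens` when targeting a reasoning model. A minimal sketch using a plain dataclass as a stand-in for the Pydantic model above (field names and defaults match; the class itself is illustrative):

```python
from dataclasses import dataclass

@dataclass
class ModelArgs:
    # Mirrors the InstructorModelArgs defaults shown above.
    temperature: float = 0.01
    top_p: float = 0.1
    max_tokens: int = 1024

# For o-series / GPT-5 models, raise the token budget so structured JSON
# output is not truncated mid-object (truncation causes parsing failures
# and triggers retry logic).
reasoning_args = ModelArgs(max_tokens=4096)
print(reasoning_args.max_tokens)  # 4096
```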
Mode.JSON default from `src/ragas/llms/base.py:574-578`:
```python
# Note: For OpenAI, we use Mode.JSON by default instead of Mode.TOOLS because
# OpenAI's function calling (TOOLS mode) has issues with Dict type annotations
# in Pydantic models - it returns empty objects `{}` instead of proper structured
# data. Mode.JSON works correctly with all Pydantic types including Dict.
```
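For reference, the same default can be reproduced when wiring up a client manually. This is a configuration sketch assuming the `instructor` and `openai` packages are installed and an API key is configured; it is not the Ragas client-construction code:

```python
import instructor
from openai import OpenAI

# Mode.JSON asks the model to emit raw JSON and validates it with Pydantic,
# sidestepping the TOOLS-mode problem with Dict-typed fields described above.
client = instructor.from_openai(OpenAI(), mode=instructor.Mode.JSON)
```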