Heuristic: TruEra TruLens Temperature Zero for Deterministic Scoring
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Evaluation |
| Last Updated | 2026-02-14 08:00 GMT |
Overview
All TruLens LLM-as-a-Judge feedback functions default to `temperature=0.0` for deterministic, reproducible evaluation scores.
Description
Every feedback function in TruLens that uses an LLM for scoring (relevance, groundedness, sentiment, toxicity, etc.) defaults to `temperature=0.0`. This produces the most deterministic output possible from the LLM judge, minimizing random variation between evaluation runs. For reasoning models (o1, o3-mini), temperature is not passed at all; instead `reasoning_effort="medium"` is used, since reasoning models do not support the temperature parameter.
Usage
Apply this heuristic as the default for all LLM-based evaluation. Only increase temperature if you specifically want stochastic evaluation results (e.g., for testing sensitivity of scores to LLM randomness). The default of 0.0 is appropriate for production evaluation pipelines where reproducibility matters.
The Insight (Rule of Thumb)
- Action: Leave `temperature=0.0` as default for all feedback functions. Override only when intentionally testing evaluation variance.
- Value: `temperature=0.0` (deterministic mode).
- Trade-off: Fully deterministic scores (same input always produces same score) at the cost of potentially less "creative" reasoning in chain-of-thought evaluations.
- Exception: Reasoning models (o1, o3) ignore temperature entirely; `reasoning_effort` controls their behavior instead.
Reasoning
Feedback evaluation is a measurement process, and measurements should be reproducible: running the same evaluation on the same data should produce the same result. `temperature=0.0` achieves this by selecting the most probable token at each generation step (greedy decoding). In practice, hosted APIs can still show occasional variation at temperature 0 (e.g., from non-deterministic GPU kernels), but it is the closest to reproducible that the judge can be configured. This is standard practice in LLM-as-a-Judge literature, where evaluation consistency is more important than output diversity.
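The effect of temperature on a single next-token step can be illustrated with a toy sketch (names here are illustrative, not TruLens code): dividing logits by the temperature before softmax sharpens the distribution, and at temperature 0 the choice collapses to the argmax.

```python
import math
import random


def sample_token(logits, temperature):
    """Toy next-token selection. temperature=0 is treated as greedy
    decoding: always pick the highest-logit token."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled softmax sampling (stochastic for temperature > 0).
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return random.choices(range(len(logits)), weights=weights)[0]


logits = [1.0, 3.5, 0.2]
# Greedy decoding: the same token wins on every call, so repeated
# evaluation runs score identically.
assert all(sample_token(logits, 0.0) == 1 for _ in range(10))
```

With any positive temperature, `sample_token` draws from the full distribution, which is exactly the run-to-run variance the default avoids.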
The Bedrock provider explicitly warns when temperature is not 0.0, reinforcing that non-zero temperature is an unusual choice for evaluation.
Code Evidence
Default temperature in `generate_score` from `src/feedback/trulens/feedback/llm_provider.py:166-172`:
```python
def generate_score(
    self,
    system_prompt: str,
    user_prompt: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 10,
    temperature: float = 0.0,
) -> float:
```
Reasoning model handling from `src/feedback/trulens/feedback/llm_provider.py:202-208`:
```python
if self._is_reasoning_model():
    extra_kwargs["reasoning_effort"] = (
        "medium"  # Default reasoning effort
    )
    # Don't pass temperature to reasoning models as they don't support it
else:
    extra_kwargs["temperature"] = temperature
```
LiteLLM provider default from `src/providers/litellm/trulens/providers/litellm/provider.py:158`:
```python
completion_args.setdefault("temperature", 0.0)
```
LangChain provider default from `src/providers/langchain/trulens/providers/langchain/provider.py:111`:
```python
call_kwargs.setdefault("temperature", 0.0)
```
Cortex provider default from `src/providers/cortex/trulens/providers/cortex/provider.py:206-207`:
```python
kwargs["temperature"] = 0.0
```
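The `setdefault` pattern used by the LiteLLM and LangChain providers keeps the deterministic default without clobbering an explicit caller choice. A minimal sketch (the `build_completion_args` helper is hypothetical, written to mirror that pattern):

```python
def build_completion_args(**overrides):
    """Hypothetical helper mirroring the providers' pattern: apply
    temperature=0.0 only when the caller did not supply one."""
    completion_args = dict(overrides)
    completion_args.setdefault("temperature", 0.0)
    return completion_args


# No temperature given: the deterministic default applies.
args = build_completion_args(model="judge-model")
assert args["temperature"] == 0.0

# An explicit override wins, e.g. for variance testing.
args = build_completion_args(model="judge-model", temperature=0.7)
assert args["temperature"] == 0.7
```

`dict.setdefault` is what makes the default non-sticky: it writes `0.0` only when the key is absent, so user intent always takes precedence.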