Heuristic: TruEra TruLens Temperature Zero for Deterministic Scoring

From Leeroopedia
Domains: LLMs, Evaluation
Last Updated: 2026-02-14 08:00 GMT

Overview

All TruLens LLM-as-a-Judge feedback functions default to temperature=0.0 for deterministic, reproducible evaluation scores.

Description

Every feedback function in TruLens that uses an LLM for scoring (relevance, groundedness, sentiment, toxicity, etc.) defaults to `temperature=0.0`. This produces the most deterministic output possible from the LLM judge, minimizing random variation between evaluation runs. For reasoning models (o1, o3-mini), temperature is not passed at all; instead `reasoning_effort="medium"` is used, since reasoning models do not support the temperature parameter.
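A minimal sketch of that dispatch logic follows. The helper name and the set of reasoning-model identifiers are illustrative assumptions, not the TruLens source:

```python
# Sketch of how a TruLens-style provider assembles LLM judge kwargs.
# REASONING_MODELS and build_judge_kwargs are hypothetical names.

REASONING_MODELS = {"o1", "o1-mini", "o3-mini"}  # assumed set for illustration

def build_judge_kwargs(model: str, temperature: float = 0.0) -> dict:
    """Build completion kwargs, mirroring the temperature-vs-reasoning_effort
    branching described above."""
    kwargs: dict = {}
    if model in REASONING_MODELS:
        # Reasoning models do not accept `temperature`; pass reasoning_effort.
        kwargs["reasoning_effort"] = "medium"
    else:
        # Standard models get the deterministic default (or an explicit override).
        kwargs["temperature"] = temperature
    return kwargs
```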

Usage

Apply this heuristic as the default for all LLM-based evaluation. Only increase temperature if you specifically want stochastic evaluation results (e.g., for testing sensitivity of scores to LLM randomness). The default of 0.0 is appropriate for production evaluation pipelines where reproducibility matters.
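One way to test score sensitivity is to run the same evaluation repeatedly at a non-zero temperature and measure the spread. In this sketch the judge call is stubbed with synthetic noise; a real run would invoke a TruLens feedback function instead:

```python
import random
import statistics

def stub_judge(text: str, temperature: float) -> float:
    """Stand-in for an LLM judge call (hypothetical). Noise grows with
    temperature, loosely imitating sampling variance."""
    base = 0.8
    return max(0.0, min(1.0, base + random.gauss(0.0, 0.05 * temperature)))

def score_variance(text: str, temperature: float, runs: int = 20) -> float:
    """Population standard deviation of repeated judge scores."""
    scores = [stub_judge(text, temperature) for _ in range(runs)]
    return statistics.pstdev(scores)

# At temperature=0.0 the stub is noise-free, so every run returns 0.8
# and the spread is exactly zero -- the reproducibility this heuristic buys.
assert score_variance("answer", temperature=0.0) == 0.0
```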

The Insight (Rule of Thumb)

  • Action: Leave `temperature=0.0` as default for all feedback functions. Override only when intentionally testing evaluation variance.
  • Value: `temperature=0.0` (deterministic mode).
  • Trade-off: Fully deterministic scores (same input always produces same score) at the cost of potentially less "creative" reasoning in chain-of-thought evaluations.
  • Exception: Reasoning models (o1, o3) ignore temperature entirely; `reasoning_effort` controls their behavior instead.

Reasoning

Feedback evaluation is a measurement process, and measurements should be reproducible: running the same evaluation on the same data should produce the same result. `temperature=0.0` achieves this by selecting the most probable token at each generation step (greedy decoding). This is standard practice in the LLM-as-a-Judge literature, where evaluation consistency matters more than output diversity.
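The greedy-versus-sampling distinction can be shown on a toy next-token distribution (the scores below are made up for illustration):

```python
import math
import random

logits = {"yes": 2.1, "no": 0.4, "maybe": 1.7}  # toy next-token scores

def greedy(dist: dict) -> str:
    # temperature=0.0 behaves like greedy decoding: always pick the argmax token.
    return max(dist, key=dist.get)

def sample(dist: dict, temperature: float) -> str:
    # temperature > 0 softmax-samples, so repeated calls can disagree.
    weights = [math.exp(v / temperature) for v in dist.values()]
    return random.choices(list(dist), weights=weights)[0]

# Greedy decoding is reproducible across calls; sampling is not guaranteed to be.
assert all(greedy(logits) == "yes" for _ in range(100))
```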

The Bedrock provider explicitly warns when temperature is not 0.0, reinforcing that non-zero temperature is an unusual choice for evaluation.
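A guard in that spirit might look like the following; the function name and message are illustrative, not the actual Bedrock provider code:

```python
import warnings

def check_eval_temperature(temperature: float) -> None:
    """Hypothetical guard warning on non-deterministic evaluation settings,
    in the spirit of the Bedrock provider's behavior."""
    if temperature != 0.0:
        warnings.warn(
            f"temperature={temperature} makes evaluation scores "
            "non-reproducible; 0.0 is recommended for scoring.",
            stacklevel=2,
        )
```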

Code Evidence

Default temperature in `generate_score` from `src/feedback/trulens/feedback/llm_provider.py:166-172`:

def generate_score(
    self,
    system_prompt: str,
    user_prompt: Optional[str] = None,
    min_score_val: int = 0,
    max_score_val: int = 10,
    temperature: float = 0.0,
) -> float:

Reasoning model handling from `src/feedback/trulens/feedback/llm_provider.py:202-208`:

if self._is_reasoning_model():
    extra_kwargs["reasoning_effort"] = (
        "medium"  # Default reasoning effort
    )
    # Don't pass temperature to reasoning models as they don't support it
else:
    extra_kwargs["temperature"] = temperature

LiteLLM provider default from `src/providers/litellm/trulens/providers/litellm/provider.py:158`:

completion_args.setdefault("temperature", 0.0)

LangChain provider default from `src/providers/langchain/trulens/providers/langchain/provider.py:111`:

call_kwargs.setdefault("temperature", 0.0)
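The `setdefault` pattern in both providers means 0.0 applies only when the caller has not chosen a temperature, so an explicit override still passes through. A small self-contained demonstration (the wrapper function is hypothetical):

```python
def with_default_temperature(completion_args: dict) -> dict:
    """Apply the deterministic default without clobbering a caller override,
    mirroring the setdefault lines quoted above."""
    completion_args.setdefault("temperature", 0.0)
    return completion_args

# No temperature given: the deterministic default is filled in.
assert with_default_temperature({})["temperature"] == 0.0
# Caller opted into stochastic scoring: the override survives.
assert with_default_temperature({"temperature": 0.7})["temperature"] == 0.7
```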

Cortex provider default from `src/providers/cortex/trulens/providers/cortex/provider.py:206-207`:

kwargs["temperature"] = 0.0
