Implementation:Run llama Llama index GuidelineEvaluator

From Leeroopedia
Knowledge Sources
Domains Evaluation, Guideline
Last Updated 2026-02-11 19:00 GMT

Overview

Evaluates whether a query-response pair adheres to a set of user-defined or default guidelines, returning a pass/fail result with structured feedback.

Description

The GuidelineEvaluator is a concrete implementation of BaseEvaluator that assesses whether a generated response follows specified quality guidelines. Unlike scoring-based evaluators, this evaluator produces a binary passing result (True/False) along with detailed feedback.

The evaluation workflow is:

  1. The LLM is prompted with the query, response, and guidelines using a configurable eval_template.
  2. The LLM output is parsed through a PydanticOutputParser that extracts an EvaluationData object containing a passing boolean and a feedback string.
  3. The result is converted to an EvaluationResult with score set to 1.0 if passing or 0.0 if not.
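
Steps 2 and 3 above can be sketched without the library. This is a minimal illustration, not the evaluator's actual code: the dataclass stands in for the Pydantic model, `parse_verdict` and `to_score` are hypothetical helper names, and the sketch assumes the LLM replies with a JSON object.

```python
import json
from dataclasses import dataclass


@dataclass
class EvaluationData:
    """Stand-in for the parsed verdict: a passing flag plus feedback text."""
    passing: bool
    feedback: str


def parse_verdict(raw_llm_output: str) -> EvaluationData:
    # Hypothetical parser: assumes the LLM output is a JSON object
    # with "passing" and "feedback" keys.
    payload = json.loads(raw_llm_output)
    return EvaluationData(
        passing=bool(payload["passing"]),
        feedback=str(payload["feedback"]),
    )


def to_score(data: EvaluationData) -> float:
    # Step 3: map the binary verdict onto a numeric score.
    return 1.0 if data.passing else 0.0


raw = '{"passing": false, "feedback": "The answer gives no figures."}'
verdict = parse_verdict(raw)
print(verdict.passing, to_score(verdict))
```

The score carries no more information than the boolean; it exists so that pass/fail results can be averaged alongside other evaluators' scores.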

The default guidelines instruct the LLM to check that:

  • The response fully answers the query.
  • The response avoids being vague or ambiguous.
  • The response is specific and uses statistics or numbers when possible.

Custom guidelines can be provided as a string during initialization. The eval_template defaults to a prompt that presents the query, response, and guidelines, then asks for constructive criticism. The output parser can also be customized by passing a different PydanticOutputParser instance.
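
To make the role of the template concrete, the sketch below fills a prompt with the three inputs. The template wording here is illustrative only and may differ from the library's built-in default; `EVAL_TEMPLATE` and `build_prompt` are hypothetical names.

```python
# Illustrative template text; the library's built-in wording may differ.
EVAL_TEMPLATE = (
    "Here is the original query:\n"
    "Query: {query}\n"
    "Critique the following response based on the guidelines below:\n"
    "Response: {response}\n"
    "Guidelines: {guidelines}\n"
    "Now please provide constructive criticism.\n"
)


def build_prompt(query: str, response: str, guidelines: str,
                 template: str = EVAL_TEMPLATE) -> str:
    # Substitute the three inputs into the template placeholders.
    return template.format(query=query, response=response, guidelines=guidelines)


prompt = build_prompt(
    query="When was Python first released?",
    response="Python was first released in February 1991.",
    guidelines="The response must include specific dates or timeframes.",
)
print(prompt)
```

A custom template would need to keep the same three placeholders so that the query, response, and guidelines all reach the LLM.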

The evaluator only considers the query and response parameters; the contexts parameter is ignored.

Usage

Use this evaluator when you need to enforce specific quality standards on LLM responses, such as checking for specificity, factual grounding, or adherence to a style guide. It is useful in production pipelines where binary pass/fail gating is needed rather than continuous scoring.
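
A pass/fail gate of the kind described above might look like the following. This is a sketch under stated assumptions: `ResultStub` is a hypothetical stand-in for an evaluation result, and `gate` and its fallback message are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResultStub:
    """Hypothetical stand-in exposing the fields the gate reads."""
    passing: Optional[bool]
    feedback: Optional[str] = None


def gate(response: str, result: ResultStub,
         fallback: str = "Sorry, I could not produce a reliable answer.") -> str:
    # Treat a missing verdict (None) the same as a failure.
    if result.passing:
        return response
    return fallback


ok = gate("Python 0.9.0 was released in February 1991.", ResultStub(passing=True))
blocked = gate("It came out a while ago.",
               ResultStub(passing=False, feedback="Too vague."))
print(ok)
print(blocked)
```

In a real pipeline the feedback string could be logged or fed back into a retry prompt instead of being discarded.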

Code Reference

Source Location

Signature

class GuidelineEvaluator(BaseEvaluator):
    def __init__(
        self,
        llm: Optional[LLM] = None,
        guidelines: Optional[str] = None,
        eval_template: Optional[Union[str, BasePromptTemplate]] = None,
        output_parser: Optional[PydanticOutputParser] = None,
    ) -> None: ...

    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        sleep_time_in_seconds: int = 0,
        **kwargs: Any,
    ) -> EvaluationResult: ...

Import

from llama_index.core.evaluation.guideline import GuidelineEvaluator

I/O Contract

Inputs

Name | Type | Required | Description
--- | --- | --- | ---
llm | Optional[LLM] | No | The LLM to use for evaluation. Defaults to Settings.llm.
guidelines | Optional[str] | No | Custom guidelines for evaluating the response. Defaults to built-in guidelines about specificity and completeness.
eval_template | Optional[Union[str, BasePromptTemplate]] | No | Custom evaluation prompt template. Defaults to the built-in template.
output_parser | Optional[PydanticOutputParser] | No | Custom output parser. Defaults to a PydanticOutputParser for EvaluationData.
query | Optional[str] | Yes (aevaluate) | The user query to evaluate against. Typed as optional in the signature, but must be provided.
response | Optional[str] | Yes (aevaluate) | The generated response to evaluate. Typed as optional in the signature, but must be provided.
contexts | Optional[Sequence[str]] | No (aevaluate) | Accepted but ignored by this evaluator.
sleep_time_in_seconds | int | No (aevaluate) | Delay before evaluation, for rate limiting. Defaults to 0.
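
The effect of a pre-evaluation delay can be sketched with plain asyncio. This is an illustration of the idea, not the evaluator's internals: `run_after_delay` and `fake_evaluation` are hypothetical names, and the delay is shortened to keep the example fast.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


async def run_after_delay(job: Callable[[], Awaitable[T]],
                          sleep_time_in_seconds: float = 0) -> T:
    # Pause first, then run the job; spacing calls this way helps
    # stay under a provider's rate limit.
    await asyncio.sleep(sleep_time_in_seconds)
    return await job()


async def fake_evaluation() -> str:
    # Placeholder for a real LLM-backed evaluation call.
    return "evaluated"


async def main() -> List[str]:
    results = []
    for _ in range(3):
        results.append(
            await run_after_delay(fake_evaluation, sleep_time_in_seconds=0.01)
        )
    return results


results = asyncio.run(main())
print(results)
```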

Outputs

Name | Type | Description
--- | --- | ---
result | EvaluationResult | Contains the query, response, passing (bool), score (1.0 or 0.0), and feedback from the LLM.

Usage Examples

import asyncio

from llama_index.core.evaluation.guideline import GuidelineEvaluator
from llama_index.llms.openai import OpenAI  # provided by the llama-index-llms-openai package

# Create an evaluator with custom guidelines
evaluator = GuidelineEvaluator(
    llm=OpenAI(model="gpt-4"),
    guidelines=(
        "The response must include specific dates or timeframes.\n"
        "The response must cite at least one source.\n"
        "The response must not exceed 200 words.\n"
    ),
)


async def main() -> None:
    # Evaluate a single query-response pair
    result = await evaluator.aevaluate(
        query="When was Python first released?",
        response="Python was first released on February 20, 1991 by Guido van Rossum.",
    )

    print(f"Passing: {result.passing}")    # True or False
    print(f"Score: {result.score}")        # 1.0 or 0.0
    print(f"Feedback: {result.feedback}")  # Detailed critique


# aevaluate is a coroutine, so it needs an event loop.
# In a notebook, call `await main()` instead.
asyncio.run(main())
