Implementation:Run llama Llama index GuidelineEvaluator
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Guideline |
| Last Updated | 2026-02-11 19:00 GMT |
Overview
Evaluates whether a query-response pair adheres to a set of user-defined or default guidelines, returning a pass/fail result with structured feedback.
Description
The GuidelineEvaluator is a concrete implementation of BaseEvaluator that assesses whether a generated response follows specified quality guidelines. Unlike scoring-based evaluators, this evaluator produces a binary passing result (True/False) along with detailed feedback.
The evaluation workflow is:
- The LLM is prompted with the query, response, and guidelines using a configurable eval_template.
- The LLM output is parsed through a PydanticOutputParser that extracts an EvaluationData object containing a passing boolean and a feedback string.
- The result is converted to an EvaluationResult with score set to 1.0 if passing or 0.0 if not.
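The parse-and-convert steps above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `EvaluationDataSketch`, `EvaluationResultSketch`, `parse_llm_output`, and `to_result` are hypothetical stand-ins, and the raw LLM output is assumed to be a JSON object with `passing` and `feedback` fields.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvaluationDataSketch:
    """Hypothetical stand-in for the parsed EvaluationData object."""
    passing: bool
    feedback: str


@dataclass
class EvaluationResultSketch:
    """Hypothetical stand-in for EvaluationResult (only the fields named above)."""
    query: Optional[str]
    response: Optional[str]
    passing: bool
    score: float
    feedback: str


def parse_llm_output(raw: str) -> EvaluationDataSketch:
    # Assumption: the LLM returns a JSON object with "passing" and "feedback".
    payload = json.loads(raw)
    return EvaluationDataSketch(
        passing=bool(payload["passing"]),
        feedback=str(payload["feedback"]),
    )


def to_result(query: str, response: str, data: EvaluationDataSketch) -> EvaluationResultSketch:
    # Map the binary outcome to a numeric score: 1.0 if passing, else 0.0.
    return EvaluationResultSketch(
        query=query,
        response=response,
        passing=data.passing,
        score=1.0 if data.passing else 0.0,
        feedback=data.feedback,
    )


raw_output = '{"passing": false, "feedback": "The response gives no date."}'
result = to_result("When was Python released?", "A while ago.", parse_llm_output(raw_output))
print(result.passing, result.score, result.feedback)
```

The real evaluator delegates parsing to its PydanticOutputParser; the JSON handling here only mirrors the shape of the data that flows between the steps.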
The default guidelines instruct the LLM to check that:
- The response fully answers the query.
- The response avoids being vague or ambiguous.
- The response is specific and uses statistics or numbers when possible.
Custom guidelines can be provided as a string during initialization. The eval_template defaults to a prompt that presents the query, response, and guidelines, then asks for constructive criticism. The output parser can also be customized by passing a different PydanticOutputParser instance.
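As a rough illustration of how custom guidelines end up in the prompt, the sketch below fills a template with plain `str.format`. The template text and the `build_prompt` helper are hypothetical; the placeholder names `query`, `response`, and `guidelines` are assumed from the description above, and the real evaluator formats its prompt through a prompt-template object rather than this helper.

```python
# Illustrative template only; the library's default wording may differ.
SKETCH_TEMPLATE = (
    "Query: {query}\n"
    "Response: {response}\n"
    "Guidelines: {guidelines}\n"
    "Provide constructive criticism.\n"
)

# Guidelines are passed as a single string, one rule per line.
CUSTOM_GUIDELINES = (
    "The response must include specific dates or timeframes.\n"
    "The response must not exceed 200 words.\n"
)


def build_prompt(
    query: str,
    response: str,
    guidelines: str,
    template: str = SKETCH_TEMPLATE,
) -> str:
    # Substitute the three fields into the template's named placeholders.
    return template.format(query=query, response=response, guidelines=guidelines)


prompt = build_prompt(
    query="When was Python first released?",
    response="Python was first released in 1991.",
    guidelines=CUSTOM_GUIDELINES,
)
print(prompt)
```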
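The evaluator considers only the query and response parameters; the contexts parameter and any extra keyword arguments are ignored. Although the signature types query and response as Optional, both must be supplied: aevaluate raises a ValueError if either is None.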
Usage
Use this evaluator when you need to enforce specific quality standards on LLM responses, such as checking for specificity, factual grounding, or adherence to a style guide. It is useful in production pipelines where binary pass/fail gating is needed rather than continuous scoring.
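A pass/fail gate of the kind described above might look like the following sketch. `GateResult` is a hypothetical stand-in exposing only the `passing` and `feedback` fields of the evaluator's result, and `gate_response` and the fallback message are illustrative choices, not part of the library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateResult:
    """Hypothetical stand-in for the evaluator's result object."""
    passing: Optional[bool]
    feedback: Optional[str]


FALLBACK = "Sorry, I could not produce an answer that meets our quality guidelines."


def gate_response(response: str, result: GateResult) -> str:
    # Release the response only on an explicit pass; treat None as a failure.
    if result.passing is True:
        return response
    # In a real pipeline the feedback could be logged or used to drive a retry.
    print(f"Blocked response. Feedback: {result.feedback}")
    return FALLBACK


released = gate_response(
    "Python was first released in February 1991.",
    GateResult(passing=True, feedback="Specific and complete."),
)
blocked = gate_response(
    "A while ago.",
    GateResult(passing=False, feedback="No date given."),
)
```

Treating a missing `passing` value as a failure is a conservative design choice for this sketch; a pipeline that prefers availability over strictness could invert it.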
Code Reference
Source Location
- Repository: Run_llama_Llama_index
- File: llama-index-core/llama_index/core/evaluation/guideline.py
Signature
class GuidelineEvaluator(BaseEvaluator):
    def __init__(
        self,
        llm: Optional[LLM] = None,
        guidelines: Optional[str] = None,
        eval_template: Optional[Union[str, BasePromptTemplate]] = None,
        output_parser: Optional[PydanticOutputParser] = None,
    ) -> None: ...

    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        sleep_time_in_seconds: int = 0,
        **kwargs: Any,
    ) -> EvaluationResult: ...
Import
from llama_index.core.evaluation.guideline import GuidelineEvaluator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| llm | Optional[LLM] | No | The LLM to use for evaluation. Defaults to Settings.llm. |
| guidelines | Optional[str] | No | Custom guidelines for evaluating the response. Defaults to built-in guidelines about specificity and completeness. |
| eval_template | Optional[Union[str, BasePromptTemplate]] | No | Custom evaluation prompt template. Defaults to the built-in template. |
| output_parser | Optional[PydanticOutputParser] | No | Custom output parser. Defaults to a PydanticOutputParser for EvaluationData. |
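| query | Optional[str] | Yes (aevaluate) | The user query to evaluate against. Must not be None. |
| response | Optional[str] | Yes (aevaluate) | The generated response to evaluate. Must not be None. |
| contexts | Optional[Sequence[str]] | No (aevaluate) | Ignored by this evaluator. |
| sleep_time_in_seconds | int | No (aevaluate) | Delay before evaluation for rate limiting. Defaults to 0. |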
Outputs
| Name | Type | Description |
|---|---|---|
| result | EvaluationResult | Contains the query, response, passing (bool), score (1.0 or 0.0), and feedback from the LLM. |
Usage Examples
import asyncio

from llama_index.core.evaluation.guideline import GuidelineEvaluator
from llama_index.llms.openai import OpenAI  # requires the llama-index-llms-openai package

# Create evaluator with custom guidelines
evaluator = GuidelineEvaluator(
    llm=OpenAI(model="gpt-4"),
    guidelines=(
        "The response must include specific dates or timeframes.\n"
        "The response must cite at least one source.\n"
        "The response must not exceed 200 words.\n"
    ),
)


async def main() -> None:
    # Evaluate a response (aevaluate is a coroutine, so it must be awaited)
    result = await evaluator.aevaluate(
        query="When was Python first released?",
        response="Python was first released on February 20, 1991 by Guido van Rossum.",
    )
    print(f"Passing: {result.passing}")  # True or False
    print(f"Score: {result.score}")  # 1.0 or 0.0
    print(f"Feedback: {result.feedback}")  # Detailed critique


# In a notebook with a running event loop, use `await main()` instead
asyncio.run(main())