Implementation:EvolvingLMMs Lab Lmms eval LLM Judge Utils

Overview

This implementation provides two static utility classes: JudgePromptBuilder, which formats evaluation prompts from templates, and ResponseParser, which extracts binary verdicts, numeric scores, score pairs, and JSON objects from raw judge output. The parsers return fallback values on malformed output rather than raising, so they can sit directly between evaluation code and the judge API call.

File Location

lmms_eval/llm_judge/utils.py (115 lines)

Related Principle

LLM as Judge

Dependencies

re: Regular expression parsing
typing: Type hints
json: JSON decoding (imported lazily inside parse_json_response)
LLM Judge Prompt Templates: BINARY_JUDGE_PROMPT, COMPARATIVE_JUDGE_PROMPT, CORRECTNESS_JUDGE_PROMPT

Core Components

JudgePromptBuilder

Static helper class for building evaluation prompts from templates.

build_binary_prompt

@staticmethod
def build_binary_prompt(
    question: str,
    answer: str,
    prediction: str,
    output_format: str = "0/1",
    custom_prompt: Optional[str] = None,
    **kwargs
) -> str

Build prompt for binary correctness evaluation.

Parameters:

question (str): The question being evaluated
answer (str): Ground truth answer
prediction (str): Model's predicted answer to evaluate
output_format (str): Output format, "0/1" or "yes/no" (default: "0/1")
custom_prompt (Optional[str]): Custom prompt template (overrides default)
**kwargs: Additional formatting arguments

Returns:

str: Formatted evaluation prompt

Logic:

If custom_prompt provided:
Formats it with question, answer, prediction, pred (an alias for prediction), and any kwargs
Returns formatted custom prompt
Otherwise:
Determines positive/negative symbols from output_format:
- "0/1" or "1/0" → positive="1", negative="0"
- Otherwise → positive="Yes", negative="No"
Formats BINARY_JUDGE_PROMPT with all parameters
Returns formatted standard prompt
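
The branching above can be sketched as follows. This is a minimal illustration, not the library's code: TEMPLATE_SKETCH is a hypothetical stand-in for BINARY_JUDGE_PROMPT (whose actual wording lives in the prompt templates module), and build_binary_prompt_sketch is an illustrative name.

```python
from typing import Optional

# Hypothetical stand-in template; the real BINARY_JUDGE_PROMPT text may differ.
TEMPLATE_SKETCH = (
    "Question: {question}\n"
    "Ground truth: {answer}\n"
    "Prediction: {prediction}\n"
    "Reply {positive} if correct, {negative} otherwise."
)


def build_binary_prompt_sketch(
    question: str,
    answer: str,
    prediction: str,
    output_format: str = "0/1",
    custom_prompt: Optional[str] = None,
    **kwargs,
) -> str:
    # A custom template short-circuits everything else.
    if custom_prompt:
        return custom_prompt.format(
            question=question,
            answer=answer,
            pred=prediction,        # alias, so templates may use {pred}
            prediction=prediction,
            **kwargs,
        )
    # Numeric formats map to 1/0; anything else falls back to Yes/No.
    if output_format in ("0/1", "1/0"):
        positive, negative = "1", "0"
    else:
        positive, negative = "Yes", "No"
    return TEMPLATE_SKETCH.format(
        question=question,
        answer=answer,
        prediction=prediction,
        positive=positive,
        negative=negative,
    )


print(build_binary_prompt_sketch("2+2?", "4", "four", output_format="yes/no"))
```

Because str.format ignores unused keyword arguments, a custom template can reference either {pred} or {prediction}. A placeholder with no matching argument raises KeyError, and literal braces in a custom template (for example an inline JSON sample) must be doubled as {{ and }}.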

Example:

# Standard prompt
prompt = JudgePromptBuilder.build_binary_prompt(
    question="What is the capital of France?",
    answer="Paris",
    prediction="The capital is Paris",
    output_format="0/1"
)

# Custom prompt
custom = "Q: {question}\nA: {answer}\nP: {prediction}\nCorrect? (1/0)"
prompt = JudgePromptBuilder.build_binary_prompt(
    question="...",
    answer="...",
    prediction="...",
    custom_prompt=custom
)

build_comparative_prompt

@staticmethod
def build_comparative_prompt(
    question: str,
    response1: str,
    response2: str,
    context: Optional[str] = None,
    score_range: Tuple[int, int] = (1, 10),
    custom_prompt: Optional[str] = None,
    evaluation_instruction: Optional[str] = None,
    **kwargs
) -> str

Build prompt for comparative evaluation of two responses.

Parameters:

question (str): The question both responses address
response1 (str): First response to evaluate
response2 (str): Second response to evaluate
context (Optional[str]): Additional context for evaluation (default: None)
score_range (Tuple[int, int]): Min and max scores (default: (1, 10))
custom_prompt (Optional[str]): Custom prompt template
evaluation_instruction (Optional[str]): Custom evaluation instructions
**kwargs: Additional formatting arguments

Returns:

str: Formatted comparative evaluation prompt

Logic:

If custom_prompt provided:
Formats with question, response1, response2, context (or ""), and kwargs
Returns formatted custom prompt
Otherwise:
Constructs context_section:
- If context exists: "[Context]\n{context}\n\n"
- Otherwise: ""
If evaluation_instruction not provided:
- Sets default: "Please provide scores from {min} to {max}."
Formats COMPARATIVE_JUDGE_PROMPT with all parameters
Returns formatted standard prompt
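
A rough sketch of the standard (non-custom) path described above. COMPARATIVE_TEMPLATE_SKETCH is a hypothetical placeholder for COMPARATIVE_JUDGE_PROMPT, and treating an empty evaluation_instruction the same as None is an assumption.

```python
from typing import Optional, Tuple

# Hypothetical stand-in; the real COMPARATIVE_JUDGE_PROMPT wording may differ.
COMPARATIVE_TEMPLATE_SKETCH = (
    "{context_section}"
    "[Question]\n{question}\n\n"
    "[Response 1]\n{response1}\n\n"
    "[Response 2]\n{response2}\n\n"
    "{evaluation_instruction}"
)


def build_comparative_prompt_sketch(
    question: str,
    response1: str,
    response2: str,
    context: Optional[str] = None,
    score_range: Tuple[int, int] = (1, 10),
    evaluation_instruction: Optional[str] = None,
) -> str:
    # The context block is emitted only when context is non-empty.
    context_section = f"[Context]\n{context}\n\n" if context else ""
    # Fall back to a generic instruction derived from the score range.
    if not evaluation_instruction:
        low, high = score_range
        evaluation_instruction = f"Please provide scores from {low} to {high}."
    return COMPARATIVE_TEMPLATE_SKETCH.format(
        context_section=context_section,
        question=question,
        response1=response1,
        response2=response2,
        evaluation_instruction=evaluation_instruction,
    )


print(build_comparative_prompt_sketch("Explain DNS", "Answer A", "Answer B", score_range=(0, 5)))
```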

Example:

prompt = JudgePromptBuilder.build_comparative_prompt(
    question="Explain quantum computing",
    response1="Quantum computing uses qubits...",
    response2="Quantum computers leverage superposition...",
    context="For a high school audience",
    score_range=(1, 10),
    evaluation_instruction="Focus on clarity and accuracy."
)

build_correctness_prompt

@staticmethod
def build_correctness_prompt(
    question: str,
    answer: str,
    prediction: str,
    output_format: str = "yes/no",
    **kwargs
) -> str

Build prompt for mathematical/semantic correctness evaluation, ignoring formatting.

Parameters:

question (str): The question being evaluated
answer (str): Correct answer
prediction (str): Solution to evaluate
output_format (str): Output format, "yes/no" or "0/1" (default: "yes/no")
**kwargs: Additional formatting arguments (unused)

Returns:

str: Formatted correctness evaluation prompt

Logic:

Determines positive/negative symbols:
- "yes/no" → positive="Yes", negative="No"
- Otherwise → positive="1", negative="0"
Formats CORRECTNESS_JUDGE_PROMPT with question, answer, prediction, symbols
Returns formatted prompt
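
Note that the fallback here points the opposite way from build_binary_prompt: an unrecognized output_format yields 1/0 in this builder but Yes/No in the binary one. The sketch below contrasts the two rules as described on this page; the function names are illustrative, and exact (case-sensitive) string comparison is an assumption.

```python
from typing import Tuple


def binary_symbols(output_format: str) -> Tuple[str, str]:
    # Rule described for build_binary_prompt: numeric formats win, else Yes/No.
    return ("1", "0") if output_format in ("0/1", "1/0") else ("Yes", "No")


def correctness_symbols(output_format: str) -> Tuple[str, str]:
    # Rule described for build_correctness_prompt: "yes/no" wins, else 1/0.
    return ("Yes", "No") if output_format == "yes/no" else ("1", "0")


for fmt in ("0/1", "yes/no", "Yes/No", "true/false"):
    print(f"{fmt!r:14} binary={binary_symbols(fmt)} correctness={correctness_symbols(fmt)}")
```

Passing one of the two documented values explicitly sidesteps the difference.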

Example:

prompt = JudgePromptBuilder.build_correctness_prompt(
    question="Solve: x^2 = 16",
    answer="x = 4 or x = -4",
    prediction="\\boxed{x = \\pm 4}",
    output_format="yes/no"
)

ResponseParser

Static helper class for parsing different types of judge responses.

parse_binary_response

@staticmethod
def parse_binary_response(response: str, output_format: str = "0/1") -> Union[int, bool]

Parse binary response (0/1 or yes/no) from judge output.

Parameters:

response (str): Raw response text from judge
output_format (str): Expected format, "0/1" or "yes/no" (default: "0/1")

Returns:

Union[int, bool]:
- For "0/1" or "1/0": Returns 1 (correct) or 0 (incorrect)
- For "yes/no": Returns True or False

Logic:

Strips and lowercases response
If output_format is "0/1" or "1/0":
Checks if response contains any of: "1", "[1]", "score: 1", "answer: 1"
Returns 1 if found, 0 otherwise
Otherwise (yes/no format):
Returns True if response == "yes" or starts with "yes"
Returns False otherwise

Robustness:

Handles various formats: "1", "[1]", "score: 1", "answer: 1"
Case-insensitive matching
Flexible yes/no detection (prefix match)
Defaults to negative (0/False) when no positive marker is found
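
A sketch of the matching rules as described above, useful for reasoning about edge cases. parse_binary_sketch is an illustrative re-implementation, not the library function. If the numeric branch really is a substring check, any response containing the character "1" counts as positive.

```python
from typing import Union


def parse_binary_sketch(response: str, output_format: str = "0/1") -> Union[int, bool]:
    text = response.strip().lower()
    if output_format in ("0/1", "1/0"):
        # Substring containment: the bare "1" marker subsumes the other three.
        markers = ("1", "[1]", "score: 1", "answer: 1")
        return 1 if any(marker in text for marker in markers) else 0
    # yes/no mode: only an exact or prefix match counts as positive.
    return text == "yes" or text.startswith("yes")


# Edge cases worth knowing about (results are for this sketch):
print(parse_binary_sketch("10"))                           # 1, because "1" is a substring
print(parse_binary_sketch("0 (the answer is not 1)"))      # 1, same reason
print(parse_binary_sketch("Answer: yes", "yes/no"))        # False, "yes" is not a prefix
print(parse_binary_sketch("Yes. It matches.", "yes/no"))   # True
```

Instructing the judge to reply with the bare symbol only keeps responses inside the cases these rules handle well.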

Example:

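from lmms_eval.llm_judge.utils import ResponseParser
parse_binary_response = ResponseParser.parse_binary_response  # alias for brevity

# Various formats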
parse_binary_response("1", "0/1")                # → 1
parse_binary_response("[1]", "0/1")              # → 1
parse_binary_response("score: 1", "0/1")         # → 1
parse_binary_response("0", "0/1")                # → 0
parse_binary_response("Yes", "yes/no")           # → True
parse_binary_response("yes, correct", "yes/no")  # → True
parse_binary_response("No", "yes/no")            # → False

parse_score_response

@staticmethod
def parse_score_response(
    response: str,
    score_range: Optional[Tuple[float, float]] = None
) -> float

Parse a single numerical score from response text.

Parameters:

response (str): Raw response text containing a score
score_range (Optional[Tuple[float, float]]): Valid (min, max) range for clamping

Returns:

float: Extracted score (clamped to range if provided)

Logic:

Tries to extract first number from response:
Uses regex: r"-?\d+(?:\.\d+)?"
Finds all matching numbers (including decimals and negatives)
If numbers found:
Converts first number to float
If score_range provided:
Clamps score to [min, max] range
Returns score
On any exception or no numbers found:
Returns minimum score (score_range[0]) or 0.0 if no range

Robustness:

Extracts first number from verbose responses
Handles integers and decimals
Handles negative numbers
Clamps to valid range
Graceful fallback to minimum score
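
The extract-then-clamp logic can be sketched as below. parse_score_sketch is an illustrative re-implementation built from the description on this page, not the library's code; the explicit float() conversions are an addition to keep the return type uniform.

```python
import re
from typing import Optional, Tuple


def parse_score_sketch(
    response: str,
    score_range: Optional[Tuple[float, float]] = None,
) -> float:
    try:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
        if numbers:
            score = float(numbers[0])  # only the first number is used
            if score_range:
                low, high = score_range
                score = max(low, min(high, score))  # clamp into [low, high]
            return float(score)
    except Exception:
        pass  # e.g. response is not a string
    # Fallback: minimum of the range, or 0.0 when no range was given.
    return float(score_range[0]) if score_range else 0.0


print(parse_score_sketch("8/10", (1, 10)))                                   # 8.0
print(parse_score_sketch("On a scale of 1 to 10, I rate this 7", (1, 10)))   # 1.0
print(parse_score_sketch("No valid score", (1, 10)))                         # 1.0
```

The second call shows the main caveat of first-number extraction: when the judge restates the scale before the score, the scale's lower bound is what gets parsed. Asking the judge to put the score first avoids this.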

Example:

parse_score_response = ResponseParser.parse_score_response  # alias for brevity

parse_score_response("Score: 8.5", (1, 10))           # → 8.5
parse_score_response("The score is 12", (1, 10))      # → 10.0 (clamped)
parse_score_response("Rating: -5", (1, 10))           # → 1.0 (clamped)
parse_score_response("No valid score", (1, 10))       # → 1.0 (fallback)
parse_score_response("3.14159")                       # → 3.14159

parse_comparative_response

@staticmethod
def parse_comparative_response(response: str) -> Tuple[float, float]

Parse two comparative scores from response text.

Parameters:

response (str): Raw response text (expected format: "score1 score2" on first line)

Returns:

Tuple[float, float]: (score1, score2) for the two responses being compared

Logic:

Tries to extract scores:
Splits response into lines
Gets first line (score line)
Normalizes separators: replaces "," and ";" with " "
Extracts all numbers with regex: r"-?\d+(?:\.\d+)?"
If at least 2 numbers found:
Returns (first_number, second_number) as floats
On any exception or insufficient numbers:
Returns (-1.0, -1.0) as error sentinel

Robustness:

Handles multiple separator formats (space, comma, semicolon)
Focuses on first line (ignores explanation text)
Extracts decimals and negatives
Clear error sentinel (-1.0, -1.0) for parse failures
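
A sketch of the first-line extraction described above, written to show where it can misread a response. parse_comparative_sketch is an illustrative re-implementation; whether the real function strips leading whitespace before splitting is not stated here, so the sketch does not.

```python
import re
from typing import Tuple


def parse_comparative_sketch(response: str) -> Tuple[float, float]:
    try:
        first_line = response.split("\n")[0]
        # Normalize separators so "8,7" and "8;7" behave like "8 7".
        first_line = first_line.replace(",", " ").replace(";", " ")
        numbers = re.findall(r"-?\d+(?:\.\d+)?", first_line)
        if len(numbers) >= 2:
            return float(numbers[0]), float(numbers[1])
    except Exception:
        pass
    return -1.0, -1.0  # error sentinel


print(parse_comparative_sketch("8 7\nAssistant 1 was more detailed."))  # (8.0, 7.0)
print(parse_comparative_sketch("Assistant 1: 8, Assistant 2: 7"))       # (1.0, 8.0)
print(parse_comparative_sketch("8-7"))                                  # (8.0, -7.0)
print(parse_comparative_sketch("\n8 7"))                                # (-1.0, -1.0)
```

The second and third calls show that any digit on the first line is a candidate score, including assistant labels and a hyphen read as a minus sign. Prompts that request exactly two bare numbers on the first line keep the parse reliable.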

Example:

parse_comparative_response = ResponseParser.parse_comparative_response  # alias for brevity

response = """8 7
Assistant 1 provided more detail and accuracy..."""
parse_comparative_response(response)              # → (8.0, 7.0)

parse_comparative_response("8.5, 7.0")            # → (8.5, 7.0)
parse_comparative_response("Score: 9; Score: 6")  # → (9.0, 6.0)
parse_comparative_response("Invalid")             # → (-1.0, -1.0)

parse_json_response

@staticmethod
def parse_json_response(response: str) -> Dict[str, Any]

Parse JSON-formatted response from judge output.

Parameters:

response (str): Raw response text containing JSON

Returns:

Dict[str, Any]: Parsed JSON object, or empty dict on failure

Logic:

Tries to extract and parse JSON:
Uses regex to find JSON object: r"\{.*\}" (DOTALL mode)
If JSON match found:
Imports json module
Parses matched string
Returns parsed dictionary
On any exception (no match, invalid JSON, etc.):
Returns empty dict {}

Robustness:

Extracts JSON from verbose responses (ignores surrounding text)
DOTALL regex mode handles multi-line JSON
Graceful fallback to empty dict
Lazy import of json module (only when needed)
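
The greedy pattern has a consequence worth seeing in code: it spans from the first "{" to the last "}", so a response with two separate objects (or stray braces in the surrounding text) fails to parse and yields {}. The sketch below is an illustrative re-implementation of the described logic, not the library function, and imports json at module level for simplicity.

```python
import json
import re
from typing import Any, Dict


def parse_json_sketch(response: str) -> Dict[str, Any]:
    try:
        # Greedy match with DOTALL: first "{" through last "}", across newlines.
        match = re.search(r"\{.*\}", response, re.DOTALL)
        if match:
            return json.loads(match.group())
    except Exception:
        pass  # invalid JSON, non-string input, etc.
    return {}


print(parse_json_sketch('Result:\n{"accuracy": 0.9}\nDone.'))   # {'accuracy': 0.9}
print(parse_json_sketch('A: {"a": 1} B: {"b": 2}'))             # {}
print(parse_json_sketch("{'accuracy': 0.9}"))                   # {}
```

The last call fails because single-quoted keys are a Python dict literal, not valid JSON.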

Example:

parse_json_response = ResponseParser.parse_json_response  # alias for brevity

response = '''The evaluation results:
{
    "accuracy": 0.9,
    "clarity": 0.85,
    "completeness": 0.95
}
Additional comments...'''

parse_json_response(response)
# → {"accuracy": 0.9, "clarity": 0.85, "completeness": 0.95}

parse_json_response("Not JSON")  # → {}
parse_json_response('{"key": "value"}')  # → {"key": "value"}

Design Patterns

Static Utility Classes

Both classes contain only static methods, acting as namespaces for related functions. This design:

Groups related functionality
Avoids unnecessary instantiation
Provides clear API surface
Enables easy importing (from lmms_eval.llm_judge.utils import JudgePromptBuilder)

Template Method

Prompt builders follow a consistent pattern:

Check for custom prompt where the builder accepts one (early return if present)
Determine format-specific parameters
Format standard template
Return result

Defensive Parsing

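Response parsers do not raise on malformed input. The score, comparative, and JSON parsers wrap extraction in try-except blocks, and every parser falls back to a sensible default: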

parse_binary_response: Defaults to 0/False
parse_score_response: Defaults to minimum score
parse_comparative_response: Returns (-1.0, -1.0) sentinel
parse_json_response: Returns empty dict

This ensures evaluation pipelines don't crash on unexpected formats.
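
One side effect of these defaults is that the fallbacks of parse_binary_response and parse_score_response look the same as genuine negative or minimum-score results. A caller that needs to tell them apart could pre-check the raw text. The helper below is a hypothetical sketch: score_or_none and stub_parser are not part of the library, and stub_parser merely stands in for ResponseParser.parse_score_response.

```python
import re
from typing import Callable, Optional, Tuple

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def score_or_none(
    response: object,
    parse_score: Callable[[str, Optional[Tuple[float, float]]], float],
    score_range: Optional[Tuple[float, float]] = None,
) -> Optional[float]:
    """Return None when the response holds no number, instead of a fallback score."""
    if not isinstance(response, str) or _NUMBER.search(response) is None:
        return None
    return parse_score(response, score_range)


def stub_parser(response: str, score_range: Optional[Tuple[float, float]]) -> float:
    # Stand-in parser for the demo: first number, no clamping.
    return float(_NUMBER.search(response).group())


print(score_or_none("I cannot rate this.", stub_parser, (1, 10)))  # None
print(score_or_none("Score: 1", stub_parser, (1, 10)))             # 1.0
```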

Format Flexibility

Parsers handle multiple output variations:

Binary: "1", "[1]", "score: 1", "Yes", "yes, because..."
Scores: "8.5", "Score: 8.5", "The score is 8.5"
Comparative: "8 7", "8.5, 7.0", "Score: 9; Score: 6"
JSON: Embedded in text or standalone

Usage Patterns

Prompt Building Workflow

from lmms_eval.llm_judge.utils import JudgePromptBuilder, ResponseParser

# Build prompt
prompt = JudgePromptBuilder.build_binary_prompt(
    question="What is H2O?",
    answer="Water",
    prediction="Water molecule",
    output_format="0/1"
)

# Send to judge (`judge` is a placeholder for whatever client returns the judge's text)
response = judge.call(prompt)

# Parse response with the same output_format used to build the prompt
result = ResponseParser.parse_binary_response(response, "0/1")

Custom Prompt with Parsing

custom_template = """
Question: {question}
Expected: {answer}
Student Answer: {prediction}
Extra Context: {extra_info}

Is the student answer correct? Reply with 1 (yes) or 0 (no).
"""

prompt = JudgePromptBuilder.build_binary_prompt(
    question="...",
    answer="...",
    prediction="...",
    custom_prompt=custom_template,
    extra_info="Consider partial credit."  # **kwargs
)

response = judge.call(prompt)
result = ResponseParser.parse_binary_response(response, "0/1")

Comparative Evaluation Pipeline

# Build prompt
prompt = JudgePromptBuilder.build_comparative_prompt(
    question="Explain photosynthesis",
    response1=model_a_output,
    response2=model_b_output,
    score_range=(1, 10)
)

# Get evaluation
response = judge.call(prompt)

# Parse scores
score_a, score_b = ResponseParser.parse_comparative_response(response)

if (score_a, score_b) == (-1.0, -1.0):  # Error sentinel: parse failed
    print("Failed to parse scores")
elif score_a == score_b:
    print(f"Tie ({score_a} vs {score_b})")
else:
    winner = "Model A" if score_a > score_b else "Model B"
    print(f"Winner: {winner} ({score_a} vs {score_b})")

Rubric-based Evaluation

# Build prompt with rubric
rubric = {
    "accuracy": "Factual correctness",
    "clarity": "Clear explanation",
    "completeness": "Addresses all aspects"
}

rubric_text = "\n".join([f"- {k}: {v}" for k, v in rubric.items()])
prompt = f"""Evaluate the response on these criteria:
{rubric_text}

Response: {prediction}

Provide JSON with scores (0-1) for each criterion."""

# Get evaluation
response = judge.call(prompt)

# Parse JSON
scores = ResponseParser.parse_json_response(response)
print(f"Accuracy: {scores.get('accuracy', 'N/A')}")
print(f"Clarity: {scores.get('clarity', 'N/A')}")

Error Handling Best Practices

Check Parse Results

score1, score2 = ResponseParser.parse_comparative_response(response)
if score1 == -1.0 and score2 == -1.0:
    logger.warning(f"Failed to parse comparative scores from: {response}")
    # Fallback logic

Validate Score Ranges

score = ResponseParser.parse_score_response(response, score_range=(1, 10))
# score is guaranteed to be in [1, 10] range
assert 1 <= score <= 10

Handle JSON Parse Failures

scores = ResponseParser.parse_json_response(response)
if not scores:  # Empty dict indicates parse failure
    logger.error(f"Failed to parse JSON from: {response}")
    # Provide default scores or retry
    scores = {"accuracy": 0.5, "clarity": 0.5}

Integration with Base Classes

The utilities are used throughout the base evaluation methods:

# In ServerInterface.evaluate_binary()
prompt = JudgePromptBuilder.build_binary_prompt(...)
request = Request(messages=[{"role": "user", "content": prompt}], ...)
response = self.evaluate(request)
parsed_result = ResponseParser.parse_binary_response(response.content, output_format)

Related Implementations

LLM Judge Base: Uses these utilities in evaluation methods
LLM Judge Prompt Templates: Templates used by JudgePromptBuilder
LLM Judge Protocol: Request/Response types used with these utilities

Testing Considerations

Prompt Builder Tests

Test with and without custom prompts
Verify output_format affects positive/negative symbols
Confirm kwargs are passed through to custom prompts
Test context_section inclusion/exclusion

Response Parser Tests

Test various response formats (verbose, minimal, malformed)
Verify score clamping behavior
Test error cases (no numbers, invalid JSON)
Confirm sentinel values for parse failures

Best Practices

Always specify output_format explicitly for consistency
Use score_range for clamping in scoring tasks
Check for parse failure sentinels ((-1.0, -1.0) score pair, empty dict)
Log raw responses when parsing fails for debugging
Test parsers with real API responses before deployment
Consider retry logic if parsing fails (may be transient formatting issue)
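
The last two practices can be combined in a small retry wrapper. This is a sketch under stated assumptions: compare_with_retry, ask_judge, and parse_pair are hypothetical names, where ask_judge is any callable that sends a prompt and returns the judge's text, and parse_pair is a parser such as ResponseParser.parse_comparative_response that returns (-1.0, -1.0) on failure.

```python
import logging
from typing import Callable, Optional, Tuple

logger = logging.getLogger(__name__)
SENTINEL = (-1.0, -1.0)


def compare_with_retry(
    ask_judge: Callable[[str], str],
    parse_pair: Callable[[str], Tuple[float, float]],
    prompt: str,
    max_attempts: int = 3,
) -> Optional[Tuple[float, float]]:
    """Re-query the judge until the score pair parses, or give up with None."""
    for attempt in range(1, max_attempts + 1):
        raw = ask_judge(prompt)
        scores = parse_pair(raw)
        if scores != SENTINEL:
            return scores
        # Log the raw text so formatting problems can be diagnosed later.
        logger.warning(
            "Attempt %d/%d: could not parse judge output: %r",
            attempt, max_attempts, raw,
        )
    return None
```

Returning None rather than the sentinel makes the failure explicit to callers, who can then skip the sample or record it separately instead of averaging -1.0 into the results.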
