Implementation:EvolvingLMMs Lab Lmms eval LLM Judge Utils

From Leeroopedia

Overview

This implementation provides two static utility classes: JudgePromptBuilder, which formats evaluation prompts from templates, and ResponseParser, which extracts binary verdicts, numeric scores, score pairs, and JSON objects from judge responses. When a response cannot be parsed, the parsers return a defined fallback value instead of raising.

File Location

lmms_eval/llm_judge/utils.py (115 lines)

Related Principle

LLM as Judge

Dependencies

  • re: Regular expression parsing
  • typing: Type hints
  • json: JSON parsing (imported lazily inside parse_json_response)
  • LLM Judge Prompt Templates: BINARY_JUDGE_PROMPT, COMPARATIVE_JUDGE_PROMPT, CORRECTNESS_JUDGE_PROMPT

Core Components

JudgePromptBuilder

Static helper class for building evaluation prompts from templates.

build_binary_prompt

@staticmethod
def build_binary_prompt(
    question: str,
    answer: str,
    prediction: str,
    output_format: str = "0/1",
    custom_prompt: Optional[str] = None,
    **kwargs
) -> str

Build prompt for binary correctness evaluation.

Parameters:

  • question (str): The question being evaluated
  • answer (str): Ground truth answer
  • prediction (str): Model's predicted answer to evaluate
  • output_format (str): Output format, "0/1" or "yes/no" (default: "0/1")
  • custom_prompt (Optional[str]): Custom prompt template (overrides default)
  • **kwargs: Additional formatting arguments

Returns:

  • str: Formatted evaluation prompt

Logic:

  1. If custom_prompt is provided:
    • Formats it with question, answer, pred (alias), prediction, and kwargs
    • Returns the formatted custom prompt
  2. Otherwise:
    • Determines positive/negative symbols from output_format:
      • "0/1" or "1/0" → positive="1", negative="0"
      • Otherwise → positive="Yes", negative="No"
    • Formats BINARY_JUDGE_PROMPT with all parameters
    • Returns the formatted standard prompt
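
The branching described above can be sketched as follows. This is an illustrative reconstruction, not the library's code: STAND_IN_BINARY_TEMPLATE and build_binary_prompt_sketch are hypothetical names, the real BINARY_JUDGE_PROMPT text is not shown on this page, and the use of str.format is an assumption.

```python
from typing import Optional

# Hypothetical stand-in for BINARY_JUDGE_PROMPT (the real template text is not shown here).
STAND_IN_BINARY_TEMPLATE = (
    "Question: {question}\n"
    "Ground truth: {answer}\n"
    "Prediction: {prediction}\n"
    "Reply {positive} if correct, {negative} otherwise."
)


def build_binary_prompt_sketch(
    question: str,
    answer: str,
    prediction: str,
    output_format: str = "0/1",
    custom_prompt: Optional[str] = None,
    **kwargs,
) -> str:
    """Mirror the documented branching; assumes str.format is used for templating."""
    if custom_prompt:
        # Custom templates may reference {prediction} or its alias {pred}.
        return custom_prompt.format(
            question=question,
            answer=answer,
            pred=prediction,
            prediction=prediction,
            **kwargs,
        )
    if output_format in ("0/1", "1/0"):
        positive, negative = "1", "0"
    else:
        positive, negative = "Yes", "No"
    return STAND_IN_BINARY_TEMPLATE.format(
        question=question,
        answer=answer,
        prediction=prediction,
        positive=positive,
        negative=negative,
    )


standard = build_binary_prompt_sketch("2+2?", "4", "four", output_format="yes/no")
custom = build_binary_prompt_sketch("2+2?", "4", "four", custom_prompt="{question} -> {pred}")
print(standard)
print(custom)
```

Note that in this sketch the custom-prompt branch returns before output_format is consulted, so a custom template has to state the expected reply format itself.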

Example:

# Standard prompt
prompt = JudgePromptBuilder.build_binary_prompt(
    question="What is the capital of France?",
    answer="Paris",
    prediction="The capital is Paris",
    output_format="0/1"
)

# Custom prompt
custom = "Q: {question}\nA: {answer}\nP: {prediction}\nCorrect? (1/0)"
prompt = JudgePromptBuilder.build_binary_prompt(
    question="...",
    answer="...",
    prediction="...",
    custom_prompt=custom
)

build_comparative_prompt

@staticmethod
def build_comparative_prompt(
    question: str,
    response1: str,
    response2: str,
    context: Optional[str] = None,
    score_range: Tuple[int, int] = (1, 10),
    custom_prompt: Optional[str] = None,
    evaluation_instruction: Optional[str] = None,
    **kwargs
) -> str

Build prompt for comparative evaluation of two responses.

Parameters:

  • question (str): The question both responses address
  • response1 (str): First response to evaluate
  • response2 (str): Second response to evaluate
  • context (Optional[str]): Additional context for evaluation (default: None)
  • score_range (Tuple[int, int]): Min and max scores (default: (1, 10))
  • custom_prompt (Optional[str]): Custom prompt template
  • evaluation_instruction (Optional[str]): Custom evaluation instructions
  • **kwargs: Additional formatting arguments

Returns:

  • str: Formatted comparative evaluation prompt

Logic:

  1. If custom_prompt is provided:
    • Formats it with question, response1, response2, context (or ""), and kwargs
    • Returns the formatted custom prompt
  2. Otherwise:
    • Constructs context_section:
      • If context exists: "[Context]\n{context}\n\n"
      • Otherwise: ""
    • If evaluation_instruction is not provided, sets the default: "Please provide scores from {min} to {max}."
    • Formats COMPARATIVE_JUDGE_PROMPT with all parameters
    • Returns the formatted standard prompt
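
A minimal sketch of the standard (non-custom) path, showing how the context section and the default instruction might be assembled. STAND_IN_COMPARATIVE_TEMPLATE and build_comparative_prompt_sketch are hypothetical; the layout of the real COMPARATIVE_JUDGE_PROMPT is not documented here.

```python
from typing import Optional, Tuple

# Hypothetical stand-in for COMPARATIVE_JUDGE_PROMPT.
STAND_IN_COMPARATIVE_TEMPLATE = (
    "{context_section}[Question]\n{question}\n\n"
    "[Response 1]\n{response1}\n\n"
    "[Response 2]\n{response2}\n\n"
    "{evaluation_instruction}"
)


def build_comparative_prompt_sketch(
    question: str,
    response1: str,
    response2: str,
    context: Optional[str] = None,
    score_range: Tuple[int, int] = (1, 10),
    evaluation_instruction: Optional[str] = None,
) -> str:
    """Standard path only: optional context block plus a default instruction."""
    context_section = f"[Context]\n{context}\n\n" if context else ""
    if evaluation_instruction is None:
        min_score, max_score = score_range
        evaluation_instruction = f"Please provide scores from {min_score} to {max_score}."
    return STAND_IN_COMPARATIVE_TEMPLATE.format(
        context_section=context_section,
        question=question,
        response1=response1,
        response2=response2,
        evaluation_instruction=evaluation_instruction,
    )


with_context = build_comparative_prompt_sketch(
    "Explain X", "A", "B", context="For a high school audience"
)
without_context = build_comparative_prompt_sketch("Explain X", "A", "B", score_range=(1, 5))
print(with_context)
print(without_context)
```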

Example:

prompt = JudgePromptBuilder.build_comparative_prompt(
    question="Explain quantum computing",
    response1="Quantum computing uses qubits...",
    response2="Quantum computers leverage superposition...",
    context="For a high school audience",
    score_range=(1, 10),
    evaluation_instruction="Focus on clarity and accuracy."
)

build_correctness_prompt

@staticmethod
def build_correctness_prompt(
    question: str,
    answer: str,
    prediction: str,
    output_format: str = "yes/no",
    **kwargs
) -> str

Build prompt for mathematical/semantic correctness evaluation, ignoring formatting.

Parameters:

  • question (str): The question being evaluated
  • answer (str): Correct answer
  • prediction (str): Solution to evaluate
  • output_format (str): Output format, "yes/no" or "0/1" (default: "yes/no")
  • **kwargs: Additional formatting arguments (unused)

Returns:

  • str: Formatted correctness evaluation prompt

Logic:

  1. Determines positive/negative symbols:
    • "yes/no" → positive="Yes", negative="No"
    • Otherwise → positive="1", negative="0"
  2. Formats CORRECTNESS_JUDGE_PROMPT with question, answer, prediction, and the symbols
  3. Returns the formatted prompt
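
The symbol selection here falls back in the opposite direction from build_binary_prompt: an unrecognized output_format yields "1"/"0" rather than "Yes"/"No". The sketch below places the two documented rules side by side. The helper names are hypothetical, and exact string comparison is an assumption.

```python
from typing import Tuple


def binary_symbols(output_format: str) -> Tuple[str, str]:
    """Symbol choice as described for build_binary_prompt."""
    if output_format in ("0/1", "1/0"):
        return "1", "0"
    return "Yes", "No"


def correctness_symbols(output_format: str) -> Tuple[str, str]:
    """Symbol choice as described for build_correctness_prompt."""
    if output_format == "yes/no":
        return "Yes", "No"
    return "1", "0"


# An unrecognized format falls through to a different default in each builder.
for fmt in ("0/1", "yes/no", "true/false"):
    print(fmt, binary_symbols(fmt), correctness_symbols(fmt))
```

Passing output_format explicitly, and using the same value when parsing, avoids relying on either fallback.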

Example:

prompt = JudgePromptBuilder.build_correctness_prompt(
    question="Solve: x^2 = 16",
    answer="x = 4 or x = -4",
    prediction="\\boxed{x = \\pm 4}",
    output_format="yes/no"
)

ResponseParser

Static helper class for parsing different types of judge responses.

parse_binary_response

@staticmethod
def parse_binary_response(response: str, output_format: str = "0/1") -> Union[int, bool]

Parse binary response (0/1 or yes/no) from judge output.

Parameters:

  • response (str): Raw response text from judge
  • output_format (str): Expected format, "0/1" or "yes/no" (default: "0/1")

Returns:

  • Union[int, bool]:
    • For "0/1" or "1/0": Returns 1 (correct) or 0 (incorrect)
    • For "yes/no": Returns True or False

Logic:

  1. Strips and lowercases the response
  2. If output_format is "0/1" or "1/0":
    • Checks whether the response contains any of: "1", "[1]", "score: 1", "answer: 1"
    • Returns 1 if found, 0 otherwise
  3. Otherwise (yes/no format):
    • Returns True if the response equals "yes" or starts with "yes"
    • Returns False otherwise

Robustness:

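  • Handles various formats: "1", "[1]", "score: 1", "answer: 1"
  • Case-insensitive matching
  • Flexible yes/no detection (prefix match)
  • Defaults to negative (0/False) when no positive marker is found
  • Because the 0/1 check is a substring match, any response containing the character "1" (for example "10") counts as positive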

Example:

# Various formats
parse_binary_response("1", "0/1")                # → 1
parse_binary_response("[1]", "0/1")              # → 1
parse_binary_response("score: 1", "0/1")         # → 1
parse_binary_response("0", "0/1")                # → 0
parse_binary_response("Yes", "yes/no")           # → True
parse_binary_response("yes, correct", "yes/no")  # → True
parse_binary_response("No", "yes/no")            # → False
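
The matching rules above can be sketched as follows, assuming the 0/1 branch is plain substring containment as the logic steps describe. parse_binary_sketch is a hypothetical stand-in, not the library function; the last three calls show edge cases that follow from those rules.

```python
from typing import Union


def parse_binary_sketch(response: str, output_format: str = "0/1") -> Union[int, bool]:
    """Sketch of the documented rules: substring match for 0/1, prefix match for yes/no."""
    text = response.strip().lower()
    if output_format in ("0/1", "1/0"):
        markers = ("1", "[1]", "score: 1", "answer: 1")
        return 1 if any(marker in text for marker in markers) else 0
    return text == "yes" or text.startswith("yes")


print(parse_binary_sketch("Score: 1"))               # 1
print(parse_binary_sketch("0"))                      # 0
print(parse_binary_sketch("YES, correct", "yes/no")) # True

# Edge cases implied by the rules:
print(parse_binary_sketch("0 (see step 1)"))         # 1, because the text contains "1"
print(parse_binary_sketch("10"))                     # 1, for the same reason
print(parse_binary_sketch("Answer: yes", "yes/no"))  # False, "yes" is not a prefix
```

If such false positives matter, instructing the judge to reply with the bare symbol only keeps responses inside the cases the parser handles cleanly.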

parse_score_response

@staticmethod
def parse_score_response(
    response: str,
    score_range: Optional[Tuple[float, float]] = None
) -> float

Parse a single numerical score from response text.

Parameters:

  • response (str): Raw response text containing a score
  • score_range (Optional[Tuple[float, float]]): Valid (min, max) range for clamping

Returns:

  • float: Extracted score (clamped to range if provided)

Logic:

  1. Tries to extract the first number from the response:
    • Uses regex: r"-?\d+(?:\.\d+)?"
    • Finds all matching numbers (including decimals and negatives)
  2. If numbers are found:
    • Converts the first number to float
    • If score_range is provided, clamps the score to the [min, max] range
    • Returns the score
  3. On any exception, or if no numbers are found:
    • Returns the minimum score (score_range[0]), or 0.0 if no range is given

Robustness:

  • Extracts first number from verbose responses
  • Handles integers and decimals
  • Handles negative numbers
  • Clamps to valid range
  • Graceful fallback to minimum score

Example:

parse_score_response("Score: 8.5", (1, 10))           # → 8.5
parse_score_response("The score is 12", (1, 10))      # → 10.0 (clamped)
parse_score_response("Rating: -5", (1, 10))           # → 1.0 (clamped)
parse_score_response("No valid score", (1, 10))       # → 1.0 (fallback)
parse_score_response("3.14159")                       # → 3.14159
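
A minimal sketch of the extract-then-clamp behaviour, using the regex quoted in the logic steps. parse_score_sketch is a hypothetical name, and the exception types caught here are narrower than the "any exception" described above.

```python
import re
from typing import Optional, Tuple

NUMBER_PATTERN = re.compile(r"-?\d+(?:\.\d+)?")


def parse_score_sketch(
    response: str, score_range: Optional[Tuple[float, float]] = None
) -> float:
    """Return the first number in the text, clamped to score_range when given."""
    fallback = float(score_range[0]) if score_range else 0.0
    try:
        numbers = NUMBER_PATTERN.findall(response)
        if not numbers:
            return fallback
        score = float(numbers[0])
        if score_range:
            low, high = score_range
            score = max(float(low), min(float(high), score))
        return score
    except (TypeError, ValueError):
        return fallback


print(parse_score_sketch("Score: 8.5", (1, 10)))              # 8.5
print(parse_score_sketch("The score is 12", (1, 10)))         # 10.0
print(parse_score_sketch("No valid score", (1, 10)))          # 1.0
# The first number wins, even when it is not the score:
print(parse_score_sketch("Answer 2 deserves an 8", (1, 10)))  # 2.0
```

Two consequences are worth keeping in mind: a failed parse is indistinguishable from a genuine minimum score, and any number that precedes the score in the text is taken instead of it.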

parse_comparative_response

@staticmethod
def parse_comparative_response(response: str) -> Tuple[float, float]

Parse two comparative scores from response text.

Parameters:

  • response (str): Raw response text (expected format: "score1 score2" on first line)

Returns:

  • Tuple[float, float]: (score1, score2) for the two responses being compared

Logic:

  1. Tries to extract scores:
    • Splits the response into lines
    • Gets the first line (score line)
    • Normalizes separators: replaces "," and ";" with " "
    • Extracts all numbers with regex: r"-?\d+(?:\.\d+)?"
  2. If at least 2 numbers are found:
    • Returns (first_number, second_number) as floats
  3. On any exception or insufficient numbers:
    • Returns (-1.0, -1.0) as error sentinel

Robustness:

  • Handles multiple separator formats (space, comma, semicolon)
  • Focuses on first line (ignores explanation text)
  • Extracts decimals and negatives
  • Clear error sentinel (-1.0, -1.0) for parse failures

Example:

response = """8 7
Assistant 1 provided more detail and accuracy..."""
parse_comparative_response(response)              # → (8.0, 7.0)

parse_comparative_response("8.5, 7.0")            # → (8.5, 7.0)
parse_comparative_response("Score: 9; Score: 6")  # → (9.0, 6.0)
parse_comparative_response("Invalid")             # → (-1.0, -1.0)
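
The first-line extraction can be sketched as below. parse_comparative_sketch is a hypothetical reconstruction from the logic steps; details such as whether the real function strips leading blank lines are not documented here.

```python
import re
from typing import Tuple


def parse_comparative_sketch(response: str) -> Tuple[float, float]:
    """Take the first two numbers on the first line, or return the error sentinel."""
    try:
        first_line = response.split("\n")[0]
        normalized = first_line.replace(",", " ").replace(";", " ")
        numbers = re.findall(r"-?\d+(?:\.\d+)?", normalized)
        if len(numbers) >= 2:
            return float(numbers[0]), float(numbers[1])
    except (AttributeError, TypeError, ValueError):
        pass
    return -1.0, -1.0


print(parse_comparative_sketch("8 7\nAssistant 1 was more detailed"))  # (8.0, 7.0)
print(parse_comparative_sketch("8.5, 7.0"))                            # (8.5, 7.0)
print(parse_comparative_sketch("Invalid"))                             # (-1.0, -1.0)
# Labels containing digits shift the result:
print(parse_comparative_sketch("Assistant 1: 8, Assistant 2: 7"))      # (1.0, 8.0)
```

The last call suggests why the prompt should ask for the two bare scores on the first line: any digit in a label is read as a score.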

parse_json_response

@staticmethod
def parse_json_response(response: str) -> Dict[str, Any]

Parse JSON-formatted response from judge output.

Parameters:

  • response (str): Raw response text containing JSON

Returns:

  • Dict[str, Any]: Parsed JSON object, or empty dict on failure

Logic:

  1. Tries to extract and parse JSON:
    • Uses regex to find a JSON object: r"\{.*\}" (DOTALL mode)
  2. If a JSON match is found:
    • Imports the json module
    • Parses the matched string
    • Returns the parsed dictionary
  3. On any exception (no match, invalid JSON, etc.):
    • Returns empty dict {}

Robustness:

  • Extracts JSON from verbose responses (ignores surrounding text)
  • DOTALL regex mode handles multi-line JSON
  • Graceful fallback to empty dict
  • Lazy import of json module (only when needed)

Example:

response = '''The evaluation results:
{
    "accuracy": 0.9,
    "clarity": 0.85,
    "completeness": 0.95
}
Additional comments...'''

parse_json_response(response)
# → {"accuracy": 0.9, "clarity": 0.85, "completeness": 0.95}

parse_json_response("Not JSON")  # → {}
parse_json_response('{"key": "value"}')  # → {"key": "value"}
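
A sketch of the regex-based extraction, assuming the pattern is applied with re.search. parse_json_sketch is a hypothetical name, and json is imported at module level here for brevity rather than lazily. Because the pattern is greedy, it spans from the first "{" to the last "}", which the final call illustrates.

```python
import json
import re
from typing import Any, Dict


def parse_json_sketch(response: str) -> Dict[str, Any]:
    """Extract the span from the first '{' to the last '}' and parse it as JSON."""
    try:
        match = re.search(r"\{.*\}", response, re.DOTALL)
        if match:
            return json.loads(match.group())
    except (TypeError, ValueError):  # json.JSONDecodeError subclasses ValueError
        pass
    return {}


print(parse_json_sketch('Result:\n{\n  "accuracy": 0.9\n}\nDone.'))  # {'accuracy': 0.9}
print(parse_json_sketch("Not JSON"))                                 # {}
# Stray braces before the object make the greedy match invalid JSON:
print(parse_json_sketch('Use {braces} like {"a": 1}'))               # {}
```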

Design Patterns

Static Utility Classes

Both classes contain only static methods, acting as namespaces for related functions. This design:

  • Groups related functionality
  • Avoids unnecessary instantiation
  • Provides clear API surface
  • Enables easy importing (from lmms_eval.llm_judge.utils import JudgePromptBuilder)

Template Method

Prompt builders follow a consistent pattern:

  1. Check for custom prompt, where the builder accepts one (early return if present)
  2. Determine format-specific parameters
  3. Format standard template
  4. Return result

Defensive Parsing

Response parsers return a defined default instead of raising; the score, comparative, and JSON parsers wrap extraction in try-except blocks:

  • parse_binary_response: Defaults to 0/False
  • parse_score_response: Defaults to minimum score
  • parse_comparative_response: Returns (-1.0, -1.0) sentinel
  • parse_json_response: Returns empty dict

This ensures evaluation pipelines don't crash on unexpected formats.

Format Flexibility

Parsers handle multiple output variations:

  • Binary: "1", "[1]", "score: 1", "Yes", "yes, because..."
  • Scores: "8.5", "Score: 8.5", "The score is 8.5"
  • Comparative: "8 7", "8.5, 7.0", "Score: 9; Score: 6"
  • JSON: Embedded in text or standalone

Usage Patterns

Prompt Building Workflow

from lmms_eval.llm_judge.utils import JudgePromptBuilder, ResponseParser

# Build prompt
prompt = JudgePromptBuilder.build_binary_prompt(
    question="What is H2O?",
    answer="Water",
    prediction="Water molecule",
    output_format="0/1"
)

# Send to judge (judge is a placeholder for whatever client returns the response text)
response = judge.call(prompt)

# Parse response with the same output_format used to build the prompt
result = ResponseParser.parse_binary_response(response, "0/1")

Custom Prompt with Parsing

custom_template = """
Question: {question}
Expected: {answer}
Student Answer: {prediction}
Extra Context: {extra_info}

Is the student answer correct? Reply with 1 (yes) or 0 (no).
"""

prompt = JudgePromptBuilder.build_binary_prompt(
    question="...",
    answer="...",
    prediction="...",
    custom_prompt=custom_template,
    extra_info="Consider partial credit."  # **kwargs
)

response = judge.call(prompt)
result = ResponseParser.parse_binary_response(response, "0/1")
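
The custom-prompt path is described as formatting the template with named arguments, which suggests Python's str.format; that is an assumption, not something this page confirms. Under that assumption, two standard str.format behaviours matter when writing templates: literal braces must be doubled, and a placeholder without a matching argument raises KeyError.

```python
# Literal braces (for example a JSON reply format) must be escaped as {{ and }}.
template = (
    "Question: {question}\n"
    "Student Answer: {prediction}\n"
    'Reply as JSON, e.g. {{"correct": 1}}'
)
filled = template.format(question="What is H2O?", prediction="Water")
print(filled)

# A placeholder with no matching argument raises KeyError naming the missing field.
missing = None
try:
    "Extra Context: {extra_info}".format(question="What is H2O?")
except KeyError as exc:
    missing = exc.args[0]
print(missing)  # extra_info
```

So every extra placeholder in a custom template, such as {extra_info} above, needs a matching keyword argument in the builder call.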

Comparative Evaluation Pipeline

# Build prompt
prompt = JudgePromptBuilder.build_comparative_prompt(
    question="Explain photosynthesis",
    response1=model_a_output,
    response2=model_b_output,
    score_range=(1, 10)
)

# Get evaluation
response = judge.call(prompt)

# Parse scores
score_a, score_b = ResponseParser.parse_comparative_response(response)

if score_a == -1.0 and score_b == -1.0:  # Sentinel: parse failed
    print("Failed to parse scores")
elif score_a == score_b:
    print(f"Tie ({score_a} vs {score_b})")
else:
    winner = "Model A" if score_a > score_b else "Model B"
    print(f"Winner: {winner} ({score_a} vs {score_b})")

Rubric-based Evaluation

# Build prompt with rubric
rubric = {
    "accuracy": "Factual correctness",
    "clarity": "Clear explanation",
    "completeness": "Addresses all aspects"
}

rubric_text = "\n".join([f"- {k}: {v}" for k, v in rubric.items()])
prompt = f"""Evaluate the response on these criteria:
{rubric_text}

Response: {prediction}

Provide JSON with scores (0-1) for each criterion."""

# Get evaluation
response = judge.call(prompt)

# Parse JSON
scores = ResponseParser.parse_json_response(response)
print(f"Accuracy: {scores.get('accuracy', 'N/A')}")
print(f"Clarity: {scores.get('clarity', 'N/A')}")

Error Handling Best Practices

Check Parse Results

import logging

logger = logging.getLogger(__name__)

score1, score2 = ResponseParser.parse_comparative_response(response)
if score1 == -1.0 and score2 == -1.0:
    logger.warning("Failed to parse comparative scores from: %s", response)
    # Fallback logic

Validate Score Ranges

score = ResponseParser.parse_score_response(response, score_range=(1, 10))
# With a range, score always lies in [1, 10]; a failed parse also returns 1.0
assert 1 <= score <= 10

Handle JSON Parse Failures

scores = ResponseParser.parse_json_response(response)
if not scores:  # Empty dict indicates parse failure
    logger.error(f"Failed to parse JSON from: {response}")
    # Provide default scores or retry
    scores = {"accuracy": 0.5, "clarity": 0.5}

Integration with Base Classes

The utilities are used throughout the base evaluation methods:

# In ServerInterface.evaluate_binary()
prompt = JudgePromptBuilder.build_binary_prompt(...)
request = Request(messages=[{"role": "user", "content": prompt}], ...)
response = self.evaluate(request)
parsed_result = ResponseParser.parse_binary_response(response.content, output_format)

Testing Considerations

Prompt Builder Tests

  • Test with and without custom prompts
  • Verify output_format affects positive/negative symbols
  • Confirm kwargs are passed through to custom prompts
  • Test context_section inclusion/exclusion

Response Parser Tests

  • Test various response formats (verbose, minimal, malformed)
  • Verify score clamping behavior
  • Test error cases (no numbers, invalid JSON)
  • Confirm sentinel values for parse failures

Best Practices

  1. Always specify output_format explicitly for consistency
  2. Use score_range for clamping in scoring tasks
  3. Check for parse failure sentinels ((-1.0, -1.0), empty dict)
  4. Log raw responses when parsing fails for debugging
  5. Test parsers with real API responses before deployment
  6. Consider retry logic if parsing fails (the failure may be a transient formatting issue)
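
As one possible shape for points 3, 4, and 6, the sketch below retries a judge call until the parsed result is usable and logs each raw response that fails. Everything here is hypothetical: judge_with_retry, fake_judge, and parse_dict are illustration-only names, and the fake judge stands in for a real client.

```python
import json
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def judge_with_retry(
    call_judge: Callable[[str], str],
    parse: Callable[[str], T],
    is_failure: Callable[[T], bool],
    prompt: str,
    max_attempts: int = 3,
) -> Optional[T]:
    """Call the judge until the parsed result is usable or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        raw = call_judge(prompt)
        parsed = parse(raw)
        if not is_failure(parsed):
            return parsed
        # Log the raw text so formatting problems can be diagnosed later.
        logger.warning("Attempt %d: could not parse %r", attempt, raw)
    return None


# Demo with a fake judge whose first reply is malformed.
replies = iter(["no json here", '{"accuracy": 0.9}'])


def fake_judge(prompt: str) -> str:
    return next(replies)


def parse_dict(raw: str) -> dict:
    try:
        return json.loads(raw)
    except ValueError:
        return {}


result = judge_with_retry(fake_judge, parse_dict, lambda d: not d, "Rate this answer.")
print(result)  # {'accuracy': 0.9}
```

With the real utilities, parse could be ResponseParser.parse_json_response with an empty-dict check, or ResponseParser.parse_comparative_response with a check for (-1.0, -1.0).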
