Implementation:EvolvingLMMs Lab Lmms eval LLM Judge Utils
Overview
This implementation provides two static utility classes: JudgePromptBuilder, which formats evaluation prompts from templates, and ResponseParser, which extracts structured results from raw judge output. The parsers return a default or sentinel value instead of raising when a response does not match the expected format.
File Location
lmms_eval/llm_judge/utils.py (115 lines)
Dependencies
- re: Regular expression parsing
- typing: Type hints
- LLM Judge Prompt Templates: BINARY_JUDGE_PROMPT, COMPARATIVE_JUDGE_PROMPT, CORRECTNESS_JUDGE_PROMPT
Core Components
JudgePromptBuilder
Static helper class for building evaluation prompts from templates.
build_binary_prompt
@staticmethod
def build_binary_prompt(
    question: str,
    answer: str,
    prediction: str,
    output_format: str = "0/1",
    custom_prompt: Optional[str] = None,
    **kwargs
) -> str
Build prompt for binary correctness evaluation.
Parameters:
- question (str): The question being evaluated
- answer (str): Ground truth answer
- prediction (str): Model's predicted answer to evaluate
- output_format (str): Output format, "0/1" or "yes/no" (default: "0/1")
- custom_prompt (Optional[str]): Custom prompt template (overrides default)
- **kwargs: Additional formatting arguments
Returns:
- str: Formatted evaluation prompt
Logic:
- If custom_prompt provided:
  - Formats with question, answer, pred (an alias for prediction), prediction, and kwargs
  - Returns formatted custom prompt
- Otherwise:
  - Determines positive/negative symbols from output_format:
    - "0/1" or "1/0" → positive="1", negative="0"
    - Otherwise → positive="Yes", negative="No"
  - Formats BINARY_JUDGE_PROMPT with all parameters
  - Returns formatted standard prompt
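The branch structure above can be sketched as follows. This is an illustrative re-implementation, not the library's code: the text of BINARY_JUDGE_PROMPT is not shown on this page, so DEMO_BINARY_TEMPLATE and its {positive}/{negative} placeholder names are hypothetical stand-ins, as is the function name sketch_build_binary_prompt.

```python
from typing import Optional

# Hypothetical stand-in: the real BINARY_JUDGE_PROMPT text is not shown here.
DEMO_BINARY_TEMPLATE = (
    "Question: {question}\n"
    "Ground truth: {answer}\n"
    "Prediction: {prediction}\n"
    "Reply {positive} if correct, {negative} otherwise."
)


def sketch_build_binary_prompt(
    question: str,
    answer: str,
    prediction: str,
    output_format: str = "0/1",
    custom_prompt: Optional[str] = None,
    **kwargs,
) -> str:
    """Mirror the steps listed under Logic (illustration only)."""
    if custom_prompt:
        # A custom template may use {prediction} or its alias {pred};
        # extra keyword arguments fill any additional placeholders.
        return custom_prompt.format(
            question=question,
            answer=answer,
            pred=prediction,
            prediction=prediction,
            **kwargs,
        )
    if output_format in ("0/1", "1/0"):
        positive, negative = "1", "0"
    else:
        positive, negative = "Yes", "No"
    return DEMO_BINARY_TEMPLATE.format(
        question=question,
        answer=answer,
        prediction=prediction,
        positive=positive,
        negative=negative,
    )


standard = sketch_build_binary_prompt("2+2?", "4", "four", output_format="yes/no")
custom = sketch_build_binary_prompt(
    "2+2?",
    "4",
    "four",
    custom_prompt="{question} | {answer} | {pred} | {hint}",
    hint="be strict",
)
print(standard)
print(custom)  # 2+2? | 4 | four | be strict
```

Because the custom template goes through str.format, a placeholder with no matching argument raises KeyError, and literal braces in a template need to be doubled.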
Example:
# Standard prompt
prompt = JudgePromptBuilder.build_binary_prompt(
    question="What is the capital of France?",
    answer="Paris",
    prediction="The capital is Paris",
    output_format="0/1"
)
# Custom prompt
custom = "Q: {question}\nA: {answer}\nP: {prediction}\nCorrect? (1/0)"
prompt = JudgePromptBuilder.build_binary_prompt(
    question="...",
    answer="...",
    prediction="...",
    custom_prompt=custom
)
build_comparative_prompt
@staticmethod
def build_comparative_prompt(
    question: str,
    response1: str,
    response2: str,
    context: Optional[str] = None,
    score_range: Tuple[int, int] = (1, 10),
    custom_prompt: Optional[str] = None,
    evaluation_instruction: Optional[str] = None,
    **kwargs
) -> str
Build prompt for comparative evaluation of two responses.
Parameters:
- question (str): The question both responses address
- response1 (str): First response to evaluate
- response2 (str): Second response to evaluate
- context (Optional[str]): Additional context for evaluation (default: None)
- score_range (Tuple[int, int]): Min and max scores (default: (1, 10))
- custom_prompt (Optional[str]): Custom prompt template
- evaluation_instruction (Optional[str]): Custom evaluation instructions
- **kwargs: Additional formatting arguments
Returns:
- str: Formatted comparative evaluation prompt
Logic:
- If custom_prompt provided:
  - Formats with question, response1, response2, context (or ""), and kwargs
  - Returns formatted custom prompt
- Otherwise:
  - Constructs context_section:
    - If context exists: "[Context]\n{context}\n\n"
    - Otherwise: ""
  - If evaluation_instruction not provided:
    - Sets default: "Please provide scores from {min} to {max}."
  - Formats COMPARATIVE_JUDGE_PROMPT with all parameters
  - Returns formatted standard prompt
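The two derived values in the standard branch can be sketched in isolation. The helper name sketch_comparative_parts is hypothetical, and treating "context exists" as a truthiness check is an assumption; the actual method may test for None instead.

```python
from typing import Optional, Tuple


def sketch_comparative_parts(
    context: Optional[str] = None,
    score_range: Tuple[int, int] = (1, 10),
    evaluation_instruction: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (context_section, evaluation_instruction) as described above."""
    # Assumption: an empty string counts as "no context".
    context_section = f"[Context]\n{context}\n\n" if context else ""
    if not evaluation_instruction:
        low, high = score_range
        evaluation_instruction = f"Please provide scores from {low} to {high}."
    return context_section, evaluation_instruction


with_context = sketch_comparative_parts(context="For a high school audience")
without_context = sketch_comparative_parts(score_range=(0, 5))
print(repr(with_context[0]))  # '[Context]\nFor a high school audience\n\n'
print(without_context[1])     # Please provide scores from 0 to 5.
```

Note that score_range only shapes the default instruction text here; nothing in the steps above suggests it is enforced when the judge's scores are parsed.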
Example:
prompt = JudgePromptBuilder.build_comparative_prompt(
    question="Explain quantum computing",
    response1="Quantum computing uses qubits...",
    response2="Quantum computers leverage superposition...",
    context="For a high school audience",
    score_range=(1, 10),
    evaluation_instruction="Focus on clarity and accuracy."
)
build_correctness_prompt
@staticmethod
def build_correctness_prompt(
    question: str,
    answer: str,
    prediction: str,
    output_format: str = "yes/no",
    **kwargs
) -> str
Build prompt for mathematical/semantic correctness evaluation, ignoring formatting.
Parameters:
- question (str): The question being evaluated
- answer (str): Correct answer
- prediction (str): Solution to evaluate
- output_format (str): Output format, "yes/no" or "0/1" (default: "yes/no")
- **kwargs: Additional formatting arguments (unused)
Returns:
- str: Formatted correctness evaluation prompt
Logic:
- Determines positive/negative symbols:
  - "yes/no" → positive="Yes", negative="No"
  - Otherwise → positive="1", negative="0"
- Formats CORRECTNESS_JUDGE_PROMPT with question, answer, prediction, symbols
- Returns formatted prompt
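The symbol selection here is the mirror image of the one in build_binary_prompt, which matters for unrecognized format strings. The sketch below encodes both rules as described on this page; the helper names binary_symbols and correctness_symbols are hypothetical, and exact string comparison is an assumption.

```python
from typing import Tuple


def binary_symbols(output_format: str) -> Tuple[str, str]:
    """Symbol choice described for build_binary_prompt."""
    return ("1", "0") if output_format in ("0/1", "1/0") else ("Yes", "No")


def correctness_symbols(output_format: str) -> Tuple[str, str]:
    """Symbol choice described for build_correctness_prompt."""
    return ("Yes", "No") if output_format == "yes/no" else ("1", "0")


for fmt in ("0/1", "yes/no", "true/false"):
    print(fmt, binary_symbols(fmt), correctness_symbols(fmt))
# 0/1 ('1', '0') ('1', '0')
# yes/no ('Yes', 'No') ('Yes', 'No')
# true/false ('Yes', 'No') ('1', '0')
```

Under these rules an unexpected value such as "true/false" falls back to Yes/No in one builder and 1/0 in the other, which is one reason to pass output_format explicitly and reuse the same value when parsing.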
Example:
prompt = JudgePromptBuilder.build_correctness_prompt(
    question="Solve: x^2 = 16",
    answer="x = 4 or x = -4",
    prediction="\\boxed{x = \\pm 4}",
    output_format="yes/no"
)
ResponseParser
Static helper class for parsing different types of judge responses.
parse_binary_response
@staticmethod
def parse_binary_response(response: str, output_format: str = "0/1") -> Union[int, bool]
Parse binary response (0/1 or yes/no) from judge output.
Parameters:
- response (str): Raw response text from judge
- output_format (str): Expected format, "0/1" or "yes/no" (default: "0/1")
Returns:
- Union[int, bool]:
  - For "0/1" or "1/0": Returns 1 (correct) or 0 (incorrect)
  - For "yes/no": Returns True or False
Logic:
- Strips and lowercases response
- If output_format is "0/1" or "1/0":
  - Checks if response contains any of: "1", "[1]", "score: 1", "answer: 1"
  - Returns 1 if found, 0 otherwise
- Otherwise (yes/no format):
  - Returns True if response == "yes" or starts with "yes"
  - Returns False otherwise
Robustness:
- Handles various formats: "1", "[1]", "score: 1", "answer: 1"
- Case-insensitive matching
- Flexible yes/no detection (prefix match)
- Defaults to negative (0/False) when no positive marker is found
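A minimal sketch of the parsing steps listed under Logic, written only from the description on this page (the function name sketch_parse_binary is hypothetical and the library's actual code may differ). It assumes "contains" means a plain substring test, which makes the check permissive: any response with the character "1" anywhere counts as positive.

```python
from typing import Union


def sketch_parse_binary(response: str, output_format: str = "0/1") -> Union[int, bool]:
    """Re-implementation of the listed steps, for illustration only."""
    text = response.strip().lower()
    if output_format in ("0/1", "1/0"):
        markers = ("1", "[1]", "score: 1", "answer: 1")
        # Substring test: the bare "1" marker already covers the other three.
        return 1 if any(marker in text for marker in markers) else 0
    return text == "yes" or text.startswith("yes")


print(sketch_parse_binary("  [1] ", "0/1"))              # 1
print(sketch_parse_binary("0", "0/1"))                   # 0
print(sketch_parse_binary("YES, correct", "yes/no"))     # True
print(sketch_parse_binary("0 (confidence: 100%)", "0/1"))  # 1, because "100" contains "1"
```

If the judge is allowed to add explanations, constraining it to reply with a single token keeps this kind of substring matching from producing false positives.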
Example:
# Various formats
parse_binary_response("1", "0/1") # → 1
parse_binary_response("[1]", "0/1") # → 1
parse_binary_response("score: 1", "0/1") # → 1
parse_binary_response("0", "0/1") # → 0
parse_binary_response("Yes", "yes/no") # → True
parse_binary_response("yes, correct", "yes/no") # → True
parse_binary_response("No", "yes/no") # → False
parse_score_response
@staticmethod
def parse_score_response(
    response: str,
    score_range: Optional[Tuple[float, float]] = None
) -> float
Parse a single numerical score from response text.
Parameters:
- response (str): Raw response text containing a score
- score_range (Optional[Tuple[float, float]]): Valid (min, max) range for clamping
Returns:
- float: Extracted score (clamped to range if provided)
Logic:
- Tries to extract first number from response:
  - Uses regex: r"-?\d+(?:\.\d+)?"
  - Finds all matching numbers (including decimals and negatives)
  - If numbers found:
    - Converts first number to float
    - If score_range provided:
      - Clamps score to [min, max] range
    - Returns score
- On any exception or no numbers found:
  - Returns minimum score (score_range[0]) or 0.0 if no range
Robustness:
- Extracts first number from verbose responses
- Handles integers and decimals
- Handles negative numbers
- Clamps to valid range
- Graceful fallback to minimum score
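The extraction and clamping steps can be sketched as below. This is an illustrative re-implementation under the name sketch_parse_score (hypothetical), using the regex quoted above; it is not the library's code. The last call shows a consequence of taking the first number: a response that states the scale before the score yields the scale's lower bound.

```python
import re
from typing import Optional, Tuple

NUMBER_PATTERN = r"-?\d+(?:\.\d+)?"  # regex quoted in the Logic list


def sketch_parse_score(
    response: str, score_range: Optional[Tuple[float, float]] = None
) -> float:
    """First number in the text, clamped to score_range when one is given."""
    fallback = float(score_range[0]) if score_range else 0.0
    try:
        numbers = re.findall(NUMBER_PATTERN, response)
        if not numbers:
            return fallback
        score = float(numbers[0])
        if score_range:
            low, high = score_range
            score = max(low, min(high, score))
        return float(score)
    except Exception:
        return fallback


print(sketch_parse_score("Score: 8.5", (1, 10)))              # 8.5
print(sketch_parse_score("The score is 12", (1, 10)))         # 10.0
print(sketch_parse_score("No valid score", (1, 10)))          # 1.0
print(sketch_parse_score("On a scale of 1-10: 7", (1, 10)))   # 1.0, the first number is the "1" in "1-10"
```

A fallback equal to the minimum score is indistinguishable from a genuine minimum rating, so callers that need to tell the two apart may want to inspect the raw response as well.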
Example:
parse_score_response("Score: 8.5", (1, 10)) # → 8.5
parse_score_response("The score is 12", (1, 10)) # → 10.0 (clamped)
parse_score_response("Rating: -5", (1, 10)) # → 1.0 (clamped)
parse_score_response("No valid score", (1, 10)) # → 1.0 (fallback)
parse_score_response("3.14159") # → 3.14159
parse_comparative_response
@staticmethod
def parse_comparative_response(response: str) -> Tuple[float, float]
Parse two comparative scores from response text.
Parameters:
response(str): Raw response text (expected format: "score1 score2" on first line)
Returns:
- Tuple[float, float]: (score1, score2) for the two responses being compared
Logic:
- Tries to extract scores:
  - Splits response into lines
  - Gets first line (score line)
  - Normalizes separators: replaces "," and ";" with " "
  - Extracts all numbers with regex: r"-?\d+(?:\.\d+)?"
  - If at least 2 numbers found:
    - Returns (first_number, second_number) as floats
- On any exception or insufficient numbers:
  - Returns (-1.0, -1.0) as error sentinel
Robustness:
- Handles multiple separator formats (space, comma, semicolon)
- Focuses on first line (ignores explanation text)
- Extracts decimals and negatives
- Clear error sentinel (-1.0, -1.0) for parse failures
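The first-line extraction can be sketched as follows, again as an illustration written from the steps above rather than the library's code (sketch_parse_comparative is a hypothetical name, and stripping the response before splitting is an assumption). The last two calls show inputs where the described approach returns something other than the intended scores.

```python
import re
from typing import Tuple

SENTINEL = (-1.0, -1.0)


def sketch_parse_comparative(response: str) -> Tuple[float, float]:
    """Take the first two numbers found on the first line."""
    try:
        # Assumption: leading whitespace is stripped before taking the first line.
        first_line = response.strip().split("\n")[0]
        first_line = first_line.replace(",", " ").replace(";", " ")
        numbers = re.findall(r"-?\d+(?:\.\d+)?", first_line)
        if len(numbers) >= 2:
            return float(numbers[0]), float(numbers[1])
    except Exception:
        pass
    return SENTINEL


print(sketch_parse_comparative("8 7\nAssistant 1 was more detailed."))  # (8.0, 7.0)
print(sketch_parse_comparative("8.5, 7.0"))                             # (8.5, 7.0)
print(sketch_parse_comparative("Scores:\n8 7"))                         # (-1.0, -1.0), scores are on line two
print(sketch_parse_comparative("Assistant 1: 8, Assistant 2: 7"))       # (1.0, 8.0), the label "1" is read as a score
```

Any digits on the first line compete with the scores, so the prompt should ask for the two bare numbers on the first line and nothing else.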
Example:
response = """8 7
Assistant 1 provided more detail and accuracy..."""
parse_comparative_response(response) # → (8.0, 7.0)
parse_comparative_response("8.5, 7.0") # → (8.5, 7.0)
parse_comparative_response("Score: 9; Score: 6") # → (9.0, 6.0)
parse_comparative_response("Invalid") # → (-1.0, -1.0)
parse_json_response
@staticmethod
def parse_json_response(response: str) -> Dict[str, Any]
Parse JSON-formatted response from judge output.
Parameters:
response(str): Raw response text containing JSON
Returns:
- Dict[str, Any]: Parsed JSON object, or empty dict on failure
Logic:
- Tries to extract and parse JSON:
  - Uses regex to find JSON object: r"\{.*\}" (DOTALL mode)
  - If JSON match found:
    - Imports json module
    - Parses matched string
    - Returns parsed dictionary
- On any exception (no match, invalid JSON, etc.):
  - Returns empty dict {}
Robustness:
- Extracts JSON from verbose responses (ignores surrounding text)
- DOTALL regex mode handles multi-line JSON
- Graceful fallback to empty dict
- Lazy import of json module (only when needed)
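A sketch of the regex-then-parse approach described above, under the hypothetical name sketch_parse_json (not the library's code; json is imported at module level here for brevity). Because the quoted pattern is greedy, it spans from the first "{" to the last "}" in the response, so a reply containing two separate objects, or stray braces after the JSON, fails to parse and yields the empty dict.

```python
import json
import re
from typing import Any, Dict


def sketch_parse_json(response: str) -> Dict[str, Any]:
    """Extract the outermost {...} span and parse it, or return {}."""
    try:
        match = re.search(r"\{.*\}", response, re.DOTALL)
        if match:
            return json.loads(match.group())
    except Exception:
        pass
    return {}


print(sketch_parse_json('Result: {"accuracy": 0.9,\n "clarity": 0.85} Done.'))
# {'accuracy': 0.9, 'clarity': 0.85}
print(sketch_parse_json('{"scores": {"a": 1}}'))        # {'scores': {'a': 1}}
print(sketch_parse_json('A: {"a": 1} B: {"b": 2}'))     # {}, the greedy match spans both objects
print(sketch_parse_json("Not JSON"))                    # {}
```

The same empty dict is returned for "no JSON found", "invalid JSON", and a judge that legitimately answers {}, so the raw response is needed to tell these cases apart.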
Example:
response = '''The evaluation results:
{
"accuracy": 0.9,
"clarity": 0.85,
"completeness": 0.95
}
Additional comments...'''
parse_json_response(response)
# → {"accuracy": 0.9, "clarity": 0.85, "completeness": 0.95}
parse_json_response("Not JSON") # → {}
parse_json_response('{"key": "value"}') # → {"key": "value"}
Design Patterns
Static Utility Classes
Both classes contain only static methods, acting as namespaces for related functions. This design:
- Groups related functionality
- Avoids unnecessary instantiation
- Provides clear API surface
- Enables easy importing (from lmms_eval.llm_judge.utils import JudgePromptBuilder)
Template Method
Prompt builders follow a consistent pattern:
- Check for custom prompt where the builder accepts one (early return if present)
- Determine format-specific parameters
- Format standard template
- Return result
Defensive Parsing
Response parsers guard extraction with try-except where it can raise, and fall back to fixed defaults:
- parse_binary_response: Defaults to 0/False
- parse_score_response: Defaults to minimum score
- parse_comparative_response: Returns (-1.0, -1.0) sentinel
- parse_json_response: Returns empty dict
This ensures evaluation pipelines don't crash on unexpected formats.
Format Flexibility
Parsers handle multiple output variations:
- Binary: "1", "[1]", "score: 1", "Yes", "yes, because..."
- Scores: "8.5", "Score: 8.5", "The score is 8.5"
- Comparative: "8 7", "8.5, 7.0", "Score: 9; Score: 6"
- JSON: Embedded in text or standalone
Usage Patterns
Prompt Building Workflow
from lmms_eval.llm_judge.utils import JudgePromptBuilder
# Build prompt
prompt = JudgePromptBuilder.build_binary_prompt(
    question="What is H2O?",
    answer="Water",
    prediction="Water molecule",
    output_format="0/1"
)
# Send to judge
response = judge.call(prompt)
# Parse response
from lmms_eval.llm_judge.utils import ResponseParser
result = ResponseParser.parse_binary_response(response, "0/1")
Custom Prompt with Parsing
custom_template = """
Question: {question}
Expected: {answer}
Student Answer: {prediction}
Extra Context: {extra_info}
Is the student answer correct? Reply with 1 (yes) or 0 (no).
"""
prompt = JudgePromptBuilder.build_binary_prompt(
    question="...",
    answer="...",
    prediction="...",
    custom_prompt=custom_template,
    extra_info="Consider partial credit."  # **kwargs
)
response = judge.call(prompt)
result = ResponseParser.parse_binary_response(response, "0/1")
Comparative Evaluation Pipeline
# Build prompt
prompt = JudgePromptBuilder.build_comparative_prompt(
    question="Explain photosynthesis",
    response1=model_a_output,
    response2=model_b_output,
    score_range=(1, 10)
)
# Get evaluation
response = judge.call(prompt)
# Parse scores
score_a, score_b = ResponseParser.parse_comparative_response(response)
if (score_a, score_b) != (-1.0, -1.0):  # Anything but the sentinel is a valid parse
    if score_a == score_b:
        winner = "Tie"
    else:
        winner = "Model A" if score_a > score_b else "Model B"
    print(f"Winner: {winner} ({score_a} vs {score_b})")
else:
    print("Failed to parse scores")
Rubric-based Evaluation
# Build prompt with rubric
rubric = {
    "accuracy": "Factual correctness",
    "clarity": "Clear explanation",
    "completeness": "Addresses all aspects"
}
rubric_text = "\n".join([f"- {k}: {v}" for k, v in rubric.items()])
prompt = f"""Evaluate the response on these criteria:
{rubric_text}
Response: {prediction}
Provide JSON with scores (0-1) for each criterion."""
# Get evaluation
response = judge.call(prompt)
# Parse JSON
scores = ResponseParser.parse_json_response(response)
print(f"Accuracy: {scores.get('accuracy', 'N/A')}")
print(f"Clarity: {scores.get('clarity', 'N/A')}")
Error Handling Best Practices
Check Parse Results
score1, score2 = ResponseParser.parse_comparative_response(response)
if score1 == -1.0 and score2 == -1.0:
    logger.warning(f"Failed to parse comparative scores from: {response}")
    # Fallback logic
Validate Score Ranges
score = ResponseParser.parse_score_response(response, score_range=(1, 10))
# score is guaranteed to be in [1, 10] range
assert 1 <= score <= 10
Handle JSON Parse Failures
scores = ResponseParser.parse_json_response(response)
if not scores:  # Empty dict indicates parse failure
    logger.error(f"Failed to parse JSON from: {response}")
    # Provide default scores or retry
    scores = {"accuracy": 0.5, "clarity": 0.5}
Integration with Base Classes
The utilities are used throughout the base evaluation methods:
# In ServerInterface.evaluate_binary()
prompt = JudgePromptBuilder.build_binary_prompt(...)
request = Request(messages=[{"role": "user", "content": prompt}], ...)
response = self.evaluate(request)
parsed_result = ResponseParser.parse_binary_response(response.content, output_format)
Related Implementations
- LLM Judge Base: Uses these utilities in evaluation methods
- LLM Judge Prompt Templates: Templates used by JudgePromptBuilder
- LLM Judge Protocol: Request/Response types used with these utilities
Testing Considerations
Prompt Builder Tests
- Test with and without custom prompts
- Verify output_format affects positive/negative symbols
- Confirm kwargs are passed through to custom prompts
- Test context_section inclusion/exclusion
Response Parser Tests
- Test various response formats (verbose, minimal, malformed)
- Verify score clamping behavior
- Test error cases (no numbers, invalid JSON)
- Confirm sentinel values for parse failures
Best Practices
- Always specify output_format explicitly for consistency
- Use score_range for clamping in scoring tasks
- Check for parse failure sentinels (-1.0, empty dict)
- Log raw responses when parsing fails for debugging
- Test parsers with real API responses before deployment
- Consider retry logic if parsing fails (the cause may be a transient formatting issue)
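The last three practices (sentinel checks, logging raw responses, retrying) can be combined in a small wrapper. The sketch below is one possible shape, not part of the library: score_with_retry, fake_judge, and toy_parse are hypothetical names, and the judge is modeled as any callable that maps a prompt to response text. In real use, ResponseParser.parse_comparative_response would be passed as the parse argument.

```python
import logging
from typing import Callable, Tuple

logger = logging.getLogger(__name__)
SENTINEL = (-1.0, -1.0)


def score_with_retry(
    ask_judge: Callable[[str], str],
    parse: Callable[[str], Tuple[float, float]],
    prompt: str,
    max_attempts: int = 3,
) -> Tuple[float, float]:
    """Re-ask the judge until the parser returns something other than the sentinel."""
    for attempt in range(1, max_attempts + 1):
        raw = ask_judge(prompt)
        scores = parse(raw)
        if scores != SENTINEL:
            return scores
        # Keep the raw text so formatting problems can be diagnosed later.
        logger.warning("Attempt %d: could not parse scores from %r", attempt, raw)
    return SENTINEL


# Demo with stand-ins: a judge that answers badly once, then correctly.
replies = iter(["I cannot decide.", "8 6\nResponse 1 is clearer."])


def fake_judge(prompt: str) -> str:
    return next(replies)


def toy_parse(raw: str) -> Tuple[float, float]:
    try:
        parts = raw.splitlines()[0].split()
        return float(parts[0]), float(parts[1])
    except (IndexError, ValueError):
        return SENTINEL


result = score_with_retry(fake_judge, toy_parse, "Compare the two responses.")
print(result)  # (8.0, 6.0)
```

Retrying only helps when the judge's sampling is non-deterministic; with a fixed seed or temperature 0, a rephrased prompt is more likely to change the outcome than a repeat call.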