Implementation:EvolvingLMMs Lab Lmms eval LLM Judge Base
Overview
This implementation provides abstract base classes that define the core interfaces and common functionality for LLM judge implementations. It establishes contracts for evaluation methods and provides reusable implementations for binary, comparative, and rubric-based evaluation patterns.
File Location
/tmp/kapso_repo_sslb_59s/lmms_eval/llm_judge/base.py (241 lines)
Related Principle
Dependencies
abc: Abstract base class supportasyncio: Asynchronous evaluation supporttyping: Type hints for function signatures- LLM Judge Protocol: Request, Response, ServerConfig types
- LLM Judge Utils: JudgePromptBuilder and ResponseParser
Core Components
ServerInterface
Abstract base class defining the core judge interface.
Constructor
def __init__(self, config: Optional[ServerConfig] = None)
Parameters:
config(Optional[ServerConfig]): Configuration for the judge, defaults to ServerConfig with gpt-4
Attributes:
self.config: Stored configuration instance
Abstract Methods
evaluate
@abc.abstractmethod
def evaluate(self, request: Request) -> Response
Evaluate a request and return a response. Must be implemented by subclasses.
Parameters:
request(Request): JudgeRequest containing evaluation context
Returns:
- Response: JudgeResponse with evaluation result
is_available
@abc.abstractmethod
def is_available(self) -> bool
Check if the judge service is available. Must be implemented by subclasses.
Returns:
- bool: True if service is available
Helper Methods
prepare_messages
def prepare_messages(self, request: Request) -> List[Dict[str, Any]]
Prepare messages in the format expected by the API. Adds system prompt if configured and not already present.
Parameters:
request(Request): The evaluation request
Returns:
- List[Dict[str, Any]]: Messages with system prompt prepended if needed
Evaluation Methods
evaluate_binary
def evaluate_binary(
self,
question: str,
answer: str,
prediction: str,
output_format: str = "0/1",
custom_prompt: Optional[str] = None,
**kwargs
) -> Dict[str, Any]
Evaluate binary correctness of a prediction against an answer.
Parameters:
question(str): The question being evaluatedanswer(str): Ground truth answerprediction(str): Model's predicted answeroutput_format(str): Format for output, "0/1" or "yes/no" (default: "0/1")custom_prompt(Optional[str]): Custom evaluation prompt**kwargs: Additional prompt formatting arguments
Returns:
- Dict[str, Any]: Contains:
result: Parsed binary result (1/0 or True/False)raw_response: Raw judge response textmodel: Model name used for evaluationprompt: The complete evaluation promptsuccess: Whether evaluation succeeded
Process:
- Builds binary prompt using JudgePromptBuilder
- Creates Request with prompt
- Calls abstract evaluate() method
- Parses response using ResponseParser
- Returns structured result
evaluate_comparative
def evaluate_comparative(
self,
question: str,
response1: str,
response2: str,
context: Optional[str] = None,
score_range: Tuple[int, int] = (1, 10),
custom_prompt: Optional[str] = None,
images: Optional[List[Union[str, bytes]]] = None,
**kwargs
) -> Dict[str, Any]
Evaluate and compare two responses to the same question.
Parameters:
question(str): The question both responses addressresponse1(str): First response to evaluateresponse2(str): Second response to evaluatecontext(Optional[str]): Additional context for evaluationscore_range(Tuple[int, int]): Min and max scores (default: (1, 10))custom_prompt(Optional[str]): Custom evaluation promptimages(Optional[List[Union[str, bytes]]]): Images for context**kwargs: Additional prompt formatting arguments
Returns:
- Dict[str, Any]: Contains:
scores: Tuple of (score1, score2)raw_response: Raw judge responsemodel: Model usedprompt: Evaluation promptsuccess: Success status
evaluate_with_rubric
def evaluate_with_rubric(
self,
question: str,
prediction: str,
rubric: Dict[str, Any],
**kwargs
) -> Dict[str, Any]
Evaluate a response using a custom rubric with multiple criteria.
Parameters:
question(str): The question being evaluatedprediction(str): Response to evaluaterubric(Dict[str, Any]): Dictionary of criterion_name: description**kwargs: Additional arguments
Returns:
- Dict[str, Any]: Contains:
scores: Parsed JSON with scores for each rubric itemraw_response: Raw responsemodel: Model usedprompt: Evaluation promptsuccess: Success status
Process:
- Formats rubric as bullet list
- Constructs evaluation prompt with question, response, and rubric
- Requests JSON-formatted response
- Parses JSON scores using ResponseParser
AsyncServerInterface
Extends ServerInterface to provide asynchronous evaluation capabilities for high-throughput batch processing.
Constructor
def __init__(self, config: Optional[ServerConfig] = None)
Parameters:
config(Optional[ServerConfig]): Configuration including max_concurrent limit
Attributes:
self.semaphore: asyncio.Semaphore controlling concurrent evaluations (limit from config.max_concurrent)
Abstract Async Methods
evaluate_async
@abc.abstractmethod
async def evaluate_async(self, request: Request) -> Response
Asynchronously evaluate a request. Must be implemented by subclasses.
Parameters:
request(Request): Evaluation request
Returns:
- Response: Evaluation response
Sync Wrapper
evaluate
def evaluate(self, request: Request) -> Response
Synchronous wrapper for async evaluation using event loop.
Parameters:
request(Request): Evaluation request
Returns:
- Response: Evaluation response
Implementation:
- Gets current event loop
- Runs evaluate_async() to completion
- Returns result
Batch Processing
evaluate_batch
async def evaluate_batch(self, requests: List[Request]) -> List[Response]
Evaluate multiple requests concurrently with semaphore-based throttling.
Parameters:
requests(List[Request]): List of evaluation requests
Returns:
- List[Response]: Responses in same order as requests
Implementation:
- Creates async tasks for all requests
- Uses asyncio.gather() for concurrent execution
- Semaphore (from constructor) limits concurrency to config.max_concurrent
Async Evaluation Methods
evaluate_binary_async
async def evaluate_binary_async(
self,
question: str,
answer: str,
prediction: str,
output_format: str = "0/1",
custom_prompt: Optional[str] = None,
**kwargs
) -> Dict[str, Any]
Asynchronously evaluate binary correctness. Same parameters and return format as synchronous evaluate_binary().
evaluate_binary_batch_async
async def evaluate_binary_batch_async(
self,
questions: List[str],
answers: List[str],
predictions: List[str],
output_format: str = "0/1",
custom_prompt: Optional[str] = None,
**kwargs
) -> List[Dict[str, Any]]
Asynchronously evaluate multiple binary correctness tasks in batch.
Parameters:
questions(List[str]): List of questionsanswers(List[str]): List of ground truth answerspredictions(List[str]): List of predictions to evaluateoutput_format(str): Output format (default: "0/1")custom_prompt(Optional[str]): Custom evaluation prompt**kwargs: Additional arguments
Returns:
- List[Dict[str, Any]]: List of evaluation results
Validation:
- Raises ValueError if input lists have different lengths
evaluate_comparative_async
async def evaluate_comparative_async(
self,
question: str,
response1: str,
response2: str,
context: Optional[str] = None,
score_range: Tuple[int, int] = (1, 10),
custom_prompt: Optional[str] = None,
images: Optional[List[Union[str, bytes]]] = None,
**kwargs
) -> Dict[str, Any]
Asynchronously evaluate comparative responses. Same parameters and return format as synchronous evaluate_comparative().
evaluate_comparative_batch_async
async def evaluate_comparative_batch_async(
self,
questions: List[str],
responses1: List[str],
responses2: List[str],
contexts: Optional[List[Optional[str]]] = None,
score_range: Tuple[int, int] = (1, 10),
custom_prompt: Optional[str] = None,
images_list: Optional[List[Optional[List[Union[str, bytes]]]]] = None,
**kwargs
) -> List[Dict[str, Any]]
Asynchronously evaluate multiple comparative response tasks in batch.
Parameters:
questions(List[str]): List of questionsresponses1(List[str]): List of first responsesresponses2(List[str]): List of second responsescontexts(Optional[List[Optional[str]]]): List of contexts (defaults to [None] * len(questions))score_range(Tuple[int, int]): Score range (default: (1, 10))custom_prompt(Optional[str]): Custom promptimages_list(Optional[List[Optional[List[Union[str, bytes]]]]]): List of image lists (defaults to [None] * len(questions))**kwargs: Additional arguments
Returns:
- List[Dict[str, Any]]: List of evaluation results
Validation:
- Raises ValueError if questions and responses lists have different lengths
- Fills in None defaults for contexts and images_list if not provided
evaluate_with_rubric_async
async def evaluate_with_rubric_async(
self,
question: str,
prediction: str,
rubric: Dict[str, Any],
**kwargs
) -> Dict[str, Any]
Asynchronously evaluate with custom rubric. Same parameters and return format as synchronous evaluate_with_rubric().
Design Patterns
Template Method Pattern
Base classes implement high-level evaluation logic (prompt building, request creation, response parsing) while leaving low-level API calls (evaluate, evaluate_async) to subclasses.
Adapter Pattern
Synchronous evaluate() method adapts async evaluate_async() for sync contexts using event loop wrapper.
Semaphore Pattern
AsyncServerInterface uses semaphore to limit concurrent API requests, preventing rate limit issues and resource exhaustion.
Builder Pattern
Delegates prompt construction to JudgePromptBuilder for maintainability and reusability.
Usage Example
Implementing a Custom Judge
from lmms_eval.llm_judge.base import ServerInterface
from lmms_eval.llm_judge.protocol import Request, Response, ServerConfig
class CustomJudge(ServerInterface):
def __init__(self, config: Optional[ServerConfig] = None):
super().__init__(config)
# Initialize API client, etc.
def evaluate(self, request: Request) -> Response:
# Call custom API
messages = self.prepare_messages(request)
api_response = self.api_client.call(messages)
return Response(
content=api_response.text,
model_used=self.config.model_name,
success=True
)
def is_available(self) -> bool:
# Check service health
return self.api_client.ping()
# Use the judge
judge = CustomJudge(config=ServerConfig(model_name="my-model"))
result = judge.evaluate_binary(
question="What is the capital of France?",
answer="Paris",
prediction="The capital is Paris"
)
print(result["result"]) # 1 (correct)
Async Batch Evaluation
from lmms_eval.llm_judge.base import AsyncServerInterface
class AsyncCustomJudge(AsyncServerInterface):
async def evaluate_async(self, request: Request) -> Response:
async with self.semaphore: # Respects max_concurrent
messages = self.prepare_messages(request)
response = await self.api_client.call_async(messages)
return Response(
content=response.text,
model_used=self.config.model_name,
success=True
)
def is_available(self) -> bool:
return True
# Batch evaluation
config = ServerConfig(model_name="my-model", max_concurrent=5)
judge = AsyncCustomJudge(config)
questions = ["Q1", "Q2", "Q3", ...]
answers = ["A1", "A2", "A3", ...]
predictions = ["P1", "P2", "P3", ...]
results = await judge.evaluate_binary_batch_async(
questions, answers, predictions
)
Extension Points
Subclasses must implement:
evaluate()orevaluate_async(): Core evaluation logicis_available(): Service health check
Subclasses may override:
prepare_messages(): Custom message formatting- Configuration handling in constructor
- Error handling and retry logic
Related Implementations
- LLM Judge Factory: Creates instances of judge implementations
- LLM Judge Protocol: Defines Request, Response, ServerConfig types
- LLM Judge Utils: Provides JudgePromptBuilder and ResponseParser
- LLM Judge Prompt Templates: Prompt templates used by evaluation methods
Best Practices
- Always use semaphore in async implementations to respect rate limits
- Validate input list lengths in batch methods to fail fast
- Use prepare_messages() to ensure consistent message formatting
- Return structured results with success flags for robust error handling
- Leverage built-in prompt builders for consistency across implementations