Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:EvolvingLMMs Lab Lmms eval LLM Judge Base

From Leeroopedia
Revision as of 12:31, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/EvolvingLMMs_Lab_Lmms_eval_LLM_Judge_Base.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Overview

This implementation provides abstract base classes that define the core interfaces and common functionality for LLM judge implementations. It establishes contracts for evaluation methods and provides reusable implementations for binary, comparative, and rubric-based evaluation patterns.

File Location

/tmp/kapso_repo_sslb_59s/lmms_eval/llm_judge/base.py (241 lines)

Related Principle

LLM as Judge

Dependencies

  • abc: Abstract base class support
  • asyncio: Asynchronous evaluation support
  • typing: Type hints for function signatures
  • LLM Judge Protocol: Request, Response, ServerConfig types
  • LLM Judge Utils: JudgePromptBuilder and ResponseParser

Core Components

ServerInterface

Abstract base class defining the core judge interface.

Constructor

def __init__(self, config: Optional[ServerConfig] = None)

Parameters:

  • config (Optional[ServerConfig]): Configuration for the judge, defaults to ServerConfig with gpt-4

Attributes:

  • self.config: Stored configuration instance

Abstract Methods

evaluate

@abc.abstractmethod
def evaluate(self, request: Request) -> Response

Evaluate a request and return a response. Must be implemented by subclasses.

Parameters:

  • request (Request): JudgeRequest containing evaluation context

Returns:

  • Response: JudgeResponse with evaluation result

is_available

@abc.abstractmethod
def is_available(self) -> bool

Check if the judge service is available. Must be implemented by subclasses.

Returns:

  • bool: True if service is available

Helper Methods

prepare_messages

def prepare_messages(self, request: Request) -> List[Dict[str, Any]]

Prepare messages in the format expected by the API. Adds system prompt if configured and not already present.

Parameters:

  • request (Request): The evaluation request

Returns:

  • List[Dict[str, Any]]: Messages with system prompt prepended if needed

Evaluation Methods

evaluate_binary

def evaluate_binary(
    self,
    question: str,
    answer: str,
    prediction: str,
    output_format: str = "0/1",
    custom_prompt: Optional[str] = None,
    **kwargs
) -> Dict[str, Any]

Evaluate binary correctness of a prediction against an answer.

Parameters:

  • question (str): The question being evaluated
  • answer (str): Ground truth answer
  • prediction (str): Model's predicted answer
  • output_format (str): Format for output, "0/1" or "yes/no" (default: "0/1")
  • custom_prompt (Optional[str]): Custom evaluation prompt
  • **kwargs: Additional prompt formatting arguments

Returns:

  • Dict[str, Any]: Contains:
    • result: Parsed binary result (1/0 or True/False)
    • raw_response: Raw judge response text
    • model: Model name used for evaluation
    • prompt: The complete evaluation prompt
    • success: Whether evaluation succeeded

Process:

  1. Builds binary prompt using JudgePromptBuilder
  2. Creates Request with prompt
  3. Calls abstract evaluate() method
  4. Parses response using ResponseParser
  5. Returns structured result

evaluate_comparative

def evaluate_comparative(
    self,
    question: str,
    response1: str,
    response2: str,
    context: Optional[str] = None,
    score_range: Tuple[int, int] = (1, 10),
    custom_prompt: Optional[str] = None,
    images: Optional[List[Union[str, bytes]]] = None,
    **kwargs
) -> Dict[str, Any]

Evaluate and compare two responses to the same question.

Parameters:

  • question (str): The question both responses address
  • response1 (str): First response to evaluate
  • response2 (str): Second response to evaluate
  • context (Optional[str]): Additional context for evaluation
  • score_range (Tuple[int, int]): Min and max scores (default: (1, 10))
  • custom_prompt (Optional[str]): Custom evaluation prompt
  • images (Optional[List[Union[str, bytes]]]): Images for context
  • **kwargs: Additional prompt formatting arguments

Returns:

  • Dict[str, Any]: Contains:
    • scores: Tuple of (score1, score2)
    • raw_response: Raw judge response
    • model: Model used
    • prompt: Evaluation prompt
    • success: Success status

evaluate_with_rubric

def evaluate_with_rubric(
    self,
    question: str,
    prediction: str,
    rubric: Dict[str, Any],
    **kwargs
) -> Dict[str, Any]

Evaluate a response using a custom rubric with multiple criteria.

Parameters:

  • question (str): The question being evaluated
  • prediction (str): Response to evaluate
  • rubric (Dict[str, Any]): Dictionary of criterion_name: description
  • **kwargs: Additional arguments

Returns:

  • Dict[str, Any]: Contains:
    • scores: Parsed JSON with scores for each rubric item
    • raw_response: Raw response
    • model: Model used
    • prompt: Evaluation prompt
    • success: Success status

Process:

  1. Formats rubric as bullet list
  2. Constructs evaluation prompt with question, response, and rubric
  3. Requests JSON-formatted response
  4. Parses JSON scores using ResponseParser

AsyncServerInterface

Extends ServerInterface to provide asynchronous evaluation capabilities for high-throughput batch processing.

Constructor

def __init__(self, config: Optional[ServerConfig] = None)

Parameters:

  • config (Optional[ServerConfig]): Configuration including max_concurrent limit

Attributes:

  • self.semaphore: asyncio.Semaphore controlling concurrent evaluations (limit from config.max_concurrent)

Abstract Async Methods

evaluate_async

@abc.abstractmethod
async def evaluate_async(self, request: Request) -> Response

Asynchronously evaluate a request. Must be implemented by subclasses.

Parameters:

  • request (Request): Evaluation request

Returns:

  • Response: Evaluation response

Sync Wrapper

evaluate

def evaluate(self, request: Request) -> Response

Synchronous wrapper for async evaluation using event loop.

Parameters:

  • request (Request): Evaluation request

Returns:

  • Response: Evaluation response

Implementation:

  • Gets current event loop
  • Runs evaluate_async() to completion
  • Returns result

Batch Processing

evaluate_batch

async def evaluate_batch(self, requests: List[Request]) -> List[Response]

Evaluate multiple requests concurrently with semaphore-based throttling.

Parameters:

  • requests (List[Request]): List of evaluation requests

Returns:

  • List[Response]: Responses in same order as requests

Implementation:

  • Creates async tasks for all requests
  • Uses asyncio.gather() for concurrent execution
  • Semaphore (from constructor) limits concurrency to config.max_concurrent

Async Evaluation Methods

evaluate_binary_async

async def evaluate_binary_async(
    self,
    question: str,
    answer: str,
    prediction: str,
    output_format: str = "0/1",
    custom_prompt: Optional[str] = None,
    **kwargs
) -> Dict[str, Any]

Asynchronously evaluate binary correctness. Same parameters and return format as synchronous evaluate_binary().

evaluate_binary_batch_async

async def evaluate_binary_batch_async(
    self,
    questions: List[str],
    answers: List[str],
    predictions: List[str],
    output_format: str = "0/1",
    custom_prompt: Optional[str] = None,
    **kwargs
) -> List[Dict[str, Any]]

Asynchronously evaluate multiple binary correctness tasks in batch.

Parameters:

  • questions (List[str]): List of questions
  • answers (List[str]): List of ground truth answers
  • predictions (List[str]): List of predictions to evaluate
  • output_format (str): Output format (default: "0/1")
  • custom_prompt (Optional[str]): Custom evaluation prompt
  • **kwargs: Additional arguments

Returns:

  • List[Dict[str, Any]]: List of evaluation results

Validation:

  • Raises ValueError if input lists have different lengths

evaluate_comparative_async

async def evaluate_comparative_async(
    self,
    question: str,
    response1: str,
    response2: str,
    context: Optional[str] = None,
    score_range: Tuple[int, int] = (1, 10),
    custom_prompt: Optional[str] = None,
    images: Optional[List[Union[str, bytes]]] = None,
    **kwargs
) -> Dict[str, Any]

Asynchronously evaluate comparative responses. Same parameters and return format as synchronous evaluate_comparative().

evaluate_comparative_batch_async

async def evaluate_comparative_batch_async(
    self,
    questions: List[str],
    responses1: List[str],
    responses2: List[str],
    contexts: Optional[List[Optional[str]]] = None,
    score_range: Tuple[int, int] = (1, 10),
    custom_prompt: Optional[str] = None,
    images_list: Optional[List[Optional[List[Union[str, bytes]]]]] = None,
    **kwargs
) -> List[Dict[str, Any]]

Asynchronously evaluate multiple comparative response tasks in batch.

Parameters:

  • questions (List[str]): List of questions
  • responses1 (List[str]): List of first responses
  • responses2 (List[str]): List of second responses
  • contexts (Optional[List[Optional[str]]]): List of contexts (defaults to [None] * len(questions))
  • score_range (Tuple[int, int]): Score range (default: (1, 10))
  • custom_prompt (Optional[str]): Custom prompt
  • images_list (Optional[List[Optional[List[Union[str, bytes]]]]]): List of image lists (defaults to [None] * len(questions))
  • **kwargs: Additional arguments

Returns:

  • List[Dict[str, Any]]: List of evaluation results

Validation:

  • Raises ValueError if questions and responses lists have different lengths
  • Fills in None defaults for contexts and images_list if not provided

evaluate_with_rubric_async

async def evaluate_with_rubric_async(
    self,
    question: str,
    prediction: str,
    rubric: Dict[str, Any],
    **kwargs
) -> Dict[str, Any]

Asynchronously evaluate with custom rubric. Same parameters and return format as synchronous evaluate_with_rubric().

Design Patterns

Template Method Pattern

Base classes implement high-level evaluation logic (prompt building, request creation, response parsing) while leaving low-level API calls (evaluate, evaluate_async) to subclasses.

Adapter Pattern

Synchronous evaluate() method adapts async evaluate_async() for sync contexts using event loop wrapper.

Semaphore Pattern

AsyncServerInterface uses semaphore to limit concurrent API requests, preventing rate limit issues and resource exhaustion.

Builder Pattern

Delegates prompt construction to JudgePromptBuilder for maintainability and reusability.

Usage Example

Implementing a Custom Judge

from lmms_eval.llm_judge.base import ServerInterface
from lmms_eval.llm_judge.protocol import Request, Response, ServerConfig

class CustomJudge(ServerInterface):
    def __init__(self, config: Optional[ServerConfig] = None):
        super().__init__(config)
        # Initialize API client, etc.

    def evaluate(self, request: Request) -> Response:
        # Call custom API
        messages = self.prepare_messages(request)
        api_response = self.api_client.call(messages)

        return Response(
            content=api_response.text,
            model_used=self.config.model_name,
            success=True
        )

    def is_available(self) -> bool:
        # Check service health
        return self.api_client.ping()

# Use the judge
judge = CustomJudge(config=ServerConfig(model_name="my-model"))
result = judge.evaluate_binary(
    question="What is the capital of France?",
    answer="Paris",
    prediction="The capital is Paris"
)
print(result["result"])  # 1 (correct)

Async Batch Evaluation

from lmms_eval.llm_judge.base import AsyncServerInterface

class AsyncCustomJudge(AsyncServerInterface):
    async def evaluate_async(self, request: Request) -> Response:
        async with self.semaphore:  # Respects max_concurrent
            messages = self.prepare_messages(request)
            response = await self.api_client.call_async(messages)
            return Response(
                content=response.text,
                model_used=self.config.model_name,
                success=True
            )

    def is_available(self) -> bool:
        return True

# Batch evaluation
config = ServerConfig(model_name="my-model", max_concurrent=5)
judge = AsyncCustomJudge(config)

questions = ["Q1", "Q2", "Q3", ...]
answers = ["A1", "A2", "A3", ...]
predictions = ["P1", "P2", "P3", ...]

results = await judge.evaluate_binary_batch_async(
    questions, answers, predictions
)

Extension Points

Subclasses must implement:

  1. evaluate() or evaluate_async(): Core evaluation logic
  2. is_available(): Service health check

Subclasses may override:

  1. prepare_messages(): Custom message formatting
  2. Configuration handling in constructor
  3. Error handling and retry logic

Related Implementations

Best Practices

  1. Always use semaphore in async implementations to respect rate limits
  2. Validate input list lengths in batch methods to fail fast
  3. Use prepare_messages() to ensure consistent message formatting
  4. Return structured results with success flags for robust error handling
  5. Leverage built-in prompt builders for consistency across implementations

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment