Implementation:EvolvingLMMs Lab Lmms eval LLM Judge Protocol

From Leeroopedia
Revision as of 12:31, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/EvolvingLMMs_Lab_Lmms_eval_LLM_Judge_Protocol.md)

Overview

This implementation defines the protocol layer for LLM judge evaluation, providing type-safe data structures for requests, responses, and configuration. It establishes a standardized contract for communication between judge implementations and callers, using Python dataclasses with comprehensive type hints.

File Location

lmms_eval/llm_judge/protocol.py (69 lines)

Related Principle

LLM as Judge

Dependencies

  • dataclasses: Dataclass decorator and field utilities
  • typing: Type hints (Any, Dict, List, Optional, Tuple, Union)

Constants

Retry Configuration

DEFAULT_NUM_RETRIES = 5
DEFAULT_RETRY_DELAY = 10  # seconds

Default retry behavior for API calls:

  • DEFAULT_NUM_RETRIES: Number of retry attempts on transient failures
  • DEFAULT_RETRY_DELAY: Delay between retry attempts in seconds

These values balance reliability (sufficient retries) with responsiveness (avoiding excessive delays).
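
A caller might consume these constants in a fixed-delay retry loop along the following lines. This is a minimal sketch, not code from protocol.py: call_with_retries and TransientError are hypothetical names, and the sketch assumes num_retries counts total attempts, which the real providers may interpret differently.

```python
import time
from typing import Callable, TypeVar

DEFAULT_NUM_RETRIES = 5
DEFAULT_RETRY_DELAY = 10  # seconds

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (rate limits, timeouts)."""


def call_with_retries(
    fn: Callable[[], T],
    num_retries: int = DEFAULT_NUM_RETRIES,
    retry_delay: float = DEFAULT_RETRY_DELAY,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn up to num_retries times, sleeping retry_delay between attempts.

    Assumption: num_retries counts total attempts, including the first one.
    """
    last_error = None
    for attempt in range(num_retries):
        try:
            return fn()
        except TransientError as exc:
            last_error = exc
            if attempt < num_retries - 1:  # no sleep after the final attempt
                sleep(retry_delay)
    raise RuntimeError(f"all {num_retries} attempts failed") from last_error
```

The sleep parameter is injectable so that the loop can be exercised in tests without actually waiting.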

Data Structures

ServerConfig

Configuration dataclass for judge models and API behavior.

@dataclass
class ServerConfig:
    """Configuration for judge models"""

    model_name: str
    temperature: float = 0.0
    max_tokens: int = 1024
    top_p: Optional[float] = None
    timeout: int = 60
    num_retries: int = DEFAULT_NUM_RETRIES
    retry_delay: float = DEFAULT_RETRY_DELAY
    max_concurrent: int = 10

    system_prompt: Optional[str] = None
    response_format: Optional[str] = None

    judge_type: str = "general"
    output_format: Optional[str] = None
    score_range: Optional[Tuple[float, float]] = None
    evaluation_criteria: Optional[Dict[str, Any]] = None

Core Model Parameters

model_name (str, required)

  • Model identifier for API calls
  • Examples: "gpt-4", "gpt-3.5-turbo", "gpt-4-turbo"
  • Required field - must be specified

temperature (float, default: 0.0)

  • Sampling temperature for model responses
  • Typical range for OpenAI-style APIs: 0.0 (near-deterministic) to 2.0 (very random)
  • Default 0.0 for consistent, reproducible evaluation
  • Higher values increase response variability

max_tokens (int, default: 1024)

  • Maximum number of tokens in model response
  • Controls response length and API cost
  • Should accommodate expected evaluation output

top_p (Optional[float], default: None)

  • Nucleus sampling parameter
  • Range: 0.0 to 1.0
  • Alternative to temperature for controlling randomness
  • None uses API default

API Behavior Parameters

timeout (int, default: 60)

  • Request timeout in seconds
  • Prevents hanging on slow API responses
  • Balance between allowing complex evaluations and failing fast

num_retries (int, default: 5)

  • Number of retry attempts on transient failures
  • Handles rate limits, network issues, temporary outages
  • Higher values increase reliability but delay final failure

retry_delay (float, default: 10)

  • Seconds to wait between retry attempts
  • The protocol only stores this base value; a provider may layer exponential backoff on top of it
  • Prevents overwhelming rate-limited APIs

max_concurrent (int, default: 10)

  • Maximum concurrent requests for async providers
  • Controls rate limiting and resource usage
  • Higher values increase throughput but may hit rate limits
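
An async provider could enforce max_concurrent with a semaphore. The sketch below is an assumption about one reasonable approach, not the providers' actual code; run_bounded is a hypothetical helper name.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_bounded(
    jobs: List[Callable[[], Awaitable[str]]],
    max_concurrent: int = 10,
) -> List[str]:
    """Run all jobs, allowing at most max_concurrent to be in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(job: Callable[[], Awaitable[str]]) -> str:
        async with semaphore:  # waits here while max_concurrent jobs are running
            return await job()

    # gather returns results in the same order as the input jobs
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Each job is a zero-argument callable returning an awaitable, so the coroutine for the underlying API call starts only once a semaphore slot is free.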

Response Configuration

system_prompt (Optional[str], default: None)

  • System-level instruction for the judge model
  • Prepended to messages if provided
  • Can establish evaluation persona or constraints

response_format (Optional[str], default: None)

  • Expected response format
  • Values: "json", "text", or None
  • Some APIs support structured output formats

Judge-Specific Parameters

judge_type (str, default: "general")

  • Type of evaluation being performed
  • Values: "general", "binary", "score", "comparative"
  • Helps implementations optimize behavior

output_format (Optional[str], default: None)

  • Format for binary evaluations
  • Values: "0/1", "1/0", "yes/no"
  • Affects prompt construction and parsing

score_range (Optional[Tuple[float, float]], default: None)

  • Minimum and maximum scores for scoring judges
  • Example: (1, 10) for 1-10 scale
  • Used in comparative evaluation
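
How implementations consume score_range is not shown on this page. As a minimal sketch, a caller could use it to reject out-of-range scores. score_in_range is a hypothetical helper, and treating None as "no constraint" is an assumption.

```python
from typing import Optional, Tuple


def score_in_range(score: float, score_range: Optional[Tuple[float, float]]) -> bool:
    """Return True when score lies within the inclusive (min, max) bounds.

    Assumption: a score_range of None means no bounds are enforced.
    """
    if score_range is None:
        return True
    low, high = score_range
    return low <= score <= high
```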

evaluation_criteria (Optional[Dict[str, Any]], default: None)

  • Custom evaluation criteria or rubric
  • Flexible structure for domain-specific requirements
  • Example: {"accuracy": "weight: 0.4", "clarity": "weight: 0.3"}

Usage Example

# Basic configuration
config = ServerConfig(model_name="gpt-4")

# Detailed configuration
config = ServerConfig(
    model_name="gpt-4-turbo",
    temperature=0.0,
    max_tokens=512,
    timeout=30,
    num_retries=3,
    retry_delay=5,
    max_concurrent=20,
    system_prompt="You are a strict evaluator.",
    response_format="json",
    judge_type="binary",
    output_format="0/1"
)

Request

Standard request format for judge evaluation.

@dataclass
class Request:
    """Standard request format for judge evaluation"""

    messages: List[Dict[str, Any]]
    images: Optional[List[Union[str, bytes]]] = None
    config: Optional[ServerConfig] = None

    question: Optional[str] = None
    answer: Optional[str] = None
    prediction: Optional[str] = None
    context: Optional[str] = None
    options: Optional[List[str]] = None

    response1: Optional[str] = None
    response2: Optional[str] = None

    custom_prompt: Optional[str] = None
    prompt_kwargs: Dict[str, Any] = field(default_factory=dict)

Core Request Fields

messages (List[Dict[str, Any]], required)

  • List of message dictionaries in chat format
  • Each message: {"role": "user"/"system"/"assistant", "content": "..."}
  • Primary input for judge evaluation
  • Example: [{"role": "user", "content": "Evaluate this response..."}]

images (Optional[List[Union[str, bytes]]], default: None)

  • Image inputs for multimodal evaluation
  • Can be file paths (str) or base64-encoded bytes
  • Supports visual question answering evaluation

config (Optional[ServerConfig], default: None)

  • Per-request configuration override
  • Falls back to judge's default config if None
  • Allows request-specific parameters
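
The fallback rule for config can be sketched as a small helper. The classes below are trimmed stand-ins for the listings on this page, and resolve_config is a hypothetical name, not a function from lmms_eval.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerConfig:  # trimmed stand-in: only two of the real fields
    model_name: str
    temperature: float = 0.0


@dataclass
class Request:  # trimmed stand-in: only the field this sketch needs
    config: Optional[ServerConfig] = None


def resolve_config(request: Request, default: ServerConfig) -> ServerConfig:
    """Prefer the request's own config; otherwise use the judge-level default."""
    return request.config if request.config is not None else default
```

The explicit `is not None` check avoids relying on the truthiness of a dataclass instance.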

Structured Evaluation Fields

question (Optional[str])

  • The question being evaluated
  • Used in binary and comparative evaluation

answer (Optional[str])

  • Ground truth answer
  • Reference for correctness evaluation

prediction (Optional[str])

  • Model's predicted answer
  • Subject of evaluation

context (Optional[str])

  • Additional context for evaluation
  • Background information, constraints, etc.

options (Optional[List[str]])

  • Answer choices for multiple-choice questions
  • Example: ["A. Paris", "B. London", "C. Berlin"]

Comparative Evaluation Fields

response1 (Optional[str])

  • First response for comparison

response2 (Optional[str])

  • Second response for comparison

Custom Prompt Fields

custom_prompt (Optional[str])

  • Override default prompt template
  • Must include appropriate placeholders

prompt_kwargs (Dict[str, Any])

  • Additional keyword arguments for prompt formatting
  • Default: empty dict (via field(default_factory=dict))
  • Allows dynamic prompt customization
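
The interaction between custom_prompt and prompt_kwargs might look like the sketch below. It assumes placeholders use str.format brace syntax, which this page does not confirm. DEFAULT_TEMPLATE and build_prompt are hypothetical names.

```python
from typing import Any, Dict, Optional

# Hypothetical default template; the real templates are defined elsewhere.
DEFAULT_TEMPLATE = "Question: {question}\nAnswer: {answer}\nPrediction: {prediction}"


def build_prompt(
    question: str,
    answer: str,
    prediction: str,
    custom_prompt: Optional[str] = None,
    prompt_kwargs: Optional[Dict[str, Any]] = None,
) -> str:
    """Fill the chosen template from the structured fields plus prompt_kwargs."""
    template = custom_prompt if custom_prompt is not None else DEFAULT_TEMPLATE
    values = {"question": question, "answer": answer, "prediction": prediction}
    values.update(prompt_kwargs or {})  # extra placeholders; may override the basics
    return template.format(**values)  # unused keys are ignored by str.format
```

A template that names a placeholder missing from both the structured fields and prompt_kwargs raises KeyError, which is why a custom prompt must include only placeholders the caller can fill.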

Usage Example

# Binary evaluation request
request = Request(
    messages=[{"role": "user", "content": "Evaluate..."}],
    question="What is 2+2?",
    answer="4",
    prediction="Four",
    config=config
)

# Comparative evaluation request
request = Request(
    messages=[{"role": "user", "content": "Compare..."}],
    question="Explain quantum computing",
    response1="Response from Model A",
    response2="Response from Model B",
    context="For a general audience",
    config=config
)

# Multimodal request
request = Request(
    messages=[{"role": "user", "content": "Evaluate this image description"}],
    images=["path/to/image.jpg"],
    question="What is in the image?",
    answer="A cat",
    prediction="A cat sitting on a couch"
)

Response

Standard response format from judge evaluation.

@dataclass
class Response:
    """Standard response format from judge evaluation"""

    content: str
    model_used: str
    usage: Optional[Dict[str, int]] = None
    raw_response: Optional[Any] = None

    parsed_result: Optional[Union[int, float, bool, Tuple[float, float], Dict[str, Any]]] = None
    success: bool = True
    error_message: Optional[str] = None

Core Response Fields

content (str, required)

  • Raw text content from judge model
  • Complete response including reasoning (if any)
  • Subject to further parsing

model_used (str, required)

  • Identifier of the model that generated the response
  • May differ from requested model (e.g., fallback scenarios)
  • Important for tracking evaluation provenance

API Metadata

usage (Optional[Dict[str, int]], default: None)

  • Token usage statistics
  • Typical structure: {"prompt_tokens": 100, "completion_tokens": 50, "total_tokens": 150}
  • Useful for cost tracking and monitoring
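
For cost tracking, usage dictionaries from many responses can be summed key by key. A minimal sketch, with total_usage as a hypothetical helper name:

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def total_usage(usages: Iterable[Optional[Dict[str, int]]]) -> Dict[str, int]:
    """Sum token counts key by key, skipping responses that reported no usage."""
    totals: Counter = Counter()
    for usage in usages:
        if usage:  # usage is None when the provider returned no statistics
            totals.update(usage)  # Counter.update adds counts rather than replacing
    return dict(totals)
```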

raw_response (Optional[Any], default: None)

  • Complete raw API response object
  • Preserved for debugging and auditing
  • May contain additional API-specific metadata

Parsed Results

parsed_result (Optional[Union[int, float, bool, Tuple[float, float], Dict[str, Any]]], default: None)

  • Structured result extracted from content
  • Type depends on evaluation type:
    • Binary: int (0/1) or bool
    • Score: float
    • Comparative: Tuple[float, float] (two scores)
    • Rubric: Dict[str, Any] (criterion -> score mapping)
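
For the comparative case, extracting a Tuple[float, float] from content could be sketched as follows. It assumes the judge writes both scores on the first line of its reply, which is a prompt-level convention rather than something the protocol enforces. parse_score_pair is a hypothetical helper.

```python
import re
from typing import Optional, Tuple


def parse_score_pair(content: str) -> Optional[Tuple[float, float]]:
    """Read two numbers from the first line of content; None if not found."""
    first_line = content.strip().split("\n", 1)[0]
    numbers = re.findall(r"-?\d+(?:\.\d+)?", first_line)
    if len(numbers) < 2:
        return None
    return float(numbers[0]), float(numbers[1])
```

Returning None on a parse failure lets the caller decide whether to mark the Response as unsuccessful.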

Status Fields

success (bool, default: True)

  • Whether evaluation completed successfully
  • False indicates error or failure

error_message (Optional[str], default: None)

  • Error description if success=False
  • Helpful for debugging and error handling
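
A provider might populate the status fields by converting exceptions into a failed Response instead of letting them propagate. The sketch below uses a trimmed stand-in for Response and a hypothetical safe_evaluate wrapper.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Response:  # trimmed stand-in for the listing on this page
    content: str
    model_used: str
    parsed_result: Optional[Any] = None
    success: bool = True
    error_message: Optional[str] = None


def safe_evaluate(call: Callable[[], str], model_name: str) -> Response:
    """Run call() and convert any exception into a failed Response."""
    try:
        return Response(content=call(), model_used=model_name)
    except Exception as exc:  # broad on purpose: callers inspect error_message
        return Response(
            content="",
            model_used=model_name,
            success=False,
            error_message=str(exc),
        )
```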

Usage Example

# Successful binary evaluation response
response = Response(
    content="1",
    model_used="gpt-4",
    usage={"prompt_tokens": 120, "completion_tokens": 1, "total_tokens": 121},
    parsed_result=1,
    success=True
)

# Comparative evaluation response
response = Response(
    content="8 7\nAssistant 1 provided more detail...",
    model_used="gpt-4-turbo",
    parsed_result=(8.0, 7.0),
    success=True
)

# Failed evaluation response
response = Response(
    content="",
    model_used="gpt-4",
    success=False,
    error_message="API rate limit exceeded"
)

Design Principles

Type Safety

All fields have explicit type hints, enabling:

  • Static type checking with mypy
  • IDE autocomplete and inline documentation
  • Early detection of type errors
  • Self-documenting interfaces

Immutability

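The classes are declared with a plain @dataclass decorator, so instances are mutable by default; true immutability would require @dataclass(frozen=True). Treating instances as read-only after creation still promotes:

  • Safer sharing across threads
  • Predictable behavior
  • Easier debugging (no unexpected mutations)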
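
For reference, the standard library enforces immutability only when frozen=True is passed to the decorator. The sketch below uses a hypothetical FrozenConfig class, not the real ServerConfig, to show the effect and how dataclasses.replace derives a modified copy.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class FrozenConfig:  # hypothetical frozen variant for illustration
    model_name: str
    temperature: float = 0.0


base = FrozenConfig(model_name="gpt-4")

try:
    base.temperature = 0.7  # assignment on a frozen instance raises
    mutated = True
except FrozenInstanceError:
    mutated = False

# replace() builds a modified copy and leaves the original untouched
warmer = replace(base, temperature=0.3)
```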

Optional Fields with Sensible Defaults

Most fields are optional with reasonable defaults:

  • Reduces boilerplate for simple cases
  • Allows gradual complexity increase
  • Backwards compatible with simple usage patterns

Separation of Concerns

Three distinct data structures:

  • ServerConfig: How to call the API
  • Request: What to evaluate
  • Response: Evaluation results

Flexibility

Support for multiple evaluation paradigms:

  • Binary (question/answer/prediction)
  • Comparative (response1/response2)
  • Rubric-based (evaluation_criteria)
  • Custom (prompt_kwargs, custom_prompt)

Extensibility

Dict fields (prompt_kwargs, evaluation_criteria, usage) allow:

  • Domain-specific extensions
  • Future field additions without breaking changes
  • Custom metadata

Integration Patterns

Request Construction

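from lmms_eval.llm_judge.protocol import Request, ServerConfig

# Placeholder inputs for illustration; real values come from the task under evaluation
q, a, p = "What is 2+2?", "4", "Four"
prompt = f"Question: {q}\nReference answer: {a}\nPrediction: {p}\nIs the prediction correct?"

config = ServerConfig(model_name="gpt-4", temperature=0.0)

request = Request(
    messages=[{"role": "user", "content": prompt}],
    question=q,
    answer=a,
    prediction=p,
    config=config
)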

Response Processing

# judge is assumed to be an already-constructed judge object exposing evaluate()
response = judge.evaluate(request)

if response.success:
    print(f"Result: {response.parsed_result}")
    print(f"Model: {response.model_used}")
    print(f"Tokens: {response.usage}")
else:
    print(f"Error: {response.error_message}")

Configuration Management

# Default config for all requests
default_config = ServerConfig(
    model_name="gpt-4",
    temperature=0.0,
    max_tokens=512
)

# Override for specific request
special_config = ServerConfig(
    model_name="gpt-4-turbo",
    temperature=0.3,  # Slightly more creative
    max_tokens=2048
)

request = Request(messages=[...], config=special_config)

Validation Considerations

While the dataclasses don't include runtime validation, implementations should validate:

  • Required fields are non-empty (model_name, messages, content, model_used); the generated constructor only checks that they are passed
  • Value ranges (temperature, top_p, timeout)
  • Consistency (if response1 is set, response2 should be too)
  • Format expectations (output_format matches parsing logic)
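
The checks above could be collected into a helper such as the following sketch. validation_errors is a hypothetical function, and the bounds it applies (temperature in [0.0, 2.0], top_p in [0.0, 1.0], positive timeout) are taken from the ranges stated on this page rather than from any validation code in the library.

```python
from typing import List, Optional


def validation_errors(
    model_name: str,
    temperature: float,
    top_p: Optional[float],
    timeout: int,
    response1: Optional[str] = None,
    response2: Optional[str] = None,
) -> List[str]:
    """Collect human-readable problems instead of raising on the first one."""
    errors = []
    if not model_name:
        errors.append("model_name must be non-empty")
    if not 0.0 <= temperature <= 2.0:
        errors.append("temperature must be in [0.0, 2.0]")
    if top_p is not None and not 0.0 <= top_p <= 1.0:
        errors.append("top_p must be in [0.0, 1.0]")
    if timeout <= 0:
        errors.append("timeout must be positive")
    if (response1 is None) != (response2 is None):  # exactly one of the pair is set
        errors.append("response1 and response2 must be set together")
    return errors
```

Returning a list rather than raising lets a caller report every problem in a configuration at once.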

Related Implementations

  • LLM Judge Base: Uses Request/Response/ServerConfig for evaluation
  • LLM Judge Factory: Accepts ServerConfig for provider creation
  • LLM Judge Utils: Operates on these protocol types
  • Provider implementations: Implement evaluation using these types

Best Practices

Configuration

  • Set temperature=0.0 to minimize run-to-run variation (hosted APIs may still not be fully deterministic)
  • Adjust max_concurrent based on rate limits
  • Use appropriate num_retries for reliability/latency tradeoff

Request Construction

  • Always provide config for reproducibility
  • Use structured fields (question, answer, prediction) over raw messages when possible
  • Include usage context in prompt_kwargs for debugging

Response Handling

  • Always check success field before using parsed_result
  • Log error_message for failed evaluations
  • Preserve raw_response for auditing

Type Safety

  • Use type hints in calling code
  • Run mypy for static type checking
  • Document expected types for custom fields (prompt_kwargs, evaluation_criteria)
