Implementation:EvolvingLMMs Lab Lmms eval LLM Judge Protocol
Overview
This implementation defines the protocol layer for LLM judge evaluation, providing type-safe data structures for requests, responses, and configuration. It establishes a standardized contract for communication between judge implementations and callers, using Python dataclasses with comprehensive type hints.
File Location
/tmp/kapso_repo_sslb_59s/lmms_eval/llm_judge/protocol.py (69 lines)
Dependencies
- dataclasses: Dataclass decorator and field utilities
- typing: Type hints (Any, Dict, List, Optional, Tuple, Union)
Constants
Retry Configuration
DEFAULT_NUM_RETRIES = 5
DEFAULT_RETRY_DELAY = 10 # seconds
Default retry behavior for API calls:
- DEFAULT_NUM_RETRIES: Number of retry attempts on transient failures
- DEFAULT_RETRY_DELAY: Delay between retry attempts in seconds
These values balance reliability (sufficient retries) with responsiveness (avoiding excessive delays).
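As a rough illustration of how these two constants might drive a retry loop, here is a minimal sketch. The helper name `call_with_retries`, the injectable `sleep` argument, and the choice to count `num_retries` as the total number of attempts are all assumptions made for this example; the actual judge implementations may count and catch errors differently.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

DEFAULT_NUM_RETRIES = 5
DEFAULT_RETRY_DELAY = 10  # seconds


def call_with_retries(
    fn: Callable[[], T],
    num_retries: int = DEFAULT_NUM_RETRIES,
    retry_delay: float = DEFAULT_RETRY_DELAY,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, making at most num_retries attempts (assumption: total attempts)."""
    last_error: Optional[Exception] = None
    for attempt in range(1, num_retries + 1):
        try:
            return fn()
        except Exception as exc:  # a real judge would catch only transient errors
            last_error = exc
            if attempt < num_retries:
                sleep(retry_delay)  # fixed delay; this sketch applies no backoff
    raise RuntimeError(f"failed after {num_retries} attempts") from last_error
```

Passing `sleep` as a parameter keeps the sketch easy to exercise in tests without waiting for real delays.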
Data Structures
ServerConfig
Configuration dataclass for judge models and API behavior.
@dataclass
class ServerConfig:
    """Configuration for judge models"""
    model_name: str
    temperature: float = 0.0
    max_tokens: int = 1024
    top_p: Optional[float] = None
    timeout: int = 60
    num_retries: int = DEFAULT_NUM_RETRIES
    retry_delay: float = DEFAULT_RETRY_DELAY
    max_concurrent: int = 10
    system_prompt: Optional[str] = None
    response_format: Optional[str] = None
    judge_type: str = "general"
    output_format: Optional[str] = None
    score_range: Optional[Tuple[float, float]] = None
    evaluation_criteria: Optional[Dict[str, Any]] = None
Core Model Parameters
model_name (str, required)
- Model identifier for API calls
- Examples: "gpt-4", "gpt-3.5-turbo", "gpt-4-turbo"
- Required field - must be specified
temperature (float, default: 0.0)
- Sampling temperature for model responses
- Range: 0.0 (deterministic) to 2.0 (very random)
- Default 0.0 for consistent, reproducible evaluation
- Higher values increase response variability
max_tokens (int, default: 1024)
- Maximum number of tokens in model response
- Controls response length and API cost
- Should accommodate expected evaluation output
top_p (Optional[float], default: None)
- Nucleus sampling parameter
- Range: 0.0 to 1.0
- Alternative to temperature for controlling randomness
- None uses API default
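One way a provider might turn these core parameters into request arguments is sketched below. The helper `to_api_kwargs` is hypothetical, the `ServerConfig` here is a trimmed stand-in for the real class, and the keyword names follow the common chat-completion convention as an assumption rather than a documented mapping.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ServerConfig:  # trimmed stand-in holding only the core model parameters
    model_name: str
    temperature: float = 0.0
    max_tokens: int = 1024
    top_p: Optional[float] = None


def to_api_kwargs(config: ServerConfig) -> Dict[str, Any]:
    """Build keyword arguments for a chat-completion style call (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "model": config.model_name,
        "temperature": config.temperature,
        "max_tokens": config.max_tokens,
    }
    if config.top_p is not None:
        # Only send top_p when it is set, so None falls through to the API default.
        kwargs["top_p"] = config.top_p
    return kwargs
```

Leaving `top_p` out of the dictionary, instead of sending `None`, is what lets the provider's own default apply.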
API Behavior Parameters
timeout (int, default: 60)
- Request timeout in seconds
- Prevents hanging on slow API responses
- Balance between allowing complex evaluations and failing fast
num_retries (int, default: 5)
- Number of retry attempts on transient failures
- Handles rate limits, network issues, temporary outages
- Higher values increase reliability but delay final failure
retry_delay (float, default: 10.0)
- Seconds to wait between retry attempts
- Base delay; implementations may apply exponential backoff on top of it
- Prevents overwhelming rate-limited APIs
max_concurrent (int, default: 10)
- Maximum concurrent requests for async providers
- Controls rate limiting and resource usage
- Higher values increase throughput but may hit rate limits
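The effect of max_concurrent can be illustrated with an asyncio.Semaphore. This is only a sketch of the general technique; `run_bounded` is a hypothetical helper, and the real async providers may enforce the limit in another way.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_bounded(
    jobs: List[Callable[[], Awaitable[str]]],
    max_concurrent: int = 10,
) -> List[str]:
    """Run all jobs, allowing at most max_concurrent to be in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(job: Callable[[], Awaitable[str]]) -> str:
        async with semaphore:  # waits here while max_concurrent jobs are running
            return await job()

    # gather preserves the order of the input jobs in its result list
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

A caller would wrap each judge request in a zero-argument coroutine function and pass the list to `asyncio.run(run_bounded(jobs, max_concurrent=config.max_concurrent))`.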
Response Configuration
system_prompt (Optional[str], default: None)
- System-level instruction for the judge model
- Prepended to messages if provided
- Can establish evaluation persona or constraints
response_format (Optional[str], default: None)
- Expected response format
- Values: "json", "text", or None
- Some APIs support structured output formats
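The "prepended to messages" behavior of system_prompt could look roughly like the following. The function name `with_system_prompt` is an assumption for illustration and is not taken from the source file.

```python
from typing import Any, Dict, List, Optional


def with_system_prompt(
    messages: List[Dict[str, Any]],
    system_prompt: Optional[str],
) -> List[Dict[str, Any]]:
    """Return a new message list with the system prompt first, if one is set."""
    if not system_prompt:
        return list(messages)  # copy, so the caller's list is never mutated
    return [{"role": "system", "content": system_prompt}, *messages]
```

Returning a new list keeps the original Request.messages unchanged, which matters if the same request is retried.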
Judge-Specific Parameters
judge_type (str, default: "general")
- Type of evaluation being performed
- Values: "general", "binary", "score", "comparative"
- Helps implementations optimize behavior
output_format (Optional[str], default: None)
- Format for binary evaluations
- Values: "0/1", "1/0", "yes/no"
- Affects prompt construction and parsing
score_range (Optional[Tuple[float, float]], default: None)
- Minimum and maximum scores for scoring judges
- Example: (1, 10) for 1-10 scale
- Used in comparative evaluation
evaluation_criteria (Optional[Dict[str, Any]], default: None)
- Custom evaluation criteria or rubric
- Flexible structure for domain-specific requirements
- Example: {"accuracy": "weight: 0.4", "clarity": "weight: 0.3"}
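To make the role of output_format and score_range in parsing more concrete, here is a small sketch. Both helpers are hypothetical, and the interpretation of the formats (1 means correct for both "0/1" and "1/0"; a score outside score_range is rejected) is an assumption, not behavior confirmed by the source.

```python
import re
from typing import Optional, Tuple


def parse_binary(content: str, output_format: str = "0/1") -> Optional[bool]:
    """Map a judge reply to True/False, or None if it cannot be interpreted."""
    text = content.strip().lower()
    if output_format == "yes/no":
        match = re.match(r"(yes|no)\b", text)
        return (match.group(1) == "yes") if match else None
    # Assumption: "0/1" and "1/0" are both read as 1 = correct, 0 = incorrect.
    match = re.match(r"[01]\b", text)
    return bool(int(match.group())) if match else None


def parse_score(content: str, score_range: Tuple[float, float]) -> Optional[float]:
    """Return the first number in the reply if it lies inside score_range."""
    match = re.search(r"-?\d+(?:\.\d+)?", content)
    if match is None:
        return None
    score = float(match.group())
    low, high = score_range
    return score if low <= score <= high else None
```

Returning None for unparseable replies lets the caller decide whether to retry or to record a failed evaluation.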
Usage Example
# Basic configuration
config = ServerConfig(model_name="gpt-4")

# Detailed configuration
config = ServerConfig(
    model_name="gpt-4-turbo",
    temperature=0.0,
    max_tokens=512,
    timeout=30,
    num_retries=3,
    retry_delay=5,
    max_concurrent=20,
    system_prompt="You are a strict evaluator.",
    response_format="json",
    judge_type="binary",
    output_format="0/1",
)
Request
Standard request format for judge evaluation.
@dataclass
class Request:
    """Standard request format for judge evaluation"""
    messages: List[Dict[str, Any]]
    images: Optional[List[Union[str, bytes]]] = None
    config: Optional[ServerConfig] = None
    question: Optional[str] = None
    answer: Optional[str] = None
    prediction: Optional[str] = None
    context: Optional[str] = None
    options: Optional[List[str]] = None
    response1: Optional[str] = None
    response2: Optional[str] = None
    custom_prompt: Optional[str] = None
    prompt_kwargs: Dict[str, Any] = field(default_factory=dict)
Core Request Fields
messages (List[Dict[str, Any]], required)
- List of message dictionaries in chat format
- Each message: {"role": "user"/"system"/"assistant", "content": "..."}
- Primary input for judge evaluation
- Example: [{"role": "user", "content": "Evaluate this response..."}]
images (Optional[List[Union[str, bytes]]], default: None)
- Image inputs for multimodal evaluation
- Each item is either a str (such as a file path) or bytes (image data)
- Supports visual question answering evaluation
config (Optional[ServerConfig], default: None)
- Per-request configuration override
- Falls back to judge's default config if None
- Allows request-specific parameters
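The fallback rule for config can be expressed in a few lines. The classes below are trimmed stand-ins for the real dataclasses, and `effective_config` is a hypothetical helper name used only for this sketch.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ServerConfig:  # trimmed stand-in
    model_name: str
    temperature: float = 0.0


@dataclass
class Request:  # trimmed stand-in
    messages: List[Dict[str, Any]]
    config: Optional[ServerConfig] = None


def effective_config(request: Request, default: ServerConfig) -> ServerConfig:
    """Prefer the per-request config; otherwise use the judge's default."""
    return request.config if request.config is not None else default
```

Note that this sketch swaps the whole config object; it does not merge individual fields from the override into the default.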
Structured Evaluation Fields
question (Optional[str])
- The question being evaluated
- Used in binary and comparative evaluation
answer (Optional[str])
- Ground truth answer
- Reference for correctness evaluation
prediction (Optional[str])
- Model's predicted answer
- Subject of evaluation
context (Optional[str])
- Additional context for evaluation
- Background information, constraints, etc.
options (Optional[List[str]])
- Answer choices for multiple-choice questions
- Example: ["A. Paris", "B. London", "C. Berlin"]
Comparative Evaluation Fields
response1 (Optional[str])
- First response for comparison
response2 (Optional[str])
- Second response for comparison
Custom Prompt Fields
custom_prompt (Optional[str])
- Override default prompt template
- Must include appropriate placeholders
prompt_kwargs (Dict[str, Any])
- Additional keyword arguments for prompt formatting
- Default: empty dict (via field(default_factory=dict))
- Allows dynamic prompt customization
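A sketch of how custom_prompt and prompt_kwargs might combine is shown below, using str.format placeholders. The default template text, the placeholder names, and the `build_prompt` helper are all assumptions for illustration; the real prompt templates may use different wording and placeholders.

```python
from typing import Any, Dict, Optional

DEFAULT_TEMPLATE = (  # hypothetical default template
    "Question: {question}\nReference: {answer}\nPrediction: {prediction}\n"
    "Reply with 1 if the prediction is correct, otherwise 0."
)


def build_prompt(
    question: str,
    answer: str,
    prediction: str,
    custom_prompt: Optional[str] = None,
    prompt_kwargs: Optional[Dict[str, Any]] = None,
) -> str:
    """Fill the custom template if one is given, else the default template."""
    template = custom_prompt or DEFAULT_TEMPLATE
    values = {"question": question, "answer": answer, "prediction": prediction}
    values.update(prompt_kwargs or {})  # extra placeholders; may override the above
    # str.format ignores unused keys and raises KeyError for missing placeholders.
    return template.format(**values)
```

For example, a template such as "[{language}] Q: {question} | gold: {answer} | pred: {prediction}" would need prompt_kwargs={"language": "en"} to format without a KeyError.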
Usage Example
# Binary evaluation request
request = Request(
    messages=[{"role": "user", "content": "Evaluate..."}],
    question="What is 2+2?",
    answer="4",
    prediction="Four",
    config=config,
)

# Comparative evaluation request
request = Request(
    messages=[{"role": "user", "content": "Compare..."}],
    question="Explain quantum computing",
    response1="Response from Model A",
    response2="Response from Model B",
    context="For a general audience",
    config=config,
)

# Multimodal request
request = Request(
    messages=[{"role": "user", "content": "Evaluate this image description"}],
    images=["path/to/image.jpg"],
    question="What is in the image?",
    answer="A cat",
    prediction="A cat sitting on a couch",
)
Response
Standard response format from judge evaluation.
@dataclass
class Response:
    """Standard response format from judge evaluation"""
    content: str
    model_used: str
    usage: Optional[Dict[str, int]] = None
    raw_response: Optional[Any] = None
    parsed_result: Optional[Union[int, float, bool, Tuple[float, float], Dict[str, Any]]] = None
    success: bool = True
    error_message: Optional[str] = None
Core Response Fields
content (str, required)
- Raw text content from judge model
- Complete response including reasoning (if any)
- Subject to further parsing
model_used (str, required)
- Identifier of the model that generated the response
- May differ from requested model (e.g., fallback scenarios)
- Important for tracking evaluation provenance
API Metadata
usage (Optional[Dict[str, int]], default: None)
- Token usage statistics
- Typical structure: {"prompt_tokens": 100, "completion_tokens": 50, "total_tokens": 150}
- Useful for cost tracking and monitoring
raw_response (Optional[Any], default: None)
- Complete raw API response object
- Preserved for debugging and auditing
- May contain additional API-specific metadata
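For cost tracking, the usage dictionaries from many responses can be summed key by key. The helper `total_usage` below is a hypothetical example that assumes the typical three-key structure shown above but works for any integer-valued keys.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def total_usage(usages: Iterable[Optional[Dict[str, int]]]) -> Dict[str, int]:
    """Sum token counts across responses, skipping responses without usage."""
    totals: Counter = Counter()
    for usage in usages:
        if usage:
            totals.update(usage)  # Counter.update adds counts per key
    return dict(totals)
```

A caller might pass `(r.usage for r in responses)` to get totals for an entire evaluation run.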
Parsed Results
parsed_result (Optional[Union[int, float, bool, Tuple[float, float], Dict[str, Any]]], default: None)
- Structured result extracted from content
- Type depends on evaluation type:
  - Binary: int (0/1) or bool
  - Score: float
  - Comparative: Tuple[float, float] (two scores)
  - Rubric: Dict[str, Any] (criterion -> score mapping)
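As an example of producing a comparative parsed_result, the sketch below reads two scores from the first line of the judge's reply. The first-line convention and the helper name `parse_score_pair` are assumptions; the actual parsing utilities may accept other layouts.

```python
from typing import Optional, Tuple


def parse_score_pair(content: str) -> Optional[Tuple[float, float]]:
    """Read two scores from the first line of content, or return None."""
    lines = content.strip().splitlines()
    if not lines:
        return None
    # Accept "8 7" as well as "8, 7" by treating commas as whitespace.
    parts = lines[0].replace(",", " ").split()
    if len(parts) != 2:
        return None
    try:
        return float(parts[0]), float(parts[1])
    except ValueError:
        return None
```

When parsing fails, a judge could set success=False and put the reason in error_message rather than storing a partial result.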
Status Fields
success (bool, default: True)
- Whether evaluation completed successfully
- False indicates error or failure
error_message (Optional[str], default: None)
- Error description if success=False
- Helpful for debugging and error handling
Usage Example
# Successful binary evaluation response
response = Response(
    content="1",
    model_used="gpt-4",
    usage={"prompt_tokens": 120, "completion_tokens": 1, "total_tokens": 121},
    parsed_result=1,
    success=True,
)

# Comparative evaluation response
response = Response(
    content="8 7\nAssistant 1 provided more detail...",
    model_used="gpt-4-turbo",
    parsed_result=(8.0, 7.0),
    success=True,
)

# Failed evaluation response
response = Response(
    content="",
    model_used="gpt-4",
    success=False,
    error_message="API rate limit exceeded",
)
Design Principles
Type Safety
All fields have explicit type hints, enabling:
- Static type checking with mypy
- IDE autocomplete and inline documentation
- Early detection of type errors
- Self-documenting interfaces
Immutability
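The dataclasses are declared with a plain @dataclass decorator, so instances are mutable by default; enforced immutability would require @dataclass(frozen=True). Treating instances as read-only after creation promotes: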
- Thread-safe usage
- Predictable behavior
- Easier debugging (no unexpected mutations)
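The standard library behavior behind this can be checked with a short sketch. `FrozenConfig` is a hypothetical stand-in, not the real ServerConfig; it shows that frozen=True is what blocks attribute assignment, and that dataclasses.replace produces a modified copy without touching the original.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class FrozenConfig:  # hypothetical stand-in, not the real ServerConfig
    model_name: str
    temperature: float = 0.0


base = FrozenConfig(model_name="gpt-4")
warmer = replace(base, temperature=0.3)  # new object; base is left untouched

try:
    base.temperature = 1.0  # type: ignore[misc]
    mutated = True
except FrozenInstanceError:
    mutated = False  # assignment is rejected on a frozen dataclass
```

dataclasses.replace also works on ordinary (non-frozen) dataclasses, so the copy-instead-of-mutate style can be followed by convention.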
Optional Fields with Sensible Defaults
Most fields are optional with reasonable defaults:
- Reduces boilerplate for simple cases
- Allows gradual complexity increase
- Backwards compatible with simple usage patterns
Separation of Concerns
Three distinct data structures:
- ServerConfig: How to call the API
- Request: What to evaluate
- Response: Evaluation results
Flexibility
Support for multiple evaluation paradigms:
- Binary (question/answer/prediction)
- Comparative (response1/response2)
- Rubric-based (evaluation_criteria)
- Custom (prompt_kwargs, custom_prompt)
Extensibility
Dict fields (prompt_kwargs, evaluation_criteria, usage) allow:
- Domain-specific extensions
- Future field additions without breaking changes
- Custom metadata
Integration Patterns
Request Construction
from lmms_eval.llm_judge.protocol import Request, ServerConfig

config = ServerConfig(model_name="gpt-4", temperature=0.0)
request = Request(
    messages=[{"role": "user", "content": prompt}],
    question=q,
    answer=a,
    prediction=p,
    config=config,
)
Response Processing
response = judge.evaluate(request)
if response.success:
    print(f"Result: {response.parsed_result}")
    print(f"Model: {response.model_used}")
    print(f"Tokens: {response.usage}")
else:
    print(f"Error: {response.error_message}")
Configuration Management
# Default config for all requests
default_config = ServerConfig(
    model_name="gpt-4",
    temperature=0.0,
    max_tokens=512,
)

# Override for specific request
special_config = ServerConfig(
    model_name="gpt-4-turbo",
    temperature=0.3,  # Slightly more creative
    max_tokens=2048,
)
request = Request(messages=[...], config=special_config)
Validation Considerations
The dataclasses perform no runtime validation beyond requiring their non-default fields at construction, so implementations should validate:
- Required fields are present (model_name, messages, content, model_used)
- Value ranges (temperature, top_p, timeout)
- Consistency (if response1 is set, response2 should be too)
- Format expectations (output_format matches parsing logic)
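The checks listed above could be collected into a single function, sketched here. The classes are trimmed stand-ins with only the fields being checked, `validate` is a hypothetical helper, and the numeric bounds simply reuse the ranges stated earlier in this document.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ServerConfig:  # trimmed stand-in with only the fields checked below
    model_name: str
    temperature: float = 0.0
    top_p: Optional[float] = None
    timeout: int = 60


@dataclass
class Request:  # trimmed stand-in
    messages: List[Dict[str, Any]]
    response1: Optional[str] = None
    response2: Optional[str] = None


def validate(config: ServerConfig, request: Request) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems: List[str] = []
    if not config.model_name:
        problems.append("model_name must be non-empty")
    if not 0.0 <= config.temperature <= 2.0:
        problems.append("temperature must be within [0.0, 2.0]")
    if config.top_p is not None and not 0.0 <= config.top_p <= 1.0:
        problems.append("top_p must be within [0.0, 1.0]")
    if config.timeout <= 0:
        problems.append("timeout must be positive")
    if not request.messages:
        problems.append("messages must contain at least one message")
    if (request.response1 is None) != (request.response2 is None):
        problems.append("response1 and response2 must be set together")
    return problems
```

Collecting all problems in a list, instead of raising on the first one, makes it easier to report every misconfiguration in a single pass.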
Related Implementations
- LLM Judge Base: Uses Request/Response/ServerConfig for evaluation
- LLM Judge Factory: Accepts ServerConfig for provider creation
- LLM Judge Utils: Operates on these protocol types
- Provider implementations: Implement evaluation using these types
Best Practices
Configuration
- Set temperature=0.0 for deterministic evaluation
- Adjust max_concurrent based on rate limits
- Use appropriate num_retries for reliability/latency tradeoff
Request Construction
- Always provide config for reproducibility
- Use structured fields (question, answer, prediction) over raw messages when possible
- Include usage context in prompt_kwargs for debugging
Response Handling
- Always check success field before using parsed_result
- Log error_message for failed evaluations
- Preserve raw_response for auditing
Type Safety
- Use type hints in calling code
- Run mypy for static type checking
- Document expected types for custom fields (prompt_kwargs, evaluation_criteria)