Implementation:EvolvingLMMs Lab Lmms eval Alpaca Audio Utils
File: lmms_eval/tasks/alpaca_audio/utils.py (131 lines)
Principle: Task Utility Functions
Overview
Utility functions for the Alpaca Audio evaluation task. Uses GPT-based evaluation with a 0-5 rating scale to assess audio understanding and instruction following. Supports both OpenAI and Azure API endpoints.
Configuration
Loads configuration from alpaca_audio.yaml and environment variables:
MODEL_VERSION- GPT model (default: "gpt-4o-2024-11-20")API_TYPE- "openai" or "azure" (default: "azure")OPENAI_API_URL/AZURE_ENDPOINT- API endpointsOPENAI_API_KEY/AZURE_API_KEY- AuthenticationNUM_SECONDS_TO_SLEEP- 5 seconds between retriesretries- 3 retry attempts
Key Functions
doc_to_audio
def doc_to_audio(doc)
Extracts audio context from document.
Parameters:
doc- Document with "context" field
Returns: List containing audio data
doc_to_text
def doc_to_text(doc, lmms_eval_specific_kwargs)
Formats text prompt using pre and post prompts.
Parameters:
doc- Document instancelmms_eval_specific_kwargs- Dictionary with pre_prompt and post_prompt
Returns: Concatenated prompt string
get_eval
def get_eval(max_tokens: int, content: str, retries: int = retries)
Calls GPT API for evaluation with retry logic.
Parameters:
max_tokens- Maximum response length (typically 1024)content- Evaluation prompt with question, ground truth, and model responseretries- Number of retry attempts (default: 3)
Returns: Tuple of (evaluation response, model name)
API Settings:
- Temperature: 0.7
- top_p: 0.95
- frequency_penalty: 0
- presence_penalty: 0
- Timeout: 60 seconds
Error Handling:
- Retries with 5-second delays
- Logs attempt failures
- Returns empty strings after all retries fail
alpaca_audio_process_results
def alpaca_audio_process_results(doc, result)
Processes model results using GPT evaluation.
Parameters:
doc- Document with "answer" and "speech_instruction" fieldsresult- Model's response
Returns: Dictionary with "gpt_eval" containing eval_answer and model_name
Evaluation Process:
- Formats prompt with question, reference answer, and model answer
- Calls GPT API with max 1024 tokens
- Returns evaluation and model identifier
alpaca_audio_aggregate_results
def alpaca_audio_aggregate_results(results)
Aggregates evaluation scores across all results.
Parameters:
results- List of evaluation dictionaries with eval_answer field
Returns: Average score scaled to 0-100 (multiplied by 20)
Score Extraction:
- Uses regex to find single digit 0-5 in eval_answer
- Converts to float
- Defaults to 0.0 on parsing errors
- Logs errors for debugging
Evaluation Prompt
The eval_prompt template assesses model responses on:
- Alignment with reference answer
- Accuracy of content
- Relevance to the question
- Critical evaluation of details
Scoring Criteria:
- Score 0: Completely misaligned, incorrect or irrelevant
- Score 1: Minimal alignment, often misunderstanding or irrelevant
- Score 2: Recognizes topic but diverges significantly
- Score 3: Generally aligned but lacks detail or precision
- Score 4: Mostly accurate and relevant, closely following reference
- Score 5: Highly accurate, detailed, perfect match with reference
Output Format:
Explanation: (Concise comparison: "The reference answer is [XXX], while the model's answer is [YYY]. I think ...") Rating: (int)
Usage Pattern
- Extract audio context from document
- Format prompt with pre/post text
- Get model's response to audio + instruction
- Call GPT evaluator with question, reference, and prediction
- Parse score (0-5) from evaluation
- Scale to 0-100 for final metric
Dependencies
os, re, time- Standard libraryrequests- HTTP API callsyaml- Configuration loadingloguru- Logging (eval_logger)