Implementation:EvolvingLMMs Lab Lmms eval Alpaca Audio Utils

File: lmms_eval/tasks/alpaca_audio/utils.py (131 lines)

Overview

Utility functions for the Alpaca Audio evaluation task. Uses GPT-based evaluation with a 0-5 rating scale to assess audio understanding and instruction following. Supports both OpenAI and Azure API endpoints.

Configuration

Loads configuration from alpaca_audio.yaml and environment variables:

MODEL_VERSION - GPT model (default: "gpt-4o-2024-11-20")
API_TYPE - "openai" or "azure" (default: "azure")
OPENAI_API_URL / AZURE_ENDPOINT - API endpoints
OPENAI_API_KEY / AZURE_API_KEY - Authentication
NUM_SECONDS_TO_SLEEP - 5 seconds between retries
retries - 3 retry attempts

Key Functions

doc_to_audio

def doc_to_audio(doc)

Extracts audio context from document.

Parameters:

doc - Document with "context" field

Returns: List containing audio data

doc_to_text

def doc_to_text(doc, lmms_eval_specific_kwargs)

Formats text prompt using pre and post prompts.

Parameters:

doc - Document instance
lmms_eval_specific_kwargs - Dictionary with pre_prompt and post_prompt

Returns: Concatenated prompt string

get_eval

def get_eval(max_tokens: int, content: str, retries: int = retries)

Calls GPT API for evaluation with retry logic.

Parameters:

max_tokens - Maximum response length (typically 1024)
content - Evaluation prompt with question, ground truth, and model response
retries - Number of retry attempts (default: 3)

Returns: Tuple of (evaluation response, model name)

API Settings:

Temperature: 0.7
top_p: 0.95
frequency_penalty: 0
presence_penalty: 0
Timeout: 60 seconds

Error Handling:

Retries with 5-second delays
Logs attempt failures
Returns empty strings after all retries fail

alpaca_audio_process_results

def alpaca_audio_process_results(doc, result)

Processes model results using GPT evaluation.

Parameters:

doc - Document with "answer" and "speech_instruction" fields
result - Model's response

Returns: Dictionary with "gpt_eval" containing eval_answer and model_name

Evaluation Process:

Formats prompt with question, reference answer, and model answer
Calls GPT API with max 1024 tokens
Returns evaluation and model identifier

alpaca_audio_aggregate_results

def alpaca_audio_aggregate_results(results)

Aggregates evaluation scores across all results.

Parameters:

results - List of evaluation dictionaries with eval_answer field

Returns: Average score scaled to 0-100 (multiplied by 20)

Score Extraction:

Uses regex to find single digit 0-5 in eval_answer
Converts to float
Defaults to 0.0 on parsing errors
Logs errors for debugging

Evaluation Prompt

The eval_prompt template assesses model responses on:

Alignment with reference answer
Accuracy of content
Relevance to the question
Critical evaluation of details

Scoring Criteria:

Score 0: Completely misaligned, incorrect or irrelevant
Score 1: Minimal alignment, often misunderstanding or irrelevant
Score 2: Recognizes topic but diverges significantly
Score 3: Generally aligned but lacks detail or precision
Score 4: Mostly accurate and relevant, closely following reference
Score 5: Highly accurate, detailed, perfect match with reference

Output Format:

Explanation: (Concise comparison: "The reference answer is [XXX], while the model's answer is [YYY]. I think ...")
Rating: (int)

Usage Pattern

Extract audio context from document
Format prompt with pre/post text
Get model's response to audio + instruction
Call GPT evaluator with question, reference, and prediction
Parse score (0-5) from evaluation
Scale to 0-100 for final metric

Dependencies

os, re, time - Standard library
requests - HTTP API calls
yaml - Configuration loading
loguru - Logging (eval_logger)

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment