Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:EvolvingLMMs Lab Lmms eval Alpaca Audio Utils

From Leeroopedia

File: lmms_eval/tasks/alpaca_audio/utils.py (131 lines)

Principle: Task Utility Functions

Overview

Utility functions for the Alpaca Audio evaluation task. Uses GPT-based evaluation with a 0-5 rating scale to assess audio understanding and instruction following. Supports both OpenAI and Azure API endpoints.

Configuration

Loads configuration from alpaca_audio.yaml and environment variables:

  • MODEL_VERSION - GPT model (default: "gpt-4o-2024-11-20")
  • API_TYPE - "openai" or "azure" (default: "azure")
  • OPENAI_API_URL / AZURE_ENDPOINT - API endpoints
  • OPENAI_API_KEY / AZURE_API_KEY - Authentication
  • NUM_SECONDS_TO_SLEEP - 5 seconds between retries
  • retries - 3 retry attempts

Key Functions

doc_to_audio

def doc_to_audio(doc)

Extracts audio context from document.

Parameters:

  • doc - Document with "context" field

Returns: List containing audio data

doc_to_text

def doc_to_text(doc, lmms_eval_specific_kwargs)

Formats text prompt using pre and post prompts.

Parameters:

  • doc - Document instance
  • lmms_eval_specific_kwargs - Dictionary with pre_prompt and post_prompt

Returns: Concatenated prompt string

get_eval

def get_eval(max_tokens: int, content: str, retries: int = retries)

Calls GPT API for evaluation with retry logic.

Parameters:

  • max_tokens - Maximum response length (typically 1024)
  • content - Evaluation prompt with question, ground truth, and model response
  • retries - Number of retry attempts (default: 3)

Returns: Tuple of (evaluation response, model name)

API Settings:

  • Temperature: 0.7
  • top_p: 0.95
  • frequency_penalty: 0
  • presence_penalty: 0
  • Timeout: 60 seconds

Error Handling:

  • Retries with 5-second delays
  • Logs attempt failures
  • Returns empty strings after all retries fail

alpaca_audio_process_results

def alpaca_audio_process_results(doc, result)

Processes model results using GPT evaluation.

Parameters:

  • doc - Document with "answer" and "speech_instruction" fields
  • result - Model's response

Returns: Dictionary with "gpt_eval" containing eval_answer and model_name

Evaluation Process:

  1. Formats prompt with question, reference answer, and model answer
  2. Calls GPT API with max 1024 tokens
  3. Returns evaluation and model identifier

alpaca_audio_aggregate_results

def alpaca_audio_aggregate_results(results)

Aggregates evaluation scores across all results.

Parameters:

  • results - List of evaluation dictionaries with eval_answer field

Returns: Average score scaled to 0-100 (multiplied by 20)

Score Extraction:

  • Uses regex to find single digit 0-5 in eval_answer
  • Converts to float
  • Defaults to 0.0 on parsing errors
  • Logs errors for debugging

Evaluation Prompt

The eval_prompt template assesses model responses on:

  • Alignment with reference answer
  • Accuracy of content
  • Relevance to the question
  • Critical evaluation of details

Scoring Criteria:

  • Score 0: Completely misaligned, incorrect or irrelevant
  • Score 1: Minimal alignment, often misunderstanding or irrelevant
  • Score 2: Recognizes topic but diverges significantly
  • Score 3: Generally aligned but lacks detail or precision
  • Score 4: Mostly accurate and relevant, closely following reference
  • Score 5: Highly accurate, detailed, perfect match with reference

Output Format:

Explanation: (Concise comparison: "The reference answer is [XXX], while the model's answer is [YYY]. I think ...")
Rating: (int)

Usage Pattern

  1. Extract audio context from document
  2. Format prompt with pre/post text
  3. Get model's response to audio + instruction
  4. Call GPT evaluator with question, reference, and prediction
  5. Parse score (0-5) from evaluation
  6. Scale to 0-100 for final metric

Dependencies

  • os, re, time - Standard library
  • requests - HTTP API calls
  • yaml - Configuration loading
  • loguru - Logging (eval_logger)

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment