Implementation:EvolvingLMMs Lab Lmms eval WavCaps Utils

File: `lmms_eval/tasks/wavcaps/utils.py`

Overview

The WavCaps utility module provides task-specific functions for the WavCaps audio captioning benchmark. It includes document transformation functions, GPT-based evaluation using external APIs (OpenAI or Azure), and result processing and aggregation functions for audio-to-text generation tasks.

Key Components

Document Transformation

def wavcaps_doc_to_audio(doc):
    return [doc["context"]]

def wavcaps_doc_to_text(doc, lmms_eval_specific_kwargs):
    question = doc["instruction"]
    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
    return f"{pre_prompt}{question}{post_prompt}"

These functions extract audio content and construct prompts from document dictionaries, following the standard task utility interface.
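As a quick illustration of the two transformation functions, the sketch below calls them on a made-up document. The field values, the shape of `context`, and the prompt strings are assumptions chosen for the example, not data from the WavCaps dataset.

```python
def wavcaps_doc_to_audio(doc):
    # The audio payload is wrapped in a single-element list.
    return [doc["context"]]


def wavcaps_doc_to_text(doc, lmms_eval_specific_kwargs):
    question = doc["instruction"]
    pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
    post_prompt = lmms_eval_specific_kwargs["post_prompt"]
    return f"{pre_prompt}{question}{post_prompt}"


# Hypothetical document; the real "context" holds the audio sample.
sample_doc = {
    "context": {"array": [0.0, 0.1, -0.1], "sampling_rate": 16000},
    "instruction": "Describe the audio.",
}
# Hypothetical prompt affixes, normally supplied by the task YAML.
sample_kwargs = {"pre_prompt": "", "post_prompt": " Answer in one sentence."}

audio = wavcaps_doc_to_audio(sample_doc)
prompt = wavcaps_doc_to_text(sample_doc, sample_kwargs)
print(prompt)
```

The prompt is a plain concatenation, so any spacing between the question and the affixes has to come from the affix strings themselves.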

GPT-based Evaluation

def get_eval(max_tokens: int, content: str):
    messages = [
        {"role": "user", "content": content},
    ]
    payload = {
        "model": GPT_EVAL_MODEL_NAME,
        "messages": messages,
        "temperature": 0,
        "max_tokens": max_tokens,
        "n": 1
    }

    for attempt in range(5):
        try:
            response = requests.post(API_URL, headers=headers, json=payload, timeout=60)
            response.raise_for_status()
            response_data = response.json()
            content = response_data["choices"][0]["message"]["content"].strip()
            if content != "":
                return content, response_data["model"]
            break
        except Exception as e:
            eval_logger.info(f"Attempt {attempt + 1} failed with error: {e}")
            if attempt < 5:
                time.sleep(NUM_SECONDS_TO_SLEEP)
            else:
                eval_logger.error(f"All 5 attempts failed. Last error message: {e}")
                return "", ""
    return "", ""

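The evaluation function posts a chat-completion request to the configured endpoint and retries up to five times when the request raises, sleeping `NUM_SECONDS_TO_SLEEP` seconds after each failure. It returns the stripped reply together with the model name reported by the API, or a pair of empty strings on failure; an empty reply ends the loop without a retry. As written, the `attempt < 5` check is always true inside `range(5)`, so the `else` branch and its "All 5 attempts failed" log message are never reached. Both OpenAI and Azure endpoints are supported through the `API_URL` and `headers` globals.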
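The retry loop in `get_eval` can be looked at in isolation from the HTTP call. The following is a minimal sketch of a bounded retry with a fixed delay, not the module's code: `call_with_retries` and `flaky` are hypothetical names, the sleep function is injectable so the example runs instantly, and the sketch sleeps only when another attempt remains.

```python
import time
from typing import Callable, List


def call_with_retries(
    fn: Callable[[], str],
    attempts: int = 5,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fn until it returns without raising, at most `attempts` times."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            print(f"Attempt {attempt + 1} failed with error: {exc}")
            # Wait only if another attempt will follow.
            if attempt < attempts - 1:
                sleep(delay)
    # Every attempt raised: signal failure with an empty string.
    return ""


calls: List[int] = []
naps: List[float] = []


def flaky() -> str:
    # Hypothetical stand-in for the API call: fails twice, then succeeds.
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")
    return "Rating: 4"


# Record the requested delays instead of actually sleeping.
result = call_with_retries(flaky, sleep=naps.append)
```

Injecting `sleep` keeps the example fast and makes the delay schedule observable; with the default `time.sleep` the helper would pause for `delay` seconds between attempts.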

Configuration

NUM_SECONDS_TO_SLEEP = 5
GPT_EVAL_MODEL_NAME = os.getenv("MODEL_VERSION", "gpt-4o-2024-11-20")
API_TYPE = os.getenv("API_TYPE", "azure")

if API_TYPE == "openai":
    API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions")
    API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY")
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
elif API_TYPE == "azure":
    API_URL = os.getenv("AZURE_ENDPOINT", "https://api.cognitive.microsoft.com/sts/v1.0/issueToken")
    API_KEY = os.getenv("AZURE_API_KEY", "YOUR_API_KEY")
    headers = {
        "api-key": API_KEY,
        "Content-Type": "application/json",
    }

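Configuration is read from environment variables at import time and supports both OpenAI and Azure API backends. `API_TYPE` defaults to `azure`; any value other than `openai` or `azure` leaves `API_URL` and `headers` undefined.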
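The same selection logic can be sketched as a function over an explicit mapping, which makes it easy to try without touching the real environment. `build_api_config` is a hypothetical helper, not part of the module, and raising `ValueError` for an unknown `API_TYPE` is an assumption added for the sketch; the variable names and default URLs are the ones listed above.

```python
from typing import Dict, Mapping, Tuple


def build_api_config(env: Mapping[str, str]) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) for the backend selected by API_TYPE."""
    api_type = env.get("API_TYPE", "azure")
    if api_type == "openai":
        url = env.get("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions")
        key = env.get("OPENAI_API_KEY", "YOUR_API_KEY")
        headers = {"Authorization": f"Bearer {key}", "Content-Type": "application/json"}
    elif api_type == "azure":
        url = env.get("AZURE_ENDPOINT", "https://api.cognitive.microsoft.com/sts/v1.0/issueToken")
        key = env.get("AZURE_API_KEY", "YOUR_API_KEY")
        headers = {"api-key": key, "Content-Type": "application/json"}
    else:
        # Assumption for this sketch: fail loudly on an unknown backend.
        raise ValueError(f"Unsupported API_TYPE: {api_type!r}")
    return url, headers


# Placeholder key for illustration only.
url, headers = build_api_config({"API_TYPE": "openai", "OPENAI_API_KEY": "sk-test"})
print(url)
```

In real use the mapping would be `os.environ`; passing a plain dict keeps the example self-contained.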

Evaluation Prompt

The module uses a structured evaluation prompt that instructs the GPT model to rate responses on a 0-5 scale:

eval_prompt = """
            [Question]
            {question}

            [Reference Answer]
            {ground_truth}

            [Model Answer]
            {model_response}

            [Task]
            Rate the model's answer based on its alignment with the reference answer, focusing on accuracy and relevance to the reference provided. Please be critical on the details.
            Criteria: Assess if the model's response mirrors the reference in terms of content, accuracy, and relevance.
            Score0: The answer is completely misaligned, providing incorrect or irrelevant information compared to the reference.
            Score1: The answer shows minimal alignment, often misunderstanding or providing irrelevant details unrelated to the reference.
            Score2: The answer recognizes the topic but diverges significantly from the reference in accuracy or relevance.
            Score3: The answer aligns with the reference generally but lacks detail or precise accuracy in some aspects.
            Score4: The answer is mostly accurate and relevant, closely following the reference but could be clearer or more detailed.
            Score5: The answer is highly accurate, detailed, and matches the reference answer perfectly, capturing its essence and detail.

            Your response should be formatted as follows:
            Explanation: (Provide a concise explanation of your rating, comparing the reference answer with the model's response. "The reference answer is [XXX], while the model's answer is [YYY]. I think ...")
            Rating: (int)"""
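The template is filled with `str.format` using three named fields. The sketch below uses a shortened stand-in for the prompt (`short_prompt` is hypothetical, as are the sample strings) to show that only the template is parsed for placeholders, so braces inside a model response are passed through unchanged.

```python
# Abbreviated stand-in for eval_prompt with the same three named fields.
short_prompt = """[Question]
{question}

[Reference Answer]
{ground_truth}

[Model Answer]
{model_response}

Rating: (int)"""

content = short_prompt.format(
    question="What do you hear?",
    ground_truth="A dog barks twice.",
    # Braces in the substituted value are not treated as placeholders.
    model_response="A dog is barking {loudly}.",
)
print(content)
```

A literal brace in the template itself would need to be doubled (`{{` or `}}`) to survive formatting.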

Result Processing

def wavcaps_process_results(doc, results):
    pred = results[0]
    ground_truth_str = doc["answer"]
    content = eval_prompt.format(
        model_response=pred,
        ground_truth=ground_truth_str,
        question=doc["instruction"]
    )
    eval_answer, model_name = get_eval(max_tokens=1024, content=content)
    return {
        "gpt_eval": {"eval_answer": eval_answer, "model_name": model_name},
    }

def wavcaps_aggregate_results(results):
    score = 0
    for result in results:
        eval_answer = result["eval_answer"]
        try:
            match = re.search(r"Rating:\s*([0-5])\s*$", eval_answer)
            eval_score = match.group(1) if match else 0
            eval_score = float(eval_score)
        except Exception as e:
            eval_logger.error(f"Error parsing eval_score: {e}")
            eval_score = 0.0
        score += eval_score
    return score / len(results)

The processing function sends each prediction to GPT for evaluation. The aggregation function extracts the rating from each GPT response with a regex anchored to the end of the response and returns the mean score; a response without a parseable rating counts as 0.
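To make the parsing behaviour concrete, the sketch below applies the same regex to a few made-up judge replies. `parse_rating` and `mean_rating` are hypothetical helpers, the sample replies are invented, and the empty-list guard is an assumption that the source function does not have.

```python
import re
from typing import Dict, List

# Same pattern as in wavcaps_aggregate_results: the rating must close the reply.
RATING_PATTERN = re.compile(r"Rating:\s*([0-5])\s*$")


def parse_rating(eval_answer: str) -> float:
    """Return the trailing 0-5 rating, or 0.0 when none is found."""
    match = RATING_PATTERN.search(eval_answer)
    return float(match.group(1)) if match else 0.0


def mean_rating(results: List[Dict[str, str]]) -> float:
    # Assumption for this sketch: an empty result list scores 0.0.
    if not results:
        return 0.0
    return sum(parse_rating(r["eval_answer"]) for r in results) / len(results)


sample_results = [
    {"eval_answer": "Explanation: close match.\nRating: 4", "model_name": "judge"},
    {"eval_answer": "Explanation: partial match.\nRating: 3\n", "model_name": "judge"},
    # Text after the rating defeats the end-of-string anchor, so this scores 0.
    {"eval_answer": "Rating: 5\nThanks for asking.", "model_name": "judge"},
    # A failed API call yields an empty answer, which also scores 0.
    {"eval_answer": "", "model_name": ""},
]

mean = mean_rating(sample_results)
print(mean)
```

Because unparseable and failed evaluations are scored as 0 rather than skipped, they lower the reported mean.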

YAML Config Loading

with open(Path(__file__).parent / "wavcaps.yaml", "r") as f:
    raw_data = f.readlines()
    safe_data = []
    for i, line in enumerate(raw_data):
        # remove function definition since yaml load cannot handle it
        if "!function" not in line:
            safe_data.append(line)
    config = yaml.safe_load("".join(safe_data))

The module loads the YAML configuration file while filtering out function definitions that cannot be parsed by standard YAML loaders.
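The filtering step can be tried on an in-memory string. The YAML content below is a made-up fragment, not the actual `wavcaps.yaml`; the sketch only shows which lines survive the `!function` filter before the text would be handed to `yaml.safe_load`.

```python
# Hypothetical task config fragment with one custom !function tag.
sample_yaml = """task: wavcaps
doc_to_audio: !function utils.wavcaps_doc_to_audio
metadata:
  version: 0.0
"""

# Keep line endings so the filtered lines can be joined back unchanged.
safe_lines = [
    line
    for line in sample_yaml.splitlines(keepends=True)
    if "!function" not in line
]
safe_text = "".join(safe_lines)
print(safe_text)
```

The filter drops whole lines, so any key whose value is a `!function` reference is absent from the loaded config.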

Design Patterns

External API Evaluation

The module demonstrates an external LLM-as-judge evaluation pattern where GPT models rate the quality of generated captions against reference answers, rather than using traditional metrics.

Retry Mechanism

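API calls are retried up to five times with a fixed delay of `NUM_SECONDS_TO_SLEEP` (5 seconds) after each failed attempt to handle transient network failures. The delay does not grow between attempts, so this is a fixed-interval retry rather than exponential backoff.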

Multi-Provider Support

The configuration system supports both OpenAI and Azure API endpoints through environment variables, making it flexible for different deployment scenarios.

Dependencies

os: Environment variable access for configuration
time: Delay between retry attempts
requests: HTTP library for API calls
yaml: YAML parsing for configuration (PyYAML)
loguru: Logging via eval_logger
re: Regular expression matching for score extraction
pathlib: File path handling

Related Components

Task configuration: `lmms_eval/tasks/wavcaps/wavcaps.yaml`
Principle: Task_Utility_Functions
