Implementation:EvolvingLMMs Lab Lmms eval whisper tt

Knowledge Sources	EvolvingLMMs_Lab_Lmms_eval
Domains	Speech Recognition, Audio Processing, HTTP API Client
Last Updated	2026-02-14 00:00 GMT

Overview

HTTP API client for evaluating Whisper audio transcription models via a tt-media-server backend.

Description

This module implements WhisperTT, a model wrapper that sends HTTP requests to a tt-media-server instead of executing directly on ttnn/tt-metal. Evaluations can therefore run outside Docker containers while still using TT-NN hardware acceleration. The implementation encodes audio as base64 WAV, issues asynchronous batch transcription requests, retries failed requests for robustness, and integrates with the lmms-eval framework's distributed evaluation system via Accelerator.

Usage

Use this model wrapper when evaluating Whisper models (e.g., whisper-large-v3) on audio transcription tasks, running evaluations without Docker/TT-NN dependencies, or leveraging TT-NN hardware acceleration through a remote API endpoint. Set the OPENAI_API_BASE environment variable to point to your tt-media-server instance.
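The endpoint lookup described above can be sketched as follows. This is a minimal illustration, not the module's code: resolve_base_url is a hypothetical helper name, and returning None when neither source is set is an assumption.

```python
import os
from typing import Optional


def resolve_base_url(base_url: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: pick the tt-media-server endpoint.

    An explicit base_url argument wins; otherwise the OPENAI_API_BASE
    environment variable is used. Returning None when neither is set
    is an assumption made for this sketch.
    """
    if base_url:
        return base_url
    return os.environ.get("OPENAI_API_BASE")
```

With OPENAI_API_BASE exported, calling resolve_base_url() with no argument returns the exported value, while an explicit argument overrides it.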

Code Reference

Source Location

Repository: EvolvingLMMs_Lab_Lmms_eval
File: lmms_eval/models/whisper_tt.py
Lines: 1-356

Signature

@register_model("whisper_tt")
class WhisperTT(lmms):
    def __init__(
        self,
        pretrained: str = "openai/whisper-large-v3",
        device: str = "cuda",
        device_map: str = "cuda",
        batch_size: int = 1000,
        use_cache: bool = True,
        language: str = "en",
        task: str = "transcribe",
        base_url: str = None,
        timeout: int = 300,
        max_retries: int = 3,
        num_concurrent: int = 1,
        **kwargs,
    ) -> None

    def encode_audio_to_base64_wav(
        self,
        audio_array: np.ndarray,
        sampling_rate: int
    ) -> str

    def transcribe_audio(
        self,
        audio_array: np.ndarray,
        sampling_rate: int
    ) -> str

    async def _generate_audio_transcription(
        self,
        session,
        audio_array: np.ndarray,
        sampling_rate: int,
        audio_index: int = None
    ) -> str

    def generate_until(self, requests: List[Instance]) -> List[str]

Import

from lmms_eval.models.whisper_tt import WhisperTT

I/O Contract

Inputs

Name	Type	Required	Description
pretrained	str	No	HuggingFace model identifier (default: "openai/whisper-large-v3")
base_url	str	No	HTTP endpoint for tt-media-server (default: from OPENAI_API_BASE env var)
language	str	No	Target language code (default: "en")
task	str	No	Task type: "transcribe" or "translate" (default: "transcribe")
batch_size	int	No	Batch size for evaluation (default: 1000)
timeout	int	No	Request timeout in seconds (default: 300)
max_retries	int	No	Maximum retry attempts (default: 3)

Outputs

Name	Type	Description
transcriptions	List[str]	List of transcribed text strings for each audio input

Usage Examples

Basic Evaluation

# Set the API endpoint
export OPENAI_API_BASE="http://127.0.0.1:8000"
export OPENAI_API_KEY="your-secret-key"

# Run evaluation
python -m lmms_eval \
    --model whisper_tt \
    --model_args pretrained=openai/whisper-large-v3,language=en,task=transcribe,base_url=http://127.0.0.1:8000 \
    --tasks librispeech \
    --batch_size 1000 \
    --device cuda:0

Programmatic Usage

from lmms_eval.models.whisper_tt import WhisperTT
import numpy as np

# Initialize model
model = WhisperTT(
    pretrained="openai/whisper-large-v3",
    base_url="http://127.0.0.1:8000",
    language="en",
    task="transcribe",
    max_retries=5,
    timeout=600
)

# Transcribe audio
audio_array = np.random.randn(16000 * 5)  # 5 seconds at 16kHz
sampling_rate = 16000
transcription = model.transcribe_audio(audio_array, sampling_rate)
print(transcription)

Distributed Evaluation

# The model automatically uses Accelerate for multi-GPU setups
accelerate launch --num_processes=4 -m lmms_eval \
    --model whisper_tt \
    --model_args pretrained=openai/whisper-large-v3 \
    --tasks librispeech \
    --batch_size 1000

Implementation Details

Audio Encoding

Audio arrays are converted to float32 (not float64) to prevent "Unsupported bit depth: 64" errors on the server. The audio is written to an in-memory WAV buffer using scipy.io.wavfile, then base64-encoded for HTTP transmission.
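The encoding step can be illustrated with a short sketch. It assumes numpy and scipy are installed; to_base64_wav is a hypothetical standalone function written for illustration and is not the module's encode_audio_to_base64_wav method.

```python
import base64
import io

import numpy as np
from scipy.io import wavfile


def to_base64_wav(audio_array: np.ndarray, sampling_rate: int) -> str:
    """Hypothetical sketch: encode an audio array as a base64 WAV string."""
    # Cast to float32 so the WAV payload is 32-bit, not 64-bit.
    audio_f32 = np.asarray(audio_array, dtype=np.float32)

    # Write the WAV into memory instead of onto disk.
    buffer = io.BytesIO()
    wavfile.write(buffer, sampling_rate, audio_f32)

    # Base64-encode the raw WAV bytes for transport in an HTTP body.
    return base64.b64encode(buffer.getvalue()).decode("utf-8")
```

Decoding the string and reading it back with scipy.io.wavfile.read should yield the original sampling rate and a float32 array of the same length.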

Async Batch Processing

The generate_until method first collects all audio samples, then submits them concurrently using asyncio.gather() with aiohttp sessions. Because the requests overlap while waiting on the server, throughput is higher than when each sample is sent sequentially.
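The gather pattern can be sketched with the standard library alone. In this illustration the HTTP call is replaced by a placeholder coroutine (fake_transcribe), and limiting concurrency with a semaphore is an assumption of the sketch; the source does not state how num_concurrent is applied.

```python
import asyncio
from typing import Awaitable, Callable, List


async def transcribe_all(
    clips: List[str],
    transcribe_one: Callable[[str], Awaitable[str]],
    num_concurrent: int = 4,
) -> List[str]:
    """Hypothetical sketch: run one request per clip concurrently."""
    # Assumption: cap in-flight requests with a semaphore.
    semaphore = asyncio.Semaphore(num_concurrent)

    async def bounded(clip: str) -> str:
        async with semaphore:
            return await transcribe_one(clip)

    # gather() returns results in the same order as the inputs,
    # regardless of which request finishes first.
    return await asyncio.gather(*(bounded(clip) for clip in clips))


async def fake_transcribe(clip: str) -> str:
    """Placeholder for the real HTTP request to the server."""
    await asyncio.sleep(0.01)
    return f"text for {clip}"
```

Running asyncio.run(transcribe_all(["a", "b", "c"], fake_transcribe)) returns one string per clip, in input order, which is what lets results be matched back to their requests.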

Retry Logic

The synchronous transcribe_audio() method makes up to max_retries attempts per request. The asynchronous _generate_audio_transcription() method logs errors but does not retry, to avoid cascading delays across the batch.
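A bounded retry loop of the kind described for the synchronous path might look like the sketch below. The helper name call_with_retries, the delay_seconds parameter, and the choice to raise RuntimeError after the last attempt are all assumptions for illustration; the source does not specify backoff or the final failure behavior.

```python
import time
from typing import Callable, Optional


def call_with_retries(
    request_fn: Callable[[], str],
    max_retries: int = 3,
    delay_seconds: float = 0.0,
) -> str:
    """Hypothetical sketch: call request_fn up to max_retries times."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_retries + 1):
        try:
            return request_fn()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            # Pause only if another attempt remains (assumed behavior).
            if attempt < max_retries:
                time.sleep(delay_seconds)
    raise RuntimeError(f"request failed after {max_retries} attempts") from last_error
```

Here max_retries counts total attempts, matching the wording above; a request that fails twice and then succeeds completes within the default of 3.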

Response Parsing

The server response is expected to be JSON containing a "text", "transcription", or "result" key. If none of these keys is present, the entire response is returned as a string.
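The key lookup can be sketched as a small function. The name extract_transcription is hypothetical, and checking the keys in the order listed above is an assumption about precedence.

```python
from typing import Any


def extract_transcription(payload: Any) -> str:
    """Hypothetical sketch: pull the transcript out of a decoded JSON body."""
    if isinstance(payload, dict):
        # Assumed precedence: first matching key in this order wins.
        for key in ("text", "transcription", "result"):
            if key in payload:
                return str(payload[key])
    # Fallback: return the whole response as a string.
    return str(payload)
```

For example, a body of {"transcription": "hello"} yields "hello", while a body with none of the three keys is returned in its string form.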

Distributed Setup

The wrapper uses Accelerator to detect multi-process setups automatically and to assign the appropriate device index, rank, and world_size to each process.
