Implementation:EvolvingLMMs Lab Lmms eval whisper tt
| Knowledge Sources | |
|---|---|
| Domains | Speech Recognition, Audio Processing, HTTP API Client |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
HTTP API client for evaluating Whisper audio transcription models via a tt-media-server backend.
Description
This module implements WhisperTT, a model wrapper that sends HTTP requests to a tt-media-server instead of executing directly on ttnn/tt-metal. This allows evaluations to run outside Docker containers while still benefiting from TT-NN hardware acceleration. The implementation handles encoding audio as base64 WAV, asynchronous batch transcription requests, retry logic for robustness, and integration with the lmms-eval framework's distributed evaluation system via Accelerator.
Usage
Use this model wrapper to evaluate Whisper models (e.g., whisper-large-v3) on audio transcription tasks, to run evaluations without local Docker/TT-NN dependencies, or to use TT-NN hardware acceleration through a remote API endpoint. Set the OPENAI_API_BASE environment variable to point to your tt-media-server instance, or pass base_url explicitly.
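The endpoint lookup described above can be sketched as follows. This is a minimal illustration, not the wrapper's actual code: resolve_base_url is a hypothetical helper, and raising an error when neither source is set is an assumption (the real module may fall back to a default instead).

```python
import os
from typing import Optional


def resolve_base_url(base_url: Optional[str] = None) -> str:
    """Hypothetical helper: prefer the explicit argument, else OPENAI_API_BASE."""
    url = base_url or os.environ.get("OPENAI_API_BASE")
    if not url:
        # Assumption: fail loudly; the real wrapper may use a default instead.
        raise ValueError("Set OPENAI_API_BASE or pass base_url")
    # Drop a trailing slash so request paths can be appended consistently.
    return url.rstrip("/")
```

An explicit base_url takes precedence over the environment variable in this sketch, which matches the "default: from OPENAI_API_BASE env var" wording in the inputs table.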
Code Reference
Source Location
- Repository: EvolvingLMMs_Lab_Lmms_eval
- File: lmms_eval/models/whisper_tt.py
- Lines: 1-356
Signature
@register_model("whisper_tt")
class WhisperTT(lmms):
    def __init__(
        self,
        pretrained: str = "openai/whisper-large-v3",
        device: str = "cuda",
        device_map: str = "cuda",
        batch_size: int = 1000,
        use_cache: bool = True,
        language: str = "en",
        task: str = "transcribe",
        base_url: str = None,
        timeout: int = 300,
        max_retries: int = 3,
        num_concurrent: int = 1,
        **kwargs,
    ) -> None
    def encode_audio_to_base64_wav(
        self,
        audio_array: np.ndarray,
        sampling_rate: int
    ) -> str
    def transcribe_audio(
        self,
        audio_array: np.ndarray,
        sampling_rate: int
    ) -> str
    async def _generate_audio_transcription(
        self,
        session,
        audio_array: np.ndarray,
        sampling_rate: int,
        audio_index: int = None
    ) -> str
    def generate_until(self, requests: List[Instance]) -> List[str]
Import
from lmms_eval.models.whisper_tt import WhisperTT
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pretrained | str | No | HuggingFace model identifier (default: "openai/whisper-large-v3") |
| base_url | str | No | HTTP endpoint for tt-media-server (default: from OPENAI_API_BASE env var) |
| language | str | No | Target language code (default: "en") |
| task | str | No | Task type: "transcribe" or "translate" (default: "transcribe") |
| batch_size | int | No | Batch size for evaluation (default: 1000) |
| timeout | int | No | Request timeout in seconds (default: 300) |
| max_retries | int | No | Maximum retry attempts (default: 3) |
Outputs
| Name | Type | Description |
|---|---|---|
| transcriptions | List[str] | List of transcribed text strings for each audio input |
Usage Examples
Basic Evaluation
# Set the API endpoint
export OPENAI_API_BASE="http://127.0.0.1:8000"
export OPENAI_API_KEY="your-secret-key"
# Run evaluation
python -m lmms_eval \
--model whisper_tt \
--model_args pretrained=openai/whisper-large-v3,language=en,task=transcribe,base_url=http://127.0.0.1:8000 \
--tasks librispeech \
--batch_size 1000 \
--device cuda:0
Programmatic Usage
from lmms_eval.models.whisper_tt import WhisperTT
import numpy as np
# Initialize model
model = WhisperTT(
    pretrained="openai/whisper-large-v3",
    base_url="http://127.0.0.1:8000",
    language="en",
    task="transcribe",
    max_retries=5,
    timeout=600,
)
# Transcribe audio
audio_array = np.random.randn(16000 * 5) # 5 seconds at 16kHz
sampling_rate = 16000
transcription = model.transcribe_audio(audio_array, sampling_rate)
print(transcription)
Distributed Evaluation
# The model automatically uses Accelerate for multi-GPU setups
accelerate launch --num_processes=4 -m lmms_eval \
--model whisper_tt \
--model_args pretrained=openai/whisper-large-v3 \
--tasks librispeech \
--batch_size 1000
Implementation Details
Audio Encoding
Audio arrays are converted to float32 (not float64) to prevent "Unsupported bit depth: 64" errors on the server. The audio is written to an in-memory WAV buffer using scipy.io.wavfile, then base64-encoded for HTTP transmission.
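The encoding step can be sketched as a standalone function. This is a minimal illustration that assumes NumPy and SciPy are installed; it mirrors the method name from the signature above but is not the module's actual implementation.

```python
import base64
import io

import numpy as np
from scipy.io import wavfile


def encode_audio_to_base64_wav(audio_array: np.ndarray, sampling_rate: int) -> str:
    # Cast to float32: a float64 array would be written as a 64-bit WAV,
    # which the server is described as rejecting.
    audio = np.asarray(audio_array, dtype=np.float32)
    # scipy.io.wavfile.write accepts a file-like object, so no temp file is needed.
    buffer = io.BytesIO()
    wavfile.write(buffer, sampling_rate, audio)
    # Base64-encode the raw WAV bytes so they can travel in a JSON payload.
    return base64.b64encode(buffer.getvalue()).decode("utf-8")
```

Decoding the returned string with base64.b64decode yields a regular WAV byte stream that scipy.io.wavfile.read can load back as a float32 array.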
Async Batch Processing
The generate_until method first collects all audio samples, then issues the transcription requests concurrently using asyncio.gather() with an aiohttp session. Because the requests overlap while waiting on the server, throughput is higher than with sequential processing.
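The gather pattern can be sketched without a live server. In the example below, fake_transcribe is a hypothetical stand-in for one HTTP request (the real code would post through an aiohttp session), and the delays are arbitrary; the point is only that asyncio.gather() runs the coroutines concurrently and returns results in input order, regardless of which finishes first.

```python
import asyncio
from typing import List


async def fake_transcribe(audio_index: int, delay: float) -> str:
    # Hypothetical stand-in for one HTTP transcription request.
    await asyncio.sleep(delay)
    return f"transcript-{audio_index}"


async def transcribe_all(delays: List[float]) -> List[str]:
    # Build one coroutine per audio sample, then await them together.
    tasks = [fake_transcribe(i, d) for i, d in enumerate(delays)]
    # gather() preserves the order of its arguments in the result list.
    return await asyncio.gather(*tasks)


results = asyncio.run(transcribe_all([0.03, 0.01, 0.02]))
```

Because results stay aligned with inputs, each transcription can be matched back to its request by position.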
Retry Logic
The synchronous transcribe_audio() retries a failed request, making up to max_retries attempts. The asynchronous _generate_audio_transcription() logs errors but does not retry, to avoid cascading delays.
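A bounded retry loop of the kind described for the synchronous path might look like the sketch below. The helper name call_with_retries, the delay_s parameter, and raising RuntimeError once attempts are exhausted are all assumptions for illustration; the source does not state what the real method does after the final failure.

```python
import time
from typing import Callable, Optional


def call_with_retries(
    request_fn: Callable[[], str],
    max_retries: int = 3,
    delay_s: float = 0.0,
) -> str:
    """Hypothetical helper: call request_fn up to max_retries times."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_retries + 1):
        try:
            return request_fn()
        except Exception as exc:
            last_error = exc
            # Pause before the next attempt, but not after the final one.
            if attempt < max_retries:
                time.sleep(delay_s)
    # Assumption: surface the last failure; the real code may return a
    # placeholder string instead.
    raise RuntimeError(f"failed after {max_retries} attempts") from last_error
```

Passing the request as a callable keeps the retry policy separate from the HTTP details, so the same loop works for any request function.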
Response Parsing
The server response is expected to be JSON with a "text", "transcription", or "result" key. If none are found, the entire response is returned as a string.
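The fallback described above can be sketched as a small function. extract_transcription is a hypothetical name, and the precedence among the three keys (checked in the order listed) is an assumption; the source only says that any of them may be present.

```python
from typing import Any


def extract_transcription(response_body: Any) -> str:
    """Hypothetical helper: pull the transcript out of a parsed JSON response."""
    if isinstance(response_body, dict):
        # Assumption: keys are checked in the order the text lists them.
        for key in ("text", "transcription", "result"):
            if key in response_body:
                return str(response_body[key])
    # No known key found: return the whole response as a string.
    return str(response_body)
```

For example, extract_transcription({"text": "hello"}) returns "hello", while a response with none of the expected keys comes back as its string representation.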
Distributed Setup
Uses Accelerator to automatically detect multi-process setups and assign appropriate device indices and rank/world_size values.