Principle:Groq Groq python Audio Translation Request
| Knowledge Sources | |
|---|---|
| Domains | Audio, Translation |
| Last Updated | 2026-02-15 16:00 GMT |
Overview
Principle governing the translation of non-English audio content into English text using speech recognition models.
Description
Audio Translation converts spoken content in any supported language into English text. Unlike transcription (which preserves the original language), translation always produces English output. The process accepts audio input as either a file upload or a URL reference, sends it to a Whisper-family model hosted on Groq's inference infrastructure, and returns the translated English text. Key configuration options include model selection (affecting accuracy/speed tradeoffs), output format (JSON, plain text, or verbose JSON with timestamps), sampling temperature (controlling output randomness), and an optional English-language prompt to guide translation style.
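The request described above can be sketched with the `groq` Python SDK. This is a minimal sketch, not a definitive implementation: the helper names `build_translation_request` and `send_translation_request`, the `file_path` key, and the validation rules (exactly one audio source, temperature between 0 and 1) are assumptions made for illustration. The `url` argument is assumed to be accepted by the installed SDK version.

```python
from pathlib import Path
from typing import Any, Dict, Optional

# Output formats named in the description above.
ALLOWED_FORMATS = ("json", "text", "verbose_json")


def build_translation_request(
    model: str = "whisper-large-v3",
    file_path: Optional[str] = None,
    url: Optional[str] = None,
    response_format: str = "json",
    temperature: float = 0.0,
    prompt: Optional[str] = None,
) -> Dict[str, Any]:
    """Validate the options and collect them into one dict (hypothetical helper)."""
    if (file_path is None) == (url is None):
        raise ValueError("provide exactly one of file_path or url")
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"response_format must be one of {ALLOWED_FORMATS}")
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature is assumed to lie in [0, 1]")

    request: Dict[str, Any] = {
        "model": model,
        "response_format": response_format,
        "temperature": temperature,
    }
    if prompt:
        request["prompt"] = prompt  # English text that guides style and vocabulary
    if url is not None:
        request["url"] = url
    else:
        request["file_path"] = file_path
    return request


def send_translation_request(request: Dict[str, Any]) -> str:
    """Send the request; needs `pip install groq` and GROQ_API_KEY in the environment."""
    from groq import Groq  # imported lazily so the builder works without the SDK

    kwargs = dict(request)
    file_path = kwargs.pop("file_path", None)
    if file_path is not None:
        path = Path(file_path)
        kwargs["file"] = (path.name, path.read_bytes())  # (filename, raw bytes)

    result = Groq().audio.translations.create(**kwargs)
    # Assumption: JSON formats return an object with `.text`; plain text may be a str.
    return getattr(result, "text", result)
```

A call such as `send_translation_request(build_translation_request(file_path="meeting_fr.m4a"))` would then return the English text, where `meeting_fr.m4a` is a placeholder file name.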
Usage
Apply this principle when you need to convert non-English audio (meetings, podcasts, recordings) into English text. Choose audio translation over transcription when the source language differs from English and you need English output. For same-language transcription, use the Audio Transcription Request principle instead.
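The choice between the two principles can be written as a small decision helper. The function name `choose_audio_task` and the use of ISO 639-1 codes such as `"en"` are assumptions made for this sketch; they are not part of the principle itself.

```python
from typing import Optional


def choose_audio_task(source_language: Optional[str], want_english_output: bool) -> str:
    """Return "translation" or "transcription" for a given audio job.

    source_language is an ISO 639-1 code such as "fr", or None when unknown.
    """
    # Translation only makes sense when English output is wanted and the
    # audio is not already known to be English.
    if want_english_output and source_language != "en":
        return "translation"
    # Same-language output (including English audio) is a transcription job.
    return "transcription"
```

Treating an unknown source language as a translation job when English output is wanted is a design choice of this sketch, on the reasoning that translation produces English text in either case.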
Theoretical Basis
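Audio translation can be viewed conceptually as a two-stage pipeline, although a Whisper model performs both stages in a single encoder-decoder pass: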
# Abstract algorithm (conceptual; `model.encode` and `model.decode` are
# placeholder methods, not a real library API)
def translate_audio(audio, model, params):
    # Stage 1: Speech recognition
    # The Whisper model processes audio features and decodes tokens
    features = model.encode(audio)

    # Stage 2: Cross-lingual generation
    # Unlike transcription, the decoder is conditioned to produce English tokens
    # regardless of the input language
    # The model uses:
    # - Log-mel spectrogram features from the audio
    # - Language detection (implicit)
    # - English-conditioned decoding
    english_text = model.decode(features, task="translate", **params)
    return english_text
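The English-conditioned decoding step above can be made concrete. In the open-source Whisper models, the task is selected by special tokens placed at the start of the decoder sequence. The sketch below builds that prefix. The helper name `decoder_prefix` is hypothetical, and how Groq's hosted model is driven internally is not visible from the API, so treat this as an illustration of the open-source model only.

```python
from typing import List

VALID_TASKS = ("transcribe", "translate")


def decoder_prefix(language: str, task: str, timestamps: bool = False) -> List[str]:
    """Build the special-token prefix that conditions the Whisper decoder.

    language is the detected or supplied source language code, e.g. "fr".
    """
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}")
    # The language token names the SOURCE language; the task token decides
    # whether the decoder emits that language or English.
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prefix.append("<|notimestamps|>")
    return prefix
```

Only the task token differs between the two modes: the same French audio with `task="transcribe"` yields French text, and with `task="translate"` yields English text.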
Key parameters:
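- Model selection: whisper-large-v3 (higher accuracy) vs whisper-large-v3-turbo (faster); confirm in Groq's model documentation that the chosen model supports the translation task, since the turbo variant is tuned primarily for transcription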
- Temperature: 0 starts with greedy decoding and falls back to higher temperatures when log-probability thresholds are not met; higher explicit values increase output diversity
- Prompt conditioning: English text that biases the decoder toward specific vocabulary or style
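The temperature fallback mentioned in the list above can be sketched as a retry loop. The temperature schedule and the two thresholds below follow the defaults of the open-source Whisper reference code. Whether Groq's service uses the same values is an assumption, and `decode_once` is a hypothetical stand-in for one decoding pass of the model.

```python
import zlib
from typing import Callable, Sequence, Tuple


def compression_ratio(text: str) -> float:
    """High values indicate repetitive output, a common failure mode."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(
    decode_once: Callable[[float], Tuple[str, float]],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    logprob_threshold: float = -1.0,
    compression_ratio_threshold: float = 2.4,
) -> Tuple[str, float]:
    """Try each temperature in order until the output passes both checks.

    decode_once(temperature) returns (text, average token log-probability).
    Returns the accepted text and the temperature that produced it; if no
    attempt passes, the last attempt is returned.
    """
    text, used = "", temperatures[0]
    for temperature in temperatures:
        text, avg_logprob = decode_once(temperature)
        used = temperature
        too_repetitive = compression_ratio(text) > compression_ratio_threshold
        too_uncertain = avg_logprob < logprob_threshold
        if not (too_repetitive or too_uncertain):
            break  # output looks acceptable, stop raising the temperature
    return text, used
```

Starting at temperature 0 keeps the output deterministic whenever the greedy result is good enough, and sampling is used only as a recovery path.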