Principle: Groq Python Audio Transcription Request
| Knowledge Sources | |
|---|---|
| Domains | Audio, Speech_Recognition |
| Last Updated | 2026-02-15 16:00 GMT |
Overview
The process of submitting audio data to a speech recognition model for automatic conversion to text.
Description
Audio Transcription uses a speech recognition model (such as OpenAI Whisper) to convert audio content into text. The request includes the audio file, model selection, and optional parameters for language, prompt guidance, output format, and timestamp granularity.
Key features:
- Language specification: Supplying the audio's ISO 639-1 language code (e.g. `en`) improves accuracy
- Prompt guidance: An optional text prompt steers the transcription style and vocabulary
- Output formats: Plain text, JSON, or verbose JSON with timestamps
- Timestamp granularity: Word-level or segment-level timing information
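As a sketch, the optional features above can be collected into a single request payload. The parameter names used here (`language`, `prompt`, `response_format`, `timestamp_granularities`) mirror the OpenAI-compatible transcription API that Groq exposes, but treat the exact names and the model name as assumptions to check against the SDK reference:

```python
def build_transcription_params(model, language=None, prompt=None,
                               response_format="json",
                               timestamp_granularities=None):
    """Collect optional transcription parameters, dropping unset ones."""
    params = {"model": model, "response_format": response_format}
    if language is not None:
        params["language"] = language          # ISO 639-1 code, e.g. "en"
    if prompt is not None:
        params["prompt"] = prompt              # steers style and vocabulary
    if timestamp_granularities:
        # Word/segment timestamps require the verbose JSON output format.
        if response_format != "verbose_json":
            raise ValueError("timestamps require response_format='verbose_json'")
        params["timestamp_granularities"] = list(timestamp_granularities)
    return params

# Assumed model name for illustration; pick whichever Whisper model Groq hosts.
params = build_transcription_params(
    "whisper-large-v3",
    language="en",
    response_format="verbose_json",
    timestamp_granularities=["word"],
)
```

Unset options are omitted rather than sent as `None`, so the request carries only the parameters the caller actually chose.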
Usage
Use this principle when you need to convert audio recordings (meetings, interviews, podcasts, voice notes) to text. Specify the language if known for improved accuracy.
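A minimal helper for that workflow might look like the following. The `groq` SDK, the model name, and the keyword arguments are assumptions based on Groq's OpenAI-compatible audio endpoint; verify them against the SDK documentation before use:

```python
def transcribe(path, language=None, prompt=None):
    """Submit an audio file for transcription and return the transcript."""
    from groq import Groq            # assumed: `pip install groq`
    client = Groq()                  # reads GROQ_API_KEY from the environment
    with open(path, "rb") as f:
        return client.audio.transcriptions.create(
            file=f,                          # audio recording (e.g. wav, m4a)
            model="whisper-large-v3",        # assumed Groq-hosted Whisper model
            language=language,               # ISO 639-1 code if known
            prompt=prompt,                   # optional style/vocabulary hint
            response_format="text",          # plain-text transcript
        )
```

For an English meeting recording this would be called as `transcribe("meeting.m4a", language="en")` (hypothetical file name).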
Theoretical Basis
Audio transcription uses an encoder-decoder transformer architecture (as in Whisper). The audio waveform is converted to mel-spectrogram features, processed by the encoder, then decoded autoregressively into text tokens:
```python
# Abstract transcription pipeline (pseudocode; these functions are
# stand-ins for Whisper's internal stages, not a real API)
spectrogram = audio_to_mel(audio_file)   # waveform -> log-mel spectrogram
features = encoder(spectrogram)          # transformer encoder pass
text = decoder(features, language=lang, prompt=prompt)  # autoregressive decode
```
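The autoregressive decoding step above can be illustrated with a toy greedy loop: at each step the decoder scores candidate tokens and the highest-scoring one is emitted, until an end token is produced. Everything here (the token scores, the end token, the function names) is invented for illustration and is not Whisper's real interface:

```python
END = "<eot>"  # invented end-of-transcript token

def toy_greedy_decode(step_scores_list):
    """Greedily pick the highest-scoring token at each decoding step."""
    tokens = []
    for step_scores in step_scores_list:     # one dict of scores per step
        token = max(step_scores, key=step_scores.get)
        if token == END:                     # stop when the model emits EOT
            break
        tokens.append(token)
    return " ".join(tokens)

# Hand-made per-step scores standing in for real decoder logits.
steps = [
    {"hello": 2.1, "hi": 1.3, END: -1.0},
    {"world": 1.8, END: 0.2},
    {END: 3.0, "again": 0.5},
]
text = toy_greedy_decode(steps)
# text == "hello world"
```

Real decoders condition each step on the encoder features and previously emitted tokens (and on the `language`/`prompt` inputs), but the stop-on-end-token control flow is the same.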