Workflow:Groq Groq python Audio Transcription
| Knowledge Sources | |
|---|---|
| Domains | Audio, Speech_to_Text, Inference |
| Last Updated | 2026-02-15 16:00 GMT |
Overview
End-to-end process for transcribing audio files to text using Groq-hosted Whisper models.
Description
This workflow converts audio recordings into text using Groq's audio transcription API. It uses Whisper-based models (whisper-large-v3 or whisper-large-v3-turbo) to process an uploaded audio file and return text output. The workflow covers file upload via multipart form data, language detection or specification, and response format selection. It supports multiple audio formats and over 50 languages.
Usage
Execute this workflow when you have an audio file (speech recording, podcast, meeting, etc.) and need to produce a text transcription. This is appropriate for speech-to-text pipelines, meeting transcription services, subtitle generation, or any application that needs to convert spoken language to written text.
Execution Steps
Step 1: Client Initialization
Instantiate the Groq client with authentication credentials. The transcription API uses the same client as chat completions, so configuration (API key, timeouts, retries) is shared.
Key considerations:
- The same Groq() or AsyncGroq() client is used for all API endpoints
- File uploads may require longer timeouts depending on audio file size
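A minimal sketch of this step, assuming the groq package is installed and the API key is set in the GROQ_API_KEY environment variable, which the client reads by default. The helper names and the size-based timeout heuristic are hypothetical; they only illustrate passing timeout and max_retries to the client constructor.

```python
def build_client_kwargs(file_size_bytes: int = 0) -> dict:
    """Return constructor options for the Groq client.

    Hypothetical heuristic (not part of the SDK): a 60 second base
    timeout plus one extra second per MiB of audio to upload.
    """
    extra_seconds = file_size_bytes / (1024 * 1024)
    return {"timeout": 60.0 + extra_seconds, "max_retries": 2}


def make_client(file_size_bytes: int = 0):
    """Create a synchronous client; AsyncGroq takes the same options."""
    # Imported lazily so build_client_kwargs works without the SDK installed.
    from groq import Groq

    # No api_key argument: the client falls back to GROQ_API_KEY.
    return Groq(**build_client_kwargs(file_size_bytes))
```

The same client instance can then be reused for chat completions and transcription calls, so the timeout chosen here applies to both.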
Step 2: Audio File Preparation
Prepare the audio file for upload. The file can be provided as a file path (PathLike), raw bytes, or a tuple of (filename, contents, media_type). The file should be in a supported audio format such as mp3, wav, or flac. Instead of uploading a file, a URL pointing to the audio can be provided.
Key considerations:
- File can be a Path object, bytes, or (filename, content, media_type) tuple
- Provide either the file parameter or the url parameter, not both
- Audio files should be in a supported format (mp3, wav, flac, etc.)
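The input options above can be sketched as a small helper that returns the keyword argument to pass to the transcription call. The function name prepare_audio_input and its parameters are hypothetical; the media type is guessed from the filename with the standard library, which is an assumption about what is convenient rather than something the API requires.

```python
import mimetypes
from pathlib import Path
from typing import Optional


def prepare_audio_input(
    *,
    path: Optional[str] = None,
    data: Optional[bytes] = None,
    filename: str = "audio.mp3",
    url: Optional[str] = None,
) -> dict:
    """Return {"file": ...} or {"url": ...} for exactly one audio source."""
    provided = sum(value is not None for value in (path, data, url))
    if provided != 1:
        raise ValueError("provide exactly one of path, data, or url")

    if url is not None:
        # Remote audio: no upload, the service fetches the URL.
        return {"url": url}
    if path is not None:
        # PathLike input: the SDK reads the file when sending the request.
        return {"file": Path(path)}

    # Raw bytes: wrap in a (filename, contents, media_type) tuple.
    media_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    return {"file": (filename, data, media_type)}
```

Returning a dict keeps the file and url cases interchangeable for the caller, who can unpack the result straight into the request.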
Step 3: Transcription Request
Call the audio transcriptions create endpoint (client.audio.transcriptions.create in the Python SDK) with the audio file, the model, and any optional parameters. The model processes the audio and returns the transcription. Optional parameters include a language hint, the response format, the sampling temperature, and the timestamp granularity.
Key considerations:
- Model must be a valid Whisper model (whisper-large-v3, whisper-large-v3-turbo)
- Language parameter is optional; the model auto-detects if not specified
- Response format options affect the structure of the returned data
- Temperature controls randomness in the transcription output
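A sketch of the request step, assuming a client created with Groq() and an audio_input dict holding either a file or a url entry. The helpers build_request_params and transcribe are hypothetical names. The checks that temperature lies between 0 and 1 and that timestamp granularities require the verbose_json response format reflect the public API documentation as understood here, and should be treated as assumptions.

```python
from typing import Any, Dict, List, Optional

# Model names as listed in this workflow's description.
WHISPER_MODELS = ("whisper-large-v3", "whisper-large-v3-turbo")


def build_request_params(
    model: str = "whisper-large-v3-turbo",
    language: Optional[str] = None,
    response_format: str = "json",
    temperature: float = 0.0,
    timestamp_granularities: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Validate options and return keyword arguments for the request."""
    if model not in WHISPER_MODELS:
        raise ValueError(f"unsupported model: {model}")
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0 and 1")

    params: Dict[str, Any] = {
        "model": model,
        "response_format": response_format,
        "temperature": temperature,
    }
    if language is not None:
        # Omitting language leaves detection to the model.
        params["language"] = language
    if timestamp_granularities:
        # Assumption: timestamps are only returned with verbose_json.
        if response_format != "verbose_json":
            raise ValueError("timestamps require response_format='verbose_json'")
        params["timestamp_granularities"] = timestamp_granularities
    return params


def transcribe(client: Any, audio_input: Dict[str, Any], **options: Any) -> Any:
    """Send the request; audio_input is {"file": ...} or {"url": ...}."""
    params = build_request_params(**options)
    return client.audio.transcriptions.create(**audio_input, **params)
```

For example, transcribe(client, {"file": Path("meeting.mp3")}, language="en") would request a plain JSON transcription with an English language hint.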
Step 4: Result Extraction
Parse the Transcription response object to extract the transcribed text. The response contains the text field with the full transcription. Depending on the response format requested, additional metadata such as word-level timestamps may be available.
Key considerations:
- The primary output is the text field on the Transcription object
- Timestamp granularity options provide word-level or segment-level timing
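A sketch of the extraction step. Reading the text attribute follows the description above; the segments and words attributes, and their start, end, text, and word fields, are assumptions based on the usual Whisper verbose_json layout and may only be present when that format and the matching timestamp granularity were requested. The helper names are hypothetical.

```python
from typing import Any, List, Tuple


def _field(item: Any, key: str) -> Any:
    """Read a field from either a dict or an attribute-style object."""
    if isinstance(item, dict):
        return item.get(key)
    return getattr(item, key, None)


def extract_text(response: Any) -> str:
    """Return the full transcription with surrounding whitespace removed."""
    return response.text.strip()


def extract_timings(response: Any, level: str = "segments") -> List[Tuple[Any, Any, Any]]:
    """Return (start, end, label) rows for "segments" or "words".

    Returns an empty list when the response carries no timing metadata.
    """
    items = getattr(response, level, None) or []
    label_key = "text" if level == "segments" else "word"
    return [
        (_field(item, "start"), _field(item, "end"), _field(item, label_key))
        for item in items
    ]
```

Rows in this shape map directly onto subtitle cues, with the start and end times in seconds and the label as the cue text.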