Workflow:Groq Groq python Audio Transcription

Knowledge Sources	Groq Python SDK Groq API Docs
Domains	Audio, Speech_to_Text, Inference
Last Updated	2026-02-15 16:00 GMT

Overview

End-to-end process for transcribing audio files to text using Groq-hosted Whisper models.

Description

This workflow converts audio recordings into text using Groq's audio transcription API. It uses Whisper-based models (whisper-large-v3 or whisper-large-v3-turbo) to process an uploaded audio file and return the transcribed text. The workflow covers file upload via multipart form data, automatic language detection or an explicit language setting, and response format selection. It supports multiple audio formats and over 50 languages.

Usage

Execute this workflow when you have an audio file (speech recording, podcast, meeting, etc.) and need to produce a text transcription. This is appropriate for speech-to-text pipelines, meeting transcription services, subtitle generation, or any application that needs to convert spoken language to written text.

Execution Steps

Step 1: Client Initialization

Instantiate the Groq client with authentication credentials. The transcription API uses the same client as chat completions, so configuration (API key, timeouts, retries) is shared.

Key considerations:

Same Groq() or AsyncGroq() client used for all API endpoints
File uploads may require longer timeouts depending on audio file size
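
A minimal sketch of the initialization step, assuming the API key is supplied through the GROQ_API_KEY environment variable. The helper names client_options and make_client are hypothetical, and the 120-second timeout is an arbitrary illustrative value for larger uploads, not a recommendation from the SDK.

```python
import os


def client_options(timeout_seconds: float = 120.0, max_retries: int = 2) -> dict:
    """Collect constructor options that Groq() and AsyncGroq() both accept."""
    api_key = os.environ.get("GROQ_API_KEY")
    if not api_key:
        raise RuntimeError("GROQ_API_KEY is not set")
    return {
        "api_key": api_key,
        "timeout": timeout_seconds,  # seconds; raise this for large audio files
        "max_retries": max_retries,
    }


def make_client(timeout_seconds: float = 120.0, max_retries: int = 2):
    """Build a synchronous client; the same options work for AsyncGroq."""
    # Imported lazily so client_options() can be used without the SDK installed.
    from groq import Groq

    return Groq(**client_options(timeout_seconds, max_retries))
```

For a long recording, something like make_client(timeout_seconds=300.0) gives the upload more time before the request is abandoned.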

Step 2: Audio File Preparation

Prepare the audio file for upload. The file can be provided as a file path (PathLike), raw bytes, or a tuple of (filename, content, media_type). Supported formats include common audio types such as mp3, wav, and flac. Alternatively, pass a URL to the audio instead of uploading a file.

Key considerations:

File can be a Path object, bytes, or (filename, content, media_type) tuple
Exactly one of the file and url parameters must be provided, never both
Audio files should be in a supported format (mp3, wav, flac, etc.)
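
The preparation rules above can be sketched as a small helper that returns either a file or a url argument. The name audio_source, the default filename, and the MEDIA_TYPES table are hypothetical; the table covers only the three formats named here and would need extending to match the full list in the Groq API docs.

```python
from pathlib import Path
from typing import Optional, Union

# Illustrative subset of formats (an assumption); extend as needed.
MEDIA_TYPES = {".mp3": "audio/mpeg", ".wav": "audio/wav", ".flac": "audio/flac"}


def audio_source(
    file: Optional[Union[str, Path, bytes]] = None,
    url: Optional[str] = None,
    filename: str = "audio.mp3",
) -> dict:
    """Return {'file': ...} or {'url': ...}, enforcing exactly one source."""
    if (file is None) == (url is None):
        raise ValueError("provide exactly one of file or url")
    if url is not None:
        return {"url": url}

    if isinstance(file, bytes):
        # Raw bytes carry no name, so the caller supplies one.
        name, data = filename, file
    else:
        path = Path(file)
        name, data = path.name, path.read_bytes()

    suffix = Path(name).suffix.lower()
    if suffix not in MEDIA_TYPES:
        raise ValueError(f"unsupported audio format: {suffix or name}")
    # (filename, content, media_type) tuple form of the upload
    return {"file": (name, data, MEDIA_TYPES[suffix])}
```

Returning a dict keeps the "file or url, not both" rule in one place, so the result can be unpacked directly into the request call.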

Step 3: Transcription Request

Call client.audio.transcriptions.create with the audio source (file or url), the model, and any optional parameters. The model processes the audio and returns the transcription. Optional parameters include a language hint (language), the response format (response_format), the sampling temperature (temperature), and the timestamp granularity (timestamp_granularities).

Key considerations:

Model must be a valid Whisper model (whisper-large-v3, whisper-large-v3-turbo)
Language parameter is optional; the model auto-detects if not specified
Response format options affect the structure of the returned data
Temperature controls sampling randomness; lower values give more deterministic transcriptions
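
A hedged sketch of the request step follows. The helpers request_params and transcribe are hypothetical wrappers around client.audio.transcriptions.create. The set of response formats ("json", "text", "verbose_json") and the rule that timestamps need "verbose_json" reflect my reading of the Groq API docs and should be checked against the current reference.

```python
from typing import Optional, Sequence

MODELS = {"whisper-large-v3", "whisper-large-v3-turbo"}
FORMATS = {"json", "text", "verbose_json"}  # assumption: formats the API accepts


def request_params(
    model: str,
    language: Optional[str] = None,
    response_format: str = "json",
    temperature: float = 0.0,
    timestamp_granularities: Optional[Sequence[str]] = None,
) -> dict:
    """Validate the options and build keyword arguments for the create call."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    if response_format not in FORMATS:
        raise ValueError(f"unknown response_format: {response_format}")

    params = {
        "model": model,
        "response_format": response_format,
        "temperature": temperature,
    }
    if language is not None:
        params["language"] = language  # omit to let the model auto-detect
    if timestamp_granularities:
        # Assumption: timestamps are only returned with verbose_json.
        if response_format != "verbose_json":
            raise ValueError("timestamps require response_format='verbose_json'")
        params["timestamp_granularities"] = list(timestamp_granularities)
    return params


def transcribe(client, source: dict, **options):
    """source is {'file': ...} or {'url': ...}; options go to request_params."""
    return client.audio.transcriptions.create(**source, **request_params(**options))
```

With a real client, a call might look like transcribe(client, {"url": "https://example.com/talk.mp3"}, model="whisper-large-v3-turbo", language="en"), where the URL is a placeholder.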

Step 4: Result Extraction

Read the transcribed text from the Transcription response object: its text field holds the full transcription. Depending on the response format requested, additional metadata such as word-level or segment-level timestamps may also be available.

Key considerations:

The primary output is the text field on the Transcription object
Timestamp granularity options provide word-level or segment-level timing
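
The extraction step might look like the sketch below. It assumes the response exposes a text field and, for verbose output, a segments list whose entries carry start, end, and text. Those segment field names are an assumption about the verbose response shape, and the helpers get_field and extract are hypothetical. Plain strings and dicts are handled as well, so the same code works on parsed JSON.

```python
def get_field(obj, name, default=None):
    """Read a field from either a dict or an attribute-style object."""
    if isinstance(obj, dict):
        return obj.get(name, default)
    return getattr(obj, name, default)


def extract(response) -> dict:
    """Return the transcript text plus any (start, end, text) segments."""
    if isinstance(response, str):
        text = response  # a plain-text response has no metadata
    else:
        text = get_field(response, "text", "") or ""

    segments = get_field(response, "segments") or []
    return {
        "text": text.strip(),
        "segments": [
            (get_field(s, "start"), get_field(s, "end"),
             (get_field(s, "text", "") or "").strip())
            for s in segments
        ],
    }
```

When no timestamps were requested, segments is simply an empty list, so callers can treat both response shapes uniformly.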

GitHub URL

Workflow Repository