Workflow:Elevenlabs Elevenlabs python Speech to Text Transcription

Knowledge Sources	ElevenLabs Python SDK, ElevenLabs API Reference
Domains	Speech_to_Text, Audio_Processing, Real_Time_Streaming
Last Updated	2026-02-15 12:00 GMT

Overview

End-to-end process for converting audio into text using the ElevenLabs Speech-to-Text API, supporting both batch file transcription and real-time WebSocket-based streaming transcription (Scribe).

Description

This workflow covers the two modes of speech-to-text conversion available in the ElevenLabs SDK. Batch mode accepts an audio file and returns a complete transcription with word-level timing information. Real-time mode (Scribe) establishes a WebSocket connection for streaming audio input, providing partial and committed transcripts as audio is processed. The real-time mode supports both URL-based audio streaming and manual audio chunk submission, with configurable audio format and sample rate parameters.

Usage

Execute this workflow when you need to convert spoken audio into text. Use batch mode for pre-recorded audio files (interviews, meetings, podcasts). Use real-time mode for live transcription of streaming audio (live events, phone calls, real-time captioning). The SDK supports multichannel audio for speaker-separated transcription scenarios.

Execution Steps

Step 1: Client Initialization

Create an ElevenLabs client instance with API key authentication. The speech-to-text client is automatically available as a sub-client on the main ElevenLabs client. The custom STT client extends the auto-generated client with real-time transcription capabilities via the ScribeRealtime class.

Key considerations:

The speech_to_text sub-client is lazily loaded on first access
Real-time STT (Scribe) is accessed via the realtime property on the STT client
API key is extracted from client headers and passed to the ScribeRealtime instance
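A minimal sketch of this step, assuming the SDK exposes its main client as ElevenLabs in elevenlabs.client and accepts an api_key argument. The helper names resolve_api_key and make_client, and the choice of the ELEVENLABS_API_KEY environment variable as a fallback, are illustrative and do not come from the workflow itself.

```python
import os


def resolve_api_key(explicit_key=None, env_var="ELEVENLABS_API_KEY"):
    """Return the explicit key if given, otherwise the value of env_var."""
    key = explicit_key or os.environ.get(env_var)
    if not key:
        raise ValueError(f"No API key given and {env_var} is not set")
    return key


def make_client(api_key=None):
    """Build the main client (hypothetical helper, not part of the SDK)."""
    # Imported lazily so this module still loads where the SDK is not installed.
    from elevenlabs.client import ElevenLabs

    return ElevenLabs(api_key=resolve_api_key(api_key))
```

With a client in hand, the batch sub-client would be reached as client.speech_to_text and the real-time entry point as client.speech_to_text.realtime, as described in the considerations above.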

Step 2: Audio Source Selection

Determine the audio input source and transcription mode. For batch transcription, prepare an audio file (MP3, WAV, etc.). For real-time transcription, choose between URL-based streaming (the API fetches audio from a URL) or manual chunk submission (your application sends PCM audio data).

Key considerations:

Batch mode accepts file paths or file-like objects
Real-time URL mode streams audio directly from a provided URL
Manual chunk mode requires specifying the audio format (e.g., PCM at 16 kHz) and the sample rate
Multichannel mode is available for speaker-separated transcription
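For manual chunk submission, the application has to cut its raw audio into pieces before sending them. The sketch below shows one way to do that for uncompressed PCM. The function name chunk_pcm and the 100 ms default are assumptions chosen for illustration. How each chunk is then encoded and sent is left to the SDK.

```python
def chunk_pcm(pcm_bytes, sample_rate=16000, chunk_ms=100, sample_width=2, channels=1):
    """Yield consecutive chunks of raw PCM, each covering chunk_ms of audio.

    sample_width is bytes per sample (2 for 16-bit PCM). The last chunk may
    be shorter than the others.
    """
    frame_size = sample_width * channels               # bytes per sample frame
    frames_per_chunk = sample_rate * chunk_ms // 1000  # frames in one chunk
    if frames_per_chunk <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    chunk_size = frames_per_chunk * frame_size
    for offset in range(0, len(pcm_bytes), chunk_size):
        yield pcm_bytes[offset:offset + chunk_size]
```

At 16 kHz, 16-bit mono, each 100 ms chunk is 3,200 bytes, so one second of audio yields ten chunks.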

Step 3: Batch Transcription

For pre-recorded audio, call the speech_to_text.convert method with the audio file. The API returns a complete transcription response containing the full text, word-level data (start time, end time, and a confidence score for each word), and character-level timing data.

Key considerations:

The response includes word-level timestamps and confidence scores
Additional metadata may include language detection results
Large files may require extended timeout settings on the client
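The batch call can be wrapped as in the sketch below. It takes any object that exposes speech_to_text.convert, so it can be exercised with a stub as well as a real client. The keyword names file and model_id and the model identifier "scribe_v1" are assumptions about the SDK and should be checked against the installed version. transcribe_file itself is a hypothetical helper.

```python
def transcribe_file(client, path, model_id="scribe_v1", **options):
    """Send one audio file to speech_to_text.convert and return the response.

    Extra keyword arguments are forwarded unchanged, so per-call settings
    can be supplied without editing this helper.
    """
    # Open in binary mode; the file is closed once the call returns.
    with open(path, "rb") as audio_file:
        return client.speech_to_text.convert(
            file=audio_file,
            model_id=model_id,
            **options,
        )
```

For large files, a longer timeout would be passed either when constructing the client or as a per-call option forwarded through options. The exact parameter name depends on the SDK version.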

Step 4: Real-time Transcription Connection

For live audio, connect to the ScribeRealtime WebSocket endpoint. Configure event handlers for partial transcripts (interim results that may change) and committed transcripts (finalized text segments). The connection handles authentication, audio format negotiation, and session management.

Key considerations:

Partial transcripts are preliminary and may be revised
Committed transcripts are final and represent the definitive transcription
Session configuration includes audio format, sample rate, and language settings
Error events include auth errors, rate limiting, queue overflow, and session timeouts
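The handler wiring described in this step can be pictured with a small, SDK-independent dispatcher. It is not the ScribeRealtime implementation. The class TranscriptEventRouter, the message_type field, and the event names used with it are assumptions made for illustration.

```python
from collections import defaultdict


class TranscriptEventRouter:
    """Dispatch decoded messages to handlers registered per event type."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event_type, handler):
        """Register a handler; "*" catches event types with no handler."""
        self._handlers[event_type].append(handler)
        return self  # allow chained registration

    def dispatch(self, message):
        """Call the matching handlers and return how many were invoked."""
        event_type = message.get("message_type", "unknown")
        handlers = self._handlers.get(event_type) or self._handlers.get("*", [])
        for handler in handlers:
            handler(message)
        return len(handlers)
```

Registering one handler for partial transcripts, one for committed transcripts, and a "*" fallback keeps error events (authentication, rate limiting, queue overflow, session timeout) from being silently dropped.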

Step 5: Transcript Processing

Process the transcription results. For batch mode, parse the complete response model containing text, words with timing, and characters with timing. For real-time mode, accumulate committed transcript segments and handle partial updates for live display.
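For the real-time half of this step, one possible accumulation scheme is sketched below: committed segments are appended to a list, while the latest partial is held separately and replaced on every update. TranscriptAccumulator and its method names are hypothetical. The sketch assumes the transcript text has already been extracted from each event.

```python
class TranscriptAccumulator:
    """Keep committed segments and the latest partial for live display."""

    def __init__(self):
        self.committed = []  # finalized segments, in arrival order
        self.partial = ""    # most recent interim text, may still change

    def on_partial(self, text):
        # A partial supersedes the previous partial; it is never appended.
        self.partial = text

    def on_committed(self, text):
        # A committed segment replaces whatever partial preceded it.
        text = text.strip()
        if text:
            self.committed.append(text)
        self.partial = ""

    def final_text(self):
        """Text built from committed segments only."""
        return " ".join(self.committed)

    def display_text(self):
        """Committed text followed by the current partial, if any."""
        pending = self.partial.strip()
        return " ".join(self.committed + ([pending] if pending else []))
```

Keeping the partial out of the committed list means a revised partial never leaves stale words in the stored transcript.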

Key considerations:

Batch results include comprehensive timing data suitable for subtitle generation
Real-time results stream incrementally and should be accumulated by the application
Both modes provide language detection information when available
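As an illustration of the subtitle use case, the sketch below turns word timings into SubRip (SRT) cues. It assumes the words have already been pulled out of the batch response as (text, start_seconds, end_seconds) tuples. Mapping the response model's fields onto that shape is left out because field names may differ between SDK versions. The function names and the fixed words-per-cue grouping are illustrative choices.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_words=7):
    """Group (text, start, end) word tuples into numbered SRT cues."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start, end = group[0][1], group[-1][2]  # cue spans first to last word
        text = " ".join(word[0] for word in group)
        cues.append(
            f"{len(cues) + 1}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{text}\n"
        )
    return "\n".join(cues)
```

A production version would more likely break cues on pauses or punctuation than on a fixed word count.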
