Workflow:Elevenlabs Elevenlabs python Speech to Text Transcription
| Knowledge Sources | |
|---|---|
| Domains | Speech_to_Text, Audio_Processing, Real_Time_Streaming |
| Last Updated | 2026-02-15 12:00 GMT |
Overview
End-to-end process for converting audio into text using the ElevenLabs Speech-to-Text API, supporting both batch file transcription and real-time WebSocket-based streaming transcription (Scribe).
Description
This workflow covers the two modes of speech-to-text conversion available in the ElevenLabs SDK. Batch mode accepts an audio file and returns a complete transcription with word-level timing information. Real-time mode (Scribe) establishes a WebSocket connection for streaming audio input, providing partial and committed transcripts as audio is processed. The real-time mode supports both URL-based audio streaming and manual audio chunk submission, with configurable audio format and sample rate parameters.
Usage
Execute this workflow when you need to convert spoken audio into text. Use batch mode for pre-recorded audio files (interviews, meetings, podcasts). Use real-time mode for live transcription of streaming audio (live events, phone calls, real-time captioning). The SDK supports multichannel audio for speaker-separated transcription scenarios.
Execution Steps
Step 1: Client Initialization
Create an ElevenLabs client instance with API key authentication. The speech-to-text client is automatically available as a sub-client on the main ElevenLabs client. The custom STT client extends the auto-generated client with real-time transcription capabilities via the ScribeRealtime class.
Key considerations:
- The speech_to_text sub-client is lazily loaded on first access
- Real-time STT (Scribe) is accessed via the realtime property on the STT client
- The API key is read from the client's headers and passed to the ScribeRealtime instance
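A minimal sketch of this step, assuming the `elevenlabs` package is installed and exposes `ElevenLabs` in `elevenlabs.client`. The environment variable name `ELEVENLABS_API_KEY` and the helpers `resolve_api_key` and `make_client` are illustrative names, not part of the SDK.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from the environment, or raise if it is missing."""
    env = os.environ if env is None else env
    # Assumption: the key is stored under this variable name.
    key = env.get("ELEVENLABS_API_KEY", "").strip()
    if not key:
        raise RuntimeError("ELEVENLABS_API_KEY is not set")
    return key


def make_client(api_key: Optional[str] = None):
    """Build the main client; the STT sub-client is reached through it."""
    # Imported lazily so this module loads even without the SDK installed.
    from elevenlabs.client import ElevenLabs

    return ElevenLabs(api_key=api_key or resolve_api_key())
```

With the SDK installed, `make_client().speech_to_text` would then give the sub-client described above, and its `realtime` property the Scribe entry point.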
Step 2: Audio Source Selection
Determine the audio input source and transcription mode. For batch transcription, prepare an audio file (MP3, WAV, etc.). For real-time transcription, choose between URL-based streaming (the API fetches audio from a URL) or manual chunk submission (your application sends PCM audio data).
Key considerations:
- Batch mode accepts file paths or file-like objects
- Real-time URL mode streams audio directly from a provided URL
- Manual chunk mode requires specifying the audio format (e.g., PCM at 16 kHz) and the sample rate
- Multichannel mode is available for speaker-separated transcription
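For manual chunk submission, the application has to cut its PCM stream into pieces before sending them. The sketch below shows the arithmetic, assuming 16-bit mono PCM at 16 kHz and 100 ms chunks. Those values and the helper names are assumptions for illustration; the format the service actually expects should be taken from the session configuration.

```python
from typing import Iterator


def pcm_chunk_size(sample_rate: int, chunk_ms: int,
                   bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Number of bytes in one chunk of raw PCM audio lasting chunk_ms."""
    samples = sample_rate * chunk_ms // 1000
    return samples * bytes_per_sample * channels


def iter_pcm_chunks(pcm: bytes, sample_rate: int = 16000,
                    chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive chunks of pcm; the last one may be shorter."""
    size = pcm_chunk_size(sample_rate, chunk_ms)
    if size <= 0:
        raise ValueError("chunk duration is too short for this sample rate")
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```

At 16 kHz and 2 bytes per sample, a 100 ms chunk is 3200 bytes, so one second of audio yields ten chunks.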
Step 3: Batch Transcription
For pre-recorded audio, call the speech_to_text.convert method with the audio file. The API returns a complete transcription response containing the full text, word-level data (start time, end time, confidence), and character-level timing data.
Key considerations:
- The response includes word-level timestamps and confidence scores
- Additional metadata may include language detection results
- Large files may require extended timeout settings on the client
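The batch call can be sketched as a small wrapper around the client. The `file` and `model_id` keyword arguments and the `"scribe_v1"` model identifier are assumptions about the SDK and should be checked against the installed version. `transcribe_file` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Any


def transcribe_file(client: Any, path: str, model_id: str = "scribe_v1") -> Any:
    """Send one audio file to the batch endpoint and return its response."""
    # The file is opened in binary mode and passed as a file-like object.
    with Path(path).open("rb") as audio:
        # Assumption: convert() accepts these keyword arguments.
        return client.speech_to_text.convert(file=audio, model_id=model_id)
```

Because the client is passed in, the same helper works with a client that was created with a longer timeout for large files.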
Step 4: Real-time Transcription Connection
For live audio, connect to the ScribeRealtime WebSocket endpoint. Configure event handlers for partial transcripts (interim results that may change) and committed transcripts (finalized text segments). The connection handles authentication, audio format negotiation, and session management.
Key considerations:
- Partial transcripts are preliminary and may be revised
- Committed transcripts are final and will not be revised
- Session configuration includes audio format, sample rate, and language settings
- Error events include authentication errors, rate limiting, queue overflow, and session timeouts
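The handler wiring can be sketched independently of the SDK. The example assumes a connection object with an `on(event_name, callback)` method and events delivered as dictionaries with a `"text"` key. Both are assumptions, and the three event name strings are placeholders for whatever identifiers the SDK defines.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Hypothetical event names; substitute the SDK's own identifiers.
PARTIAL = "partial_transcript"
COMMITTED = "committed_transcript"
ERROR = "error"


@dataclass
class TranscriptState:
    committed: List[str] = field(default_factory=list)
    partial: str = ""
    errors: List[Any] = field(default_factory=list)

    @property
    def display_text(self) -> str:
        """Committed segments followed by the current partial, if any."""
        parts = self.committed + ([self.partial] if self.partial else [])
        return " ".join(parts)


def attach_handlers(connection: Any, state: TranscriptState) -> None:
    """Register callbacks that keep state in sync with incoming events."""

    def on_partial(event: Dict[str, Any]) -> None:
        # A partial replaces the previous partial; it may still change.
        state.partial = event.get("text", "").strip()

    def on_committed(event: Dict[str, Any]) -> None:
        # A committed segment is appended and clears the pending partial.
        text = event.get("text", "").strip()
        if text:
            state.committed.append(text)
        state.partial = ""

    def on_error(event: Any) -> None:
        state.errors.append(event)

    connection.on(PARTIAL, on_partial)
    connection.on(COMMITTED, on_committed)
    connection.on(ERROR, on_error)
```

Keeping the partial separate from the committed list means a live display can show `display_text` while only `committed` is stored as the final transcript.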
Step 5: Transcript Processing
Process the transcription results. For batch mode, parse the complete response model containing text, words with timing, and characters with timing. For real-time mode, accumulate committed transcript segments and handle partial updates for live display.
Key considerations:
- Batch results include comprehensive timing data suitable for subtitle generation
- Real-time results stream incrementally and should be accumulated by the application
- Both modes provide language detection information when available
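As an illustration of subtitle generation from batch results, the sketch below groups words into SRT cues. It assumes each word object exposes `text`, `start`, and `end` attributes, with times in seconds. The attribute names, the fixed seven-word grouping, and the helper names are assumptions chosen for the example.

```python
from typing import Any, Iterable


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: Iterable[Any], max_words: int = 7) -> str:
    """Build SRT text with at most max_words words per cue."""
    # Skip whitespace-only entries so cues contain only spoken words.
    spoken = [w for w in words if w.text.strip()]
    cues = []
    for index in range(0, len(spoken), max_words):
        group = spoken[index:index + max_words]
        number = index // max_words + 1
        span = f"{srt_timestamp(group[0].start)} --> {srt_timestamp(group[-1].end)}"
        text = " ".join(w.text.strip() for w in group)
        cues.append(f"{number}\n{span}\n{text}")
    return "\n\n".join(cues) + ("\n" if cues else "")
```

Grouping by word count is the simplest policy; splitting on pauses between `end` and the next `start` would give more natural cues at the cost of a little more logic.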