Principle: ElevenLabs Python Realtime Speech to Text
| Knowledge Sources | |
|---|---|
| Domains | Speech_Recognition, Streaming, WebSocket |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
A streaming transcription technique that converts audio input to text in real time over a WebSocket connection, providing both partial (interim) and committed (final) transcript results.
Description
Realtime Speech-to-Text enables live transcription of audio as it is captured. Unlike batch STT, which requires the complete audio file before processing can begin, realtime STT processes audio chunks as they arrive and provides two types of results:
- Partial transcripts: Interim results that update as more audio arrives (useful for live captions)
- Committed transcripts: Finalized results after a speech segment ends (triggered by VAD or manual commit)
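The two result types might be handled with a small dispatcher. The event names and JSON fields below (`type`, `text`) are assumptions chosen to match the terminology in this document, not the documented ElevenLabs wire format; check the API reference for the exact schema.

```python
import json


def handle_stt_event(raw: str, on_partial, on_committed) -> None:
    """Route one server message to the matching callback.

    Assumes each message is JSON with a "type" of either
    "partial_transcript" or "committed_transcript" and a "text"
    field (illustrative schema, not the official one).
    """
    event = json.loads(raw)
    if event["type"] == "partial_transcript":
        on_partial(event["text"])       # interim: may still change
    elif event["type"] == "committed_transcript":
        on_committed(event["text"])     # final: safe to persist
```

A caption UI would typically overwrite the display on each partial callback and append on each committed callback.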
The system supports two input modes:
- Manual audio chunks: Client sends base64-encoded PCM audio directly
- URL streaming: Client provides an audio URL and ffmpeg handles conversion and streaming
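In the manual-chunk mode, each raw PCM buffer has to be base64-encoded before it is sent over the WebSocket. A minimal sketch; the `audio_chunk` field name is a hypothetical stand-in for whatever the API actually expects:

```python
import base64
import json


def pcm_chunk_message(pcm_bytes: bytes) -> str:
    """Wrap raw PCM audio in a JSON WebSocket message.

    The "audio_chunk" field name is illustrative; consult the
    API reference for the real message schema.
    """
    encoded = base64.b64encode(pcm_bytes).decode("ascii")
    return json.dumps({"audio_chunk": encoded})
```

Base64 inflates the payload by roughly a third, which is the usual cost of sending binary audio through a text-frame protocol.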
The commit strategy can be VAD (voice activity detection; segments are committed automatically when silence is detected) or MANUAL (the client decides when to commit).
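The commit strategy would typically be chosen when the session is opened, with an explicit commit message used only under MANUAL. The field names and values below are assumptions for illustration, not the documented configuration schema:

```python
import json


def session_config(strategy: str = "vad") -> str:
    """Build an opening configuration message (assumed schema).

    strategy: "vad" to commit automatically on detected silence,
    "manual" if the client will send explicit commit signals.
    """
    if strategy not in ("vad", "manual"):
        raise ValueError(f"unknown commit strategy: {strategy}")
    return json.dumps({"commit_strategy": strategy})


def commit_signal() -> str:
    """Explicit commit message for the MANUAL strategy (assumed schema)."""
    return json.dumps({"type": "commit"})
```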
Usage
Use this principle when you need live transcription of ongoing audio, such as live captioning, real-time translation, voice command detection, live meeting transcription, or streaming audio analysis.
Theoretical Basis
Realtime STT uses an incremental decoding approach:
# Abstract streaming STT pipeline
ws = connect(stt_endpoint, model_id, audio_format)
for audio_chunk in audio_source:
    ws.send(audio_chunk)
    # Server emits partial_transcript events as audio is processed
    # Server emits committed_transcript when a speech segment ends (VAD)

# Manual commit (if using the MANUAL strategy):
ws.send(commit_signal)
# Server emits committed_transcript with the final result
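The pipeline above can be exercised end to end with a stand-in for the server. The `FakeSTTSocket` below is purely illustrative: it emits a growing partial transcript per chunk and treats an empty chunk as silence that triggers a VAD-style commit, so the partial/committed flow can be observed without a live connection.

```python
class FakeSTTSocket:
    """Stand-in for an STT WebSocket: one partial transcript per speech
    chunk, and a committed transcript when silence (an empty chunk)
    arrives, mimicking a VAD commit."""

    def __init__(self):
        self._words = []
        self.events = []

    def send(self, chunk: str) -> None:
        if chunk:                      # speech: extend the running hypothesis
            self._words.append(chunk)
            self.events.append(("partial_transcript", " ".join(self._words)))
        elif self._words:              # silence: VAD commits the segment
            self.events.append(("committed_transcript", " ".join(self._words)))
            self._words = []


def run_pipeline(chunks):
    """Drive the abstract pipeline over a sequence of audio chunks."""
    ws = FakeSTTSocket()
    for chunk in chunks:
        ws.send(chunk)
    return ws.events
```

For example, `run_pipeline(["hello", "world", ""])` yields two partial transcripts ("hello", then "hello world") followed by one committed transcript ("hello world"), mirroring the event sequence a real VAD-driven session would produce.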
Key tradeoffs:
- Partial transcripts are fast but may change as more context arrives
- Committed transcripts are final and more accurate but have higher latency
- VAD commit is automatic but adds silence detection latency
- Manual commit gives client control but requires explicit segmentation