Principle:Elevenlabs Elevenlabs python Batch Speech to Text
| Knowledge Sources | |
|---|---|
| Domains | Speech_Recognition, NLP |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
A transcription process that converts a complete audio file or cloud-stored audio into text with optional word-level timestamps, speaker diarization, and audio event tagging.
Description
Batch Speech-to-Text (also called offline transcription) processes a complete audio recording and returns a structured transcript. Unlike real-time transcription, batch STT has access to the entire audio context. This generally improves accuracy and makes speaker diarization (identifying who said what) and precise word-level timestamps practical.
The ElevenLabs Scribe model supports:
- Multiple input sources: file upload or cloud storage URL (up to 2GB)
- Speaker diarization with configurable threshold and speaker count
- Word-level and character-level timestamps
- Audio event tagging (laughter, applause, etc.)
- Multi-channel transcription for stereo/multi-track audio
- Asynchronous processing via webhooks for large files
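A request that exercises several of these options might look like the sketch below. It assumes the official `elevenlabs` Python SDK and its `client.speech_to_text.convert` method. The parameter names (`model_id`, `diarize`, `num_speakers`, `timestamps_granularity`, `tag_audio_events`) and the `scribe_v1` model identifier are recalled from the public API and should be checked against the current reference. `build_stt_options` and `transcribe_file` are hypothetical helper names introduced for illustration.

```python
from typing import Any, Dict, Optional


def build_stt_options(
    diarize: bool = True,
    num_speakers: Optional[int] = None,
    timestamps_granularity: str = "word",
    tag_audio_events: bool = True,
) -> Dict[str, Any]:
    """Collect batch STT options as keyword arguments (names are assumptions)."""
    if timestamps_granularity not in {"none", "word", "character"}:
        raise ValueError(f"unsupported granularity: {timestamps_granularity!r}")
    options: Dict[str, Any] = {
        "model_id": "scribe_v1",  # assumed Scribe model identifier
        "diarize": diarize,
        "timestamps_granularity": timestamps_granularity,
        "tag_audio_events": tag_audio_events,
    }
    # Only send a speaker count when the caller actually knows it.
    if num_speakers is not None:
        options["num_speakers"] = num_speakers
    return options


def transcribe_file(path: str, api_key: str, **overrides: Any) -> Any:
    """Upload a local file for batch transcription (needs the elevenlabs package)."""
    from elevenlabs.client import ElevenLabs  # lazy import of the third-party SDK

    client = ElevenLabs(api_key=api_key)
    with open(path, "rb") as audio_file:
        return client.speech_to_text.convert(
            file=audio_file, **build_stt_options(**overrides)
        )
```

Keeping option assembly separate from the network call makes the request parameters easy to inspect and unit-test without an API key.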
Usage
Use this principle when you have a complete audio file to transcribe and need the highest accuracy. Ideal for podcast transcription, meeting minutes, content indexing, subtitle generation, and any scenario where the full audio is available before transcription begins.
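For the subtitle-generation case, word-level timestamps can be grouped into cues and written in SRT format. The following is a minimal sketch, assuming each word is a dict with `text`, `start`, and `end` (in seconds). That shape is an assumption made for illustration, not the documented response schema, and `words_to_srt` is a hypothetical helper.

```python
from typing import Dict, List, Union

Word = Dict[str, Union[str, float]]


def format_srt_time(seconds: float) -> str:
    """Render seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: List[Word], max_words: int = 7) -> str:
    """Group consecutive words into fixed-size cues and emit SRT text."""
    cues = []
    for index in range(0, len(words), max_words):
        chunk = words[index:index + max_words]
        start = format_srt_time(float(chunk[0]["start"]))
        end = format_srt_time(float(chunk[-1]["end"]))
        text = " ".join(str(word["text"]) for word in chunk)
        # Each cue: sequence number, time range, text, then a blank line.
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

Fixed-size chunks keep the sketch short; a production subtitler would more likely break cues at pauses, punctuation, or speaker changes.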
Theoretical Basis
Many modern speech-to-text systems use encoder-decoder transformer architectures. In pseudocode, such a pipeline looks like this:
# Abstract STT pipeline (pseudocode; function names are placeholders)
audio_features = audio_encoder(audio_waveform)  # waveform -> mel spectrogram -> features
tokens = text_decoder(audio_features)  # autoregressive text generation
transcript = detokenize(tokens)
# Post-processing
timestamps = forced_alignment(audio_waveform, transcript)  # word-level timing
speakers = diarization_model(audio_waveform)  # speaker segmentation
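The two post-processing outputs still have to be merged to answer "who said what". One common approach, sketched here as an assumption rather than a description of how Scribe works internally, labels each word with the speaker segment it overlaps most. The tuple layouts and the `assign_speakers` name are hypothetical.

```python
from typing import List, Optional, Tuple

WordSpan = Tuple[str, float, float]     # (text, start_s, end_s)
SpeakerSpan = Tuple[str, float, float]  # (speaker label, start_s, end_s)


def assign_speakers(
    words: List[WordSpan], segments: List[SpeakerSpan]
) -> List[Tuple[str, Optional[str]]]:
    """Label each word with the speaker whose segment overlaps it most."""
    labelled = []
    for text, w_start, w_end in words:
        best_speaker, best_overlap = None, 0.0
        for speaker, s_start, s_end in segments:
            # Length of the intersection of the two intervals (negative if disjoint).
            overlap = min(w_end, s_end) - max(w_start, s_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labelled.append((text, best_speaker))  # None when no segment overlaps
    return labelled
```

The nested loop is quadratic in the worst case, which is fine for illustration. Since both lists are normally sorted by time, a two-pointer sweep would bring this down to linear time.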
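Batch processing allows the model to use bidirectional context: audio that comes later in the recording can inform the transcription of earlier speech. A streaming system cannot do this fully, because it must emit text before the later audio has arrived.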
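The difference in available context can be pictured as a visibility mask over audio frames. This is a toy illustration, not tied to any specific model, and `visibility_mask` is a hypothetical helper. Real streaming systems often allow a small, bounded look-ahead instead of the strictly causal mask shown here.

```python
from typing import List


def visibility_mask(num_frames: int, bidirectional: bool) -> List[List[int]]:
    """Row i marks which frames are visible when encoding frame i (1 = visible)."""
    return [
        [1 if bidirectional or j <= i else 0 for j in range(num_frames)]
        for i in range(num_frames)
    ]


# Streaming: frame i sees only frames 0..i (lower-triangular mask).
streaming = visibility_mask(3, bidirectional=False)
# Batch: every frame sees the whole recording (all-ones mask).
batch = visibility_mask(3, bidirectional=True)
```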