Implementation:Elevenlabs Elevenlabs python SpeechToTextClient Convert
| Knowledge Sources | |
|---|---|
| Domains | Speech_Recognition, NLP |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete tool for batch audio-to-text transcription provided by the elevenlabs-python SDK.
Description
The SpeechToTextClient.convert method sends an audio file (or cloud storage URL) to the ElevenLabs Scribe API for transcription. It uses multipart file upload via the Fern-generated HTTP client and returns a structured response with transcript text, word-level timestamps, and optional speaker diarization labels.
The custom SpeechToTextClient in speech_to_text_custom.py extends the auto-generated client to add a .realtime property for WebSocket-based streaming STT.
Usage
Use this method when you have a complete audio file to transcribe. For real-time streaming transcription, use client.speech_to_text.realtime.connect() instead.
Code Reference
Source Location
- Repository: elevenlabs-python
- File: src/elevenlabs/speech_to_text/client.py
- Lines: L42-66
Signature
def convert(
    self,
    *,
    model_id: SpeechToTextConvertRequestModelId,
    enable_logging: typing.Optional[bool] = None,
    file: typing.Optional[core.File] = OMIT,
    language_code: typing.Optional[str] = OMIT,
    tag_audio_events: typing.Optional[bool] = OMIT,
    num_speakers: typing.Optional[int] = OMIT,
    timestamps_granularity: typing.Optional[SpeechToTextConvertRequestTimestampsGranularity] = OMIT,
    diarize: typing.Optional[bool] = OMIT,
    diarization_threshold: typing.Optional[float] = OMIT,
    additional_formats: typing.Optional[AdditionalFormats] = OMIT,
    file_format: typing.Optional[SpeechToTextConvertRequestFileFormat] = OMIT,
    cloud_storage_url: typing.Optional[str] = OMIT,
    webhook: typing.Optional[bool] = OMIT,
    temperature: typing.Optional[float] = OMIT,
    seed: typing.Optional[int] = OMIT,
    use_multi_channel: typing.Optional[bool] = OMIT,
    keyterms: typing.Optional[typing.List[str]] = OMIT,
    request_options: typing.Optional[RequestOptions] = None,
) -> SpeechToTextConvertResponse:
    """Transcribe an audio or video file."""
Import
from elevenlabs import ElevenLabs
client = ElevenLabs()
# Access via: client.speech_to_text.convert(...)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_id | SpeechToTextConvertRequestModelId | Yes | STT model (e.g., "scribe_v1") |
| file | Optional[core.File] | No* | Audio file to transcribe (*one of file or cloud_storage_url required) |
| cloud_storage_url | Optional[str] | No* | HTTPS URL of audio to transcribe (max 2GB) |
| language_code | Optional[str] | No | ISO language code hint for improved accuracy |
| diarize | Optional[bool] | No | Enable speaker diarization |
| num_speakers | Optional[int] | No | Expected number of speakers (max 32) |
| timestamps_granularity | Optional[SpeechToTextConvertRequestTimestampsGranularity] | No | Timestamp granularity: 'none', 'word', or 'character' |
| tag_audio_events | Optional[bool] | No | Tag audio events like (laughter), (applause) |
| diarization_threshold | Optional[float] | No | Diarization sensitivity threshold |
| use_multi_channel | Optional[bool] | No | Separate transcripts per audio channel |
| temperature | Optional[float] | No | Randomness control (0.0-2.0) |
| seed | Optional[int] | No | Deterministic seed (0-2147483647) |
| webhook | Optional[bool] | No | Process asynchronously, deliver via webhook |
| keyterms | Optional[List[str]] | No | Key terms to improve recognition accuracy |
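The constraints in the table above can be checked on the client side before a request is sent. The following is a minimal sketch, not part of the SDK: `build_convert_kwargs` is a hypothetical helper, "one of file or cloud_storage_url" is read here as exactly one, and the lower bound of 1 for `num_speakers` is an assumption (the table only states the maximum).

```python
from typing import Any, BinaryIO, Dict, Optional


def build_convert_kwargs(
    model_id: str,
    file: Optional[BinaryIO] = None,
    cloud_storage_url: Optional[str] = None,
    **options: Any,
) -> Dict[str, Any]:
    """Validate inputs and return keyword arguments for convert().

    Hypothetical helper for illustration; it is not provided by elevenlabs-python.
    """
    # Assumption: exactly one audio source (file or URL) must be supplied.
    if (file is None) == (cloud_storage_url is None):
        raise ValueError("Provide exactly one of file or cloud_storage_url")

    # Range checks mirror the limits listed in the Inputs table.
    num_speakers = options.get("num_speakers")
    if num_speakers is not None and not 1 <= num_speakers <= 32:
        raise ValueError("num_speakers must be between 1 and 32")

    temperature = options.get("temperature")
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")

    seed = options.get("seed")
    if seed is not None and not 0 <= seed <= 2147483647:
        raise ValueError("seed must be between 0 and 2147483647")

    # Forward only the audio source that was actually provided.
    kwargs: Dict[str, Any] = {"model_id": model_id, **options}
    if file is not None:
        kwargs["file"] = file
    else:
        kwargs["cloud_storage_url"] = cloud_storage_url
    return kwargs
```

The returned dictionary could then be unpacked into the call, for example `client.speech_to_text.convert(**kwargs)`, so that invalid combinations fail locally instead of after an upload.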
Outputs
| Name | Type | Description |
|---|---|---|
| (return) | SpeechToTextConvertResponse | Contains the transcript text, the detected language, and word-level entries (SpeechToTextWordResponseModel[]) carrying timestamps and, when diarization is enabled, speaker labels |
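When diarization is enabled, the word-level entries can be folded into speaker turns. The sketch below is illustrative only: `Word` is a stand-in dataclass, and the attribute names `text` and `speaker_id` are assumptions about the shape of SpeechToTextWordResponseModel. Whitespace-only tokens are skipped and the remaining tokens are joined with single spaces.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Word:
    """Stand-in for a word entry; field names are assumed, not taken from the SDK."""
    text: str
    speaker_id: Optional[str] = None


def speaker_turns(words: Iterable[Word]) -> List[Tuple[Optional[str], str]]:
    """Merge consecutive words from the same speaker into (speaker, text) turns."""
    turns: List[Tuple[Optional[str], List[str]]] = []
    for word in words:
        token = word.text.strip()
        if not token:
            # Skip whitespace-only tokens so they do not start a new turn.
            continue
        if turns and turns[-1][0] == word.speaker_id:
            turns[-1][1].append(token)
        else:
            turns.append((word.speaker_id, [token]))
    return [(speaker, " ".join(tokens)) for speaker, tokens in turns]


demo = [
    Word("Hello", "speaker_0"),
    Word(" ", "speaker_0"),
    Word("there.", "speaker_0"),
    Word("Hi!", "speaker_1"),
    Word("Right.", "speaker_0"),
]
for speaker, text in speaker_turns(demo):
    print(f"{speaker}: {text}")
```

The function only relies on attribute access, so the word entries of a real response could be passed in directly if they expose the same two attributes.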
Usage Examples
Basic Transcription
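from elevenlabs import ElevenLabs

client = ElevenLabs()

# Open the audio in binary mode; the context manager closes it after the upload.
with open("audio.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(
        model_id="scribe_v1",
        file=audio_file,
    )

print(result.text)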
With Diarization and Timestamps
from elevenlabs import ElevenLabs

client = ElevenLabs()

# The context manager closes the file once the upload completes.
with open("meeting.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(
        model_id="scribe_v1",
        file=audio_file,
        diarize=True,
        num_speakers=3,
        timestamps_granularity="word",
        language_code="en",
        tag_audio_events=True,
    )

print(result.text)

# Access word-level details; the speaker label is exposed as speaker_id.
for word in result.words:
    print(f"[{word.start:.2f}s - {word.end:.2f}s] {word.text} (speaker: {word.speaker_id})")
From Cloud Storage URL
from elevenlabs import ElevenLabs

client = ElevenLabs()

result = client.speech_to_text.convert(
    model_id="scribe_v1",
    cloud_storage_url="https://storage.example.com/podcast-episode-42.mp3",
    language_code="en",
    diarize=True,
)

print(result.text)