Implementation:Elevenlabs Elevenlabs python SpeechToTextClient Convert

Knowledge Sources	ElevenLabs Python, ElevenLabs STT API
Domains	Speech_Recognition, NLP
Last Updated	2026-02-15 00:00 GMT

Overview

Concrete tool for batch audio-to-text transcription provided by the elevenlabs-python SDK.

Description

The SpeechToTextClient.convert method sends an audio file (or cloud storage URL) to the ElevenLabs Scribe API for transcription. It uses multipart file upload via the Fern-generated HTTP client and returns a structured response with transcript text, word-level timestamps, and optional speaker diarization labels.

The custom SpeechToTextClient in speech_to_text_custom.py extends the auto-generated client to add a .realtime property for WebSocket-based streaming STT.

Usage

Use this method when you have a complete audio file to transcribe. For real-time streaming transcription, use client.speech_to_text.realtime.connect() instead.

Code Reference

Source Location

Repository: elevenlabs-python
File: src/elevenlabs/speech_to_text/client.py
Lines: L42-66

Signature

def convert(
    self,
    *,
    model_id: SpeechToTextConvertRequestModelId,
    enable_logging: typing.Optional[bool] = None,
    file: typing.Optional[core.File] = OMIT,
    language_code: typing.Optional[str] = OMIT,
    tag_audio_events: typing.Optional[bool] = OMIT,
    num_speakers: typing.Optional[int] = OMIT,
    timestamps_granularity: typing.Optional[SpeechToTextConvertRequestTimestampsGranularity] = OMIT,
    diarize: typing.Optional[bool] = OMIT,
    diarization_threshold: typing.Optional[float] = OMIT,
    additional_formats: typing.Optional[AdditionalFormats] = OMIT,
    file_format: typing.Optional[SpeechToTextConvertRequestFileFormat] = OMIT,
    cloud_storage_url: typing.Optional[str] = OMIT,
    webhook: typing.Optional[bool] = OMIT,
    temperature: typing.Optional[float] = OMIT,
    seed: typing.Optional[int] = OMIT,
    use_multi_channel: typing.Optional[bool] = OMIT,
    keyterms: typing.Optional[typing.List[str]] = OMIT,
    request_options: typing.Optional[RequestOptions] = None,
) -> SpeechToTextConvertResponse:
    """Transcribe an audio or video file."""

Import

from elevenlabs import ElevenLabs

# With no api_key argument, the client reads the ELEVENLABS_API_KEY environment variable
client = ElevenLabs()
# Access via: client.speech_to_text.convert(...)

I/O Contract

Inputs

Name	Type	Required	Description
model_id	SpeechToTextConvertRequestModelId	Yes	STT model (e.g., "scribe_v1")
file	Optional[core.File]	No*	Audio file to transcribe (*one of file or cloud_storage_url required)
cloud_storage_url	Optional[str]	No*	HTTPS URL of audio to transcribe (max 2GB)
language_code	Optional[str]	No	ISO language code hint for improved accuracy
diarize	Optional[bool]	No	Enable speaker diarization
num_speakers	Optional[int]	No	Expected number of speakers (max 32)
timestamps_granularity	Optional[SpeechToTextConvertRequestTimestampsGranularity]	No	Timestamp granularity: 'word' or 'character'
tag_audio_events	Optional[bool]	No	Tag audio events like (laughter), (applause)
diarization_threshold	Optional[float]	No	Diarization sensitivity threshold
use_multi_channel	Optional[bool]	No	Separate transcripts per audio channel
temperature	Optional[float]	No	Randomness control (0.0-2.0)
seed	Optional[int]	No	Deterministic seed (0-2147483647)
webhook	Optional[bool]	No	Process asynchronously, deliver via webhook
keyterms	Optional[List[str]]	No	Key terms to improve recognition accuracy
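
The numeric limits in the table above can be checked locally before a request is sent. The following is a minimal sketch, not part of the SDK: check_convert_params is a hypothetical helper, and the lower bound of 1 for num_speakers is an assumption (the table only states the maximum of 32).

```python
from typing import Any, Dict, Optional


def check_convert_params(
    *,
    num_speakers: Optional[int] = None,
    temperature: Optional[float] = None,
    seed: Optional[int] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate the ranges listed in the Inputs table.

    Returns a dict holding only the parameters that were actually set,
    so it can be unpacked into a convert(...) call.
    """
    # Lower bound of 1 is an assumption; the table only documents max 32.
    if num_speakers is not None and not 1 <= num_speakers <= 32:
        raise ValueError("num_speakers must be between 1 and 32")
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    if seed is not None and not 0 <= seed <= 2147483647:
        raise ValueError("seed must be between 0 and 2147483647")

    candidates = {
        "num_speakers": num_speakers,
        "temperature": temperature,
        "seed": seed,
    }
    # Drop unset values so the SDK defaults apply to them.
    return {name: value for name, value in candidates.items() if value is not None}
```

The returned dict could then be unpacked into the call, for example client.speech_to_text.convert(model_id="scribe_v1", file=audio_file, **check_convert_params(num_speakers=3)).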

Outputs

Name	Type	Description
(return)	SpeechToTextConvertResponse	Contains the transcript text, word-level timestamps (a list of SpeechToTextWordResponseModel), speaker labels, and the detected language
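
When diarization is enabled, the word list can be folded into per-speaker turns. The sketch below works on a stand-in Word dataclass instead of the SDK model, so it runs without network access. The field names text, start, end, and speaker_id are assumptions about the word entries, and speaker_turns is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Word:
    """Stand-in for one word entry of the response (field names are assumed)."""
    text: str
    start: Optional[float]
    end: Optional[float]
    speaker_id: Optional[str]


def speaker_turns(words: Iterable[Word]) -> List[Tuple[Optional[str], str]]:
    """Merge consecutive words from the same speaker into one turn."""
    turns: List[Tuple[Optional[str], List[str]]] = []
    for word in words:
        token = word.text.strip()
        if not token:
            # Skip whitespace-only entries; tokens are re-joined with spaces below.
            continue
        if turns and turns[-1][0] == word.speaker_id:
            turns[-1][1].append(token)
        else:
            turns.append((word.speaker_id, [token]))
    return [(speaker, " ".join(tokens)) for speaker, tokens in turns]


sample = [
    Word("Hello", 0.0, 0.4, "speaker_0"),
    Word(" ", 0.4, 0.5, "speaker_0"),
    Word("there", 0.5, 0.9, "speaker_0"),
    Word("Hi", 1.0, 1.2, "speaker_1"),
]
for speaker, text in speaker_turns(sample):
    print(f"{speaker}: {text}")
```

With a real response, the same function could be applied to result.words, provided the entries expose the attributes assumed here.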

Usage Examples

Basic Transcription

from elevenlabs import ElevenLabs

client = ElevenLabs()

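# Open the file in a context manager so the handle is closed after the upload
with open("audio.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(
        model_id="scribe_v1",
        file=audio_file,
    )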

print(result.text)

With Diarization and Timestamps

from elevenlabs import ElevenLabs

client = ElevenLabs()

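# Open the file in a context manager so the handle is closed after the upload
with open("meeting.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(
        model_id="scribe_v1",
        file=audio_file,
        diarize=True,
        num_speakers=3,
        timestamps_granularity="word",
        language_code="en",
        tag_audio_events=True,
    )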

print(result.text)

# Access word-level details; speaker_id is populated when diarize=True
for word in result.words:
    print(f"[{word.start:.2f}s - {word.end:.2f}s] {word.text} (speaker: {word.speaker_id})")

From Cloud Storage URL

from elevenlabs import ElevenLabs

client = ElevenLabs()

result = client.speech_to_text.convert(
    model_id="scribe_v1",
    cloud_storage_url="https://storage.example.com/podcast-episode-42.mp3",
    language_code="en",
    diarize=True,
)

print(result.text)
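
Because the audio source is passed either as file or as cloud_storage_url, a caller that accepts both local paths and URLs needs to decide which parameter to fill. The following is a minimal sketch under that assumption: classify_source is a hypothetical helper, not part of the SDK, and it rejects non-HTTPS URLs because the Inputs table describes cloud_storage_url as an HTTPS URL.

```python
from typing import Tuple
from urllib.parse import urlparse


def classify_source(source: str) -> Tuple[str, str]:
    """Hypothetical helper: return (parameter_name, value) for an audio source.

    Strings containing "://" are treated as URLs and must use HTTPS;
    everything else is treated as a local file path.
    """
    if "://" in source:
        parsed = urlparse(source)
        if parsed.scheme != "https" or not parsed.netloc:
            raise ValueError("cloud_storage_url must be an HTTPS URL")
        return ("cloud_storage_url", source)
    return ("file", source)


print(classify_source("https://storage.example.com/podcast-episode-42.mp3"))
print(classify_source("meeting.mp3"))
```

For a "file" result the caller would still open the path in binary mode and pass the handle as file; for a "cloud_storage_url" result the string is passed through unchanged.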

Related Pages

Implements Principle

Principle:Elevenlabs_Elevenlabs_python_Batch_Speech_to_Text

Requires Environment

Environment:Elevenlabs_Elevenlabs_python_Python_Httpx
