Implementation:Elevenlabs Elevenlabs python SpeechToTextClient Convert

From Leeroopedia
Knowledge Sources
Domains Speech_Recognition, NLP
Last Updated 2026-02-15 00:00 GMT

Overview

Concrete tool for batch audio-to-text transcription provided by the elevenlabs-python SDK.

Description

The SpeechToTextClient.convert method sends an audio file (or cloud storage URL) to the ElevenLabs Scribe API for transcription. It uses multipart file upload via the Fern-generated HTTP client and returns a structured response with transcript text, word-level timestamps, and optional speaker diarization labels.

The custom SpeechToTextClient in speech_to_text_custom.py extends the auto-generated client to add a .realtime property for WebSocket-based streaming STT.

Usage

Use this method when you have a complete audio file to transcribe. For real-time streaming transcription, use client.speech_to_text.realtime.connect() instead.

Code Reference

Source Location

  • Repository: elevenlabs-python
  • File: src/elevenlabs/speech_to_text/client.py
  • Lines: L42-66

Signature

def convert(
    self,
    *,
    model_id: SpeechToTextConvertRequestModelId,
    enable_logging: typing.Optional[bool] = None,
    file: typing.Optional[core.File] = OMIT,
    language_code: typing.Optional[str] = OMIT,
    tag_audio_events: typing.Optional[bool] = OMIT,
    num_speakers: typing.Optional[int] = OMIT,
    timestamps_granularity: typing.Optional[SpeechToTextConvertRequestTimestampsGranularity] = OMIT,
    diarize: typing.Optional[bool] = OMIT,
    diarization_threshold: typing.Optional[float] = OMIT,
    additional_formats: typing.Optional[AdditionalFormats] = OMIT,
    file_format: typing.Optional[SpeechToTextConvertRequestFileFormat] = OMIT,
    cloud_storage_url: typing.Optional[str] = OMIT,
    webhook: typing.Optional[bool] = OMIT,
    temperature: typing.Optional[float] = OMIT,
    seed: typing.Optional[int] = OMIT,
    use_multi_channel: typing.Optional[bool] = OMIT,
    keyterms: typing.Optional[typing.List[str]] = OMIT,
    request_options: typing.Optional[RequestOptions] = None,
) -> SpeechToTextConvertResponse:
    """Transcribe an audio or video file."""

Import

from elevenlabs import ElevenLabs

client = ElevenLabs()
# Access via: client.speech_to_text.convert(...)
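
The signature above defaults most optional parameters to OMIT rather than None, which suggests the SDK distinguishes "not provided" from an explicitly passed value. Assuming that is the case, a wrapper that takes None-able settings (for example from a config file) may want to forward only the options that were actually set. The helper below is a minimal sketch of that idea; `build_convert_kwargs` is a hypothetical name, not part of the SDK.

```python
from typing import Any, Dict, Optional


def build_convert_kwargs(model_id: str, **options: Optional[Any]) -> Dict[str, Any]:
    """Hypothetical helper: keep only the options the caller actually set.

    Options left as None are dropped so the SDK's own defaults apply.
    Explicit False or 0 values are kept, because they are real settings.
    """
    kwargs: Dict[str, Any] = {"model_id": model_id}
    for name, value in options.items():
        if value is not None:
            kwargs[name] = value
    return kwargs


kwargs = build_convert_kwargs(
    "scribe_v1",
    diarize=True,
    num_speakers=None,  # unset, so it is not forwarded
    language_code="en",
)
print(kwargs)
```

The resulting dictionary could then be passed as client.speech_to_text.convert(file=..., **kwargs).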

I/O Contract

Inputs

Name Type Required Description
model_id SpeechToTextConvertRequestModelId Yes STT model (e.g., "scribe_v1")
file Optional[core.File] No* Audio file to transcribe (*one of file or cloud_storage_url required)
cloud_storage_url Optional[str] No* HTTPS URL of audio to transcribe (max 2GB)
language_code Optional[str] No ISO language code hint for improved accuracy
diarize Optional[bool] No Enable speaker diarization
num_speakers Optional[int] No Expected number of speakers (max 32)
timestamps_granularity Optional[SpeechToTextConvertRequestTimestampsGranularity] No Timestamp granularity: 'word' or 'character'
tag_audio_events Optional[bool] No Tag audio events like (laughter), (applause)
diarization_threshold Optional[float] No Diarization sensitivity threshold
use_multi_channel Optional[bool] No Separate transcripts per audio channel
temperature Optional[float] No Randomness control (0.0-2.0)
seed Optional[int] No Deterministic seed (0-2147483647)
webhook Optional[bool] No Process asynchronously, deliver via webhook
keyterms Optional[List[str]] No Key terms to improve recognition accuracy
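
The constraints listed in the table can be checked locally before uploading anything. The sketch below is a hypothetical pre-flight check, not part of the SDK: `validate_convert_args` is an illustrative name, the limits are copied from the table, and two points are assumptions (that passing both file and cloud_storage_url is an error, and that num_speakers must be at least 1).

```python
from typing import Any, Optional
from urllib.parse import urlparse

MAX_SPEAKERS = 32        # from the table: num_speakers max 32
MAX_TEMPERATURE = 2.0    # from the table: temperature 0.0-2.0
MAX_SEED = 2147483647    # from the table: seed 0-2147483647


def validate_convert_args(
    file: Optional[Any] = None,
    cloud_storage_url: Optional[str] = None,
    num_speakers: Optional[int] = None,
    temperature: Optional[float] = None,
    seed: Optional[int] = None,
) -> None:
    """Raise ValueError if the arguments violate the documented constraints."""
    # Assumption: exactly one audio source must be given (both is treated as an error).
    if (file is None) == (cloud_storage_url is None):
        raise ValueError("Provide exactly one of file or cloud_storage_url")
    if cloud_storage_url is not None and urlparse(cloud_storage_url).scheme != "https":
        raise ValueError("cloud_storage_url must be an HTTPS URL")
    # Assumption: a lower bound of 1 speaker; the table only states the maximum.
    if num_speakers is not None and not 1 <= num_speakers <= MAX_SPEAKERS:
        raise ValueError(f"num_speakers must be between 1 and {MAX_SPEAKERS}")
    if temperature is not None and not 0.0 <= temperature <= MAX_TEMPERATURE:
        raise ValueError(f"temperature must be between 0.0 and {MAX_TEMPERATURE}")
    if seed is not None and not 0 <= seed <= MAX_SEED:
        raise ValueError(f"seed must be between 0 and {MAX_SEED}")


# Passes silently: one source, HTTPS URL, all values in range.
validate_convert_args(
    cloud_storage_url="https://storage.example.com/podcast-episode-42.mp3",
    num_speakers=3,
    temperature=0.5,
    seed=42,
)
```

The API remains the authority on what it accepts; a check like this only surfaces obvious mistakes earlier.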

Outputs

Name Type Description
(return) SpeechToTextConvertResponse Contains transcript text, word-level timestamps (SpeechToTextWordResponseModel[]), speaker labels, language detection
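
When diarization is enabled, the per-word speaker labels can be folded into speaker turns for a more readable transcript. The sketch below illustrates one way to do that. It uses a stand-in `Word` dataclass instead of the SDK's model and assumes each entry exposes `text` and `speaker_id` attributes; both names should be checked against the SDK version in use.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    """Stand-in for a word entry in the response; not the SDK's own model."""
    text: str
    speaker_id: Optional[str] = None


def group_by_speaker(words: List[Word]) -> List[Tuple[Optional[str], str]]:
    """Merge consecutive words from the same speaker into one turn."""
    turns: List[Tuple[Optional[str], List[str]]] = []
    for word in words:
        token = word.text.strip()
        if not token:
            # Skip whitespace-only entries so joining does not double spaces.
            continue
        if turns and turns[-1][0] == word.speaker_id:
            turns[-1][1].append(token)
        else:
            turns.append((word.speaker_id, [token]))
    return [(speaker, " ".join(tokens)) for speaker, tokens in turns]


sample = [
    Word("Hello", "speaker_0"),
    Word(" ", "speaker_0"),
    Word("there", "speaker_0"),
    Word("Hi", "speaker_1"),
]
for speaker, text in group_by_speaker(sample):
    print(f"{speaker}: {text}")
```

With a real response, the same function could be called on result.words, provided the entries carry the attributes assumed here.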

Usage Examples

Basic Transcription

from elevenlabs import ElevenLabs

client = ElevenLabs()

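# Open the file in a context manager so the handle is closed after the upload
with open("audio.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(
        model_id="scribe_v1",
        file=audio_file,
    )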

print(result.text)

With Diarization and Timestamps

from elevenlabs import ElevenLabs

client = ElevenLabs()

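# Open the file in a context manager so the handle is closed after the upload
with open("meeting.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(
        model_id="scribe_v1",
        file=audio_file,
        diarize=True,
        num_speakers=3,
        timestamps_granularity="word",
        language_code="en",
        tag_audio_events=True,
    )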

print(result.text)

# Access word-level details
for word in result.words:
    # The word model exposes the speaker label as speaker_id
    print(f"[{word.start:.2f}s - {word.end:.2f}s] {word.text} (speaker: {word.speaker_id})")

From Cloud Storage URL

from elevenlabs import ElevenLabs

client = ElevenLabs()

result = client.speech_to_text.convert(
    model_id="scribe_v1",
    cloud_storage_url="https://storage.example.com/podcast-episode-42.mp3",
    language_code="en",
    diarize=True,
)

print(result.text)
