Implementation:Elevenlabs Elevenlabs python SpeechToTextWordResponseModel

Field	Value
source	Elevenlabs_Elevenlabs_python
domains	Speech-to-Text, Transcription, Timestamps
last_updated	2026-02-15

Overview

Description

SpeechToTextWordResponseModel is a Pydantic model representing word-level detail of a transcription with timing information. Each instance corresponds to a single transcribed word or audio event (such as laughter or footsteps) and carries its start and end timestamps, type classification, speaker identifier, log-probability confidence score (logprob), and an optional character-level breakdown. The model is auto-generated by Fern from the ElevenLabs API definition and extends UncheckedBaseModel.

Usage

This model is returned as part of speech-to-text transcription responses from the ElevenLabs API. It provides granular word-level timing and confidence data, which is useful for applications requiring precise transcript alignment, speaker diarization, or confidence filtering.
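
As an illustration of the confidence filtering mentioned above, the following is a minimal sketch rather than library code. To keep it runnable without the SDK installed, it uses a hypothetical `Word` dataclass that mirrors a few fields of SpeechToTextWordResponseModel; the helper name `filter_confident` and the 0.5 threshold are assumptions chosen for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """Hypothetical stand-in mirroring fields of SpeechToTextWordResponseModel."""
    text: str
    type: str
    logprob: float
    start: Optional[float] = None
    end: Optional[float] = None
    speaker_id: Optional[str] = None


def filter_confident(words: List[Word], min_probability: float = 0.5) -> List[Word]:
    """Keep entries whose probability, exp(logprob), meets the threshold."""
    return [w for w in words if math.exp(w.logprob) >= min_probability]


words = [
    Word(text="hello", type="word", logprob=-0.05),
    Word(text="wrld", type="word", logprob=-2.3),
    Word(text="again", type="word", logprob=-0.4),
]

kept = filter_confident(words, min_probability=0.5)
print([w.text for w in kept])
```

Because the helper only reads `logprob`, the same function should work on real model instances, assuming they expose that attribute as listed in the I/O Contract.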

Code Reference

Source Location

src/elevenlabs/types/speech_to_text_word_response_model.py

Class Signature

class SpeechToTextWordResponseModel(UncheckedBaseModel):
    """
    Word-level detail of the transcription with timing information.
    """
    ...

Import Statement

from elevenlabs.types import SpeechToTextWordResponseModel

I/O Contract

Field	Type	Required	Description
text	`str`	Yes	The word or sound that was transcribed.
start	`Optional[float]`	No	The start time of the word or sound in seconds.
end	`Optional[float]`	No	The end time of the word or sound in seconds.
type	`SpeechToTextWordResponseModelType`	Yes	The type of the word or sound. 'audio_event' is used for non-word sounds like laughter or footsteps.
speaker_id	`Optional[str]`	No	Unique identifier for the speaker of this word.
logprob	`float`	Yes	The log of the probability with which this word was predicted. Logprobs lie in the range (-infinity, 0]; values closer to 0 indicate higher confidence.
characters	`Optional[List[SpeechToTextCharacterResponseModel]]`	No	The characters that make up the word and their timing information.
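
The fields above are enough to rebuild speaker turns from a flat word list. The sketch below shows one possible approach: consecutive entries with the same speaker_id are merged into a turn, and missing start or end values are skipped because both fields are optional. The `Word` dataclass is a hypothetical stand-in for the real model, `speaker_turns` is an illustrative helper name, and joining texts with a single space is a simplifying assumption.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Any, Dict, List, Optional


@dataclass
class Word:
    """Hypothetical stand-in mirroring fields of SpeechToTextWordResponseModel."""
    text: str
    type: str
    logprob: float
    start: Optional[float] = None
    end: Optional[float] = None
    speaker_id: Optional[str] = None


def speaker_turns(words: List[Word]) -> List[Dict[str, Any]]:
    """Merge consecutive entries that share a speaker_id into one turn."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w.speaker_id):
        items = list(group)
        # start and end are optional, so ignore entries without timing.
        starts = [w.start for w in items if w.start is not None]
        ends = [w.end for w in items if w.end is not None]
        turns.append({
            "speaker_id": speaker,
            "text": " ".join(w.text for w in items),
            "start": min(starts) if starts else None,
            "end": max(ends) if ends else None,
        })
    return turns


words = [
    Word("hi", "word", -0.1, 0.0, 0.3, "speaker_0"),
    Word("there", "word", -0.2, 0.35, 0.7, "speaker_0"),
    Word("hello", "word", -0.1, 1.0, 1.4, "speaker_1"),
]

turns = speaker_turns(words)
for turn in turns:
    print(turn)
```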

Usage Examples

import math

from elevenlabs.types import SpeechToTextWordResponseModel

# Typically received as part of a transcription response
word = SpeechToTextWordResponseModel(
    text="hello",
    start=0.5,
    end=0.9,
    type="word",
    speaker_id="speaker_1",
    logprob=-0.12,
)

# Convert the log-probability back to a probability to gauge confidence
confidence = math.exp(word.logprob)
print(f"Word: '{word.text}', Confidence: {confidence:.2%}")

# Access timing information
if word.start is not None and word.end is not None:
    duration = word.end - word.start
    print(f"Duration: {duration:.3f}s")
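
For transcript alignment, entries can also be rendered as timestamped lines. This is a rough sketch under stated assumptions: `Word` is a hypothetical stand-in for the real model, `format_entry` is an illustrative helper, and the "(audio event)" suffix is an arbitrary display choice rather than anything the API produces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    """Hypothetical stand-in mirroring fields of SpeechToTextWordResponseModel."""
    text: str
    type: str
    logprob: float
    start: Optional[float] = None
    end: Optional[float] = None
    speaker_id: Optional[str] = None


def format_entry(w: Word) -> str:
    """Render one entry, prefixing its time span when both bounds are present."""
    # Mark non-word sounds so they stand out from spoken words.
    label = f"{w.text} (audio event)" if w.type == "audio_event" else w.text
    if w.start is None or w.end is None:
        return label
    return f"[{w.start:.2f}-{w.end:.2f}] {label}"


timed = format_entry(Word("hello", "word", -0.12, 0.5, 0.9))
untimed = format_entry(Word("laughter", "audio_event", -0.3))
print(timed)
print(untimed)
```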
