Implementation:Elevenlabs Elevenlabs python SpeechToTextWordResponseModel
| Field | Value |
|---|---|
| source | Elevenlabs_Elevenlabs_python |
| domains | Speech-to-Text, Transcription, Timestamps |
| last_updated | 2026-02-15 |
Overview
Description
SpeechToTextWordResponseModel is a Pydantic model representing word-level detail of a transcription with timing information. Each instance corresponds to a single word or audio event (such as laughter or footsteps) that was transcribed, including its start and end timestamps, type classification, speaker identification, confidence score (logprob), and optional character-level breakdown. This model is auto-generated by Fern from the ElevenLabs API definition and extends UncheckedBaseModel.
Usage
This model is returned as part of speech-to-text transcription responses from the ElevenLabs API. It provides granular word-level timing, speaker, and confidence data, which is useful for applications that need precise transcript alignment, per-speaker attribution of words, or filtering by confidence.
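As an illustration of confidence filtering, the sketch below flags words whose predicted probability falls below a threshold. To stay runnable without the SDK installed, it uses a hypothetical `Word` dataclass that mirrors only the `text`, `logprob`, `start`, and `end` fields; the helper name `low_confidence_words` and the 0.5 threshold are assumptions made for this example, not part of the library.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """Hypothetical stand-in mirroring a subset of the model's fields."""
    text: str
    logprob: float
    start: Optional[float] = None
    end: Optional[float] = None


def low_confidence_words(words: List[Word], min_probability: float = 0.5) -> List[Word]:
    """Return words whose probability, exp(logprob), is below min_probability."""
    return [w for w in words if math.exp(w.logprob) < min_probability]


words = [
    Word("hello", logprob=-0.05, start=0.5, end=0.9),
    Word("wurld", logprob=-1.6, start=1.0, end=1.4),
]
flagged = low_confidence_words(words)
print([w.text for w in flagged])
```

The same function should work on real model instances, since it only reads the `text` and `logprob` attributes.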
Code Reference
Source Location
src/elevenlabs/types/speech_to_text_word_response_model.py
Class Signature
class SpeechToTextWordResponseModel(UncheckedBaseModel):
    """
    Word-level detail of the transcription with timing information.
    """
    ...
Import Statement
from elevenlabs.types import SpeechToTextWordResponseModel
I/O Contract
| Field | Type | Required | Description |
|---|---|---|---|
| text | str | Yes | The word or sound that was transcribed. |
| start | Optional[float] | No | The start time of the word or sound in seconds. |
| end | Optional[float] | No | The end time of the word or sound in seconds. |
| type | SpeechToTextWordResponseModelType | Yes | The type of the word or sound. 'audio_event' is used for non-word sounds like laughter or footsteps. |
| speaker_id | Optional[str] | No | Unique identifier for the speaker of this word. |
| logprob | float | Yes | The log of the probability with which this word was predicted. Logprobs are in range [-infinity, 0]; higher logprobs indicate higher confidence. |
| characters | Optional[List[SpeechToTextCharacterResponseModel]] | No | The characters that make up the word and their timing information. |
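Because `start` and `end` are optional and `type` distinguishes spoken words from audio events, consumers generally need to guard against missing timestamps and skip non-word entries. The following is a minimal sketch of that pattern, again using a hypothetical `Word` dataclass in place of the real model; `spoken_duration` is an illustrative helper, and only the `"word"` and `"audio_event"` type values mentioned on this page are assumed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """Hypothetical stand-in mirroring a subset of the model's fields."""
    text: str
    type: str
    logprob: float
    start: Optional[float] = None
    end: Optional[float] = None


def spoken_duration(words: List[Word]) -> float:
    """Sum the durations of entries typed "word" that carry both timestamps."""
    total = 0.0
    for w in words:
        if w.type != "word":
            continue  # skip audio events such as laughter
        if w.start is None or w.end is None:
            continue  # timing is optional, so it may be absent
        total += w.end - w.start
    return total


items = [
    Word("hello", "word", -0.1, start=0.5, end=0.9),
    Word("(laughter)", "audio_event", -0.3, start=1.0, end=2.0),
    Word("there", "word", -0.2),  # no timing information
]
total = spoken_duration(items)
print(f"Spoken time: {total:.2f}s")
```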
Usage Examples
import math

from elevenlabs.types import SpeechToTextWordResponseModel

# Typically received as part of a transcription response
word = SpeechToTextWordResponseModel(
    text="hello",
    start=0.5,
    end=0.9,
    type="word",
    speaker_id="speaker_1",
    logprob=-0.12,
)

# Check confidence level
confidence = math.exp(word.logprob)
print(f"Word: '{word.text}', Confidence: {confidence:.2%}")

# Access timing information
if word.start is not None and word.end is not None:
    duration = word.end - word.start
    print(f"Duration: {duration:.3f}s")
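The `speaker_id` field can be used to collapse a word sequence into speaker turns. The sketch below groups consecutive words that share a `speaker_id`, using `itertools.groupby` from the standard library. The `Word` dataclass is a hypothetical stand-in so the snippet runs without the SDK, and `speaker_turns` is an illustrative helper rather than a library function.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Optional, Tuple


@dataclass
class Word:
    """Hypothetical stand-in mirroring a subset of the model's fields."""
    text: str
    speaker_id: Optional[str] = None


def speaker_turns(words: List[Word]) -> List[Tuple[Optional[str], str]]:
    """Merge runs of consecutive words with the same speaker_id into one turn."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w.speaker_id):
        turns.append((speaker, " ".join(w.text for w in group)))
    return turns


words = [
    Word("hi", "speaker_1"),
    Word("there", "speaker_1"),
    Word("hello", "speaker_2"),
]
turns = speaker_turns(words)
for speaker, text in turns:
    print(f"{speaker}: {text}")
```

Note that `groupby` only merges adjacent items, so a speaker who talks again later produces a separate turn, which is usually the desired behavior for a transcript.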