Workflow:ElevenLabs Python Realtime TTS Streaming

From Leeroopedia
Knowledge Sources
Domains Audio_Generation, Text_to_Speech, Real_Time_Streaming
Last Updated 2026-02-15 12:00 GMT

Overview

End-to-end process for streaming text input to the ElevenLabs WebSocket API and receiving synthesized audio chunks in real time, enabling speech generation from progressively available text.

Description

This workflow implements real-time text-to-speech using the ElevenLabs WebSocket-based streaming input API. Unlike the standard TTS endpoint which requires complete text upfront, this approach accepts text as an iterator of string chunks that are streamed to the server over a persistent WebSocket connection. Audio chunks are returned as they become available, enabling minimal-latency speech output. The implementation includes intelligent text chunking that buffers input at natural sentence boundaries (punctuation marks) to optimize synthesis quality while maintaining low latency.

Usage

Execute this workflow when text is being generated progressively (e.g., from an LLM) and speech output needs to begin before the full text is available. This is ideal for chatbot interfaces, live narration of AI-generated content, real-time translation output, or any scenario where text arrives incrementally and audio playback should start immediately.

Execution Steps

Step 1: Client Initialization

Create an ElevenLabs client instance. The client automatically instantiates a RealtimeTextToSpeechClient that extends the standard TTS client with WebSocket capabilities. The WebSocket base URL is derived from the HTTP base URL by replacing the http/https scheme with ws/wss.

Key considerations:

  • The realtime client extends the standard TextToSpeechClient, so all batch methods remain available
  • WebSocket URL is computed automatically from the configured base URL
  • API key authentication is passed via WebSocket headers
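The scheme swap described above can be sketched as a small helper. This is a minimal illustration, not the SDK's internal code; `ws_base_url` is a hypothetical name:

```python
def ws_base_url(http_base_url: str) -> str:
    """Derive the WebSocket base URL by swapping http/https for ws/wss."""
    if http_base_url.startswith("https://"):
        return "wss://" + http_base_url[len("https://"):]
    if http_base_url.startswith("http://"):
        return "ws://" + http_base_url[len("http://"):]
    raise ValueError(f"unexpected scheme in base URL: {http_base_url}")
```

For the default API host this yields `wss://api.elevenlabs.io`, which the realtime client then uses for all streaming connections.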

Step 2: Text Source Preparation

Prepare a text iterator (generator function or other iterator) that yields string chunks as they become available. The iterator represents the progressive text input that will be streamed to the API.

Key considerations:

  • Text can come from any source: LLM output, file reading, user input
  • Each yield produces a text fragment; the system handles buffering and chunking
  • The iterator signals completion when it is exhausted
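A text source can be any iterator of strings; a plain generator is the simplest form. The fragments below are hard-coded for illustration, standing in for progressively generated text such as LLM tokens:

```python
def text_stream():
    """Yield text fragments as they become available (e.g. LLM output)."""
    yield "Hello there, "
    yield "this sentence arrives "
    yield "one fragment at a time. "

chunks = list(text_stream())
```

When the generator is exhausted, the workflow treats the text input as complete and proceeds to the flush step.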

Step 3: WebSocket Connection and Initial Handshake

Establish a WebSocket connection to the streaming input endpoint with the selected voice ID, model ID, and output format as URL parameters. Send an initial message with an empty text field to configure voice settings and generation parameters (chunk length schedule) for the session.

Key considerations:

  • Connection URL includes voice_id, model_id, and output_format as query parameters
  • Initial message establishes voice settings and generation configuration for the entire session
  • Connection errors are caught and raised as ApiError with appropriate status codes
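The connection URL and the initial configuration message can be sketched as below. The endpoint path follows the documented stream-input API; the helper names, voice-settings values, and chunk schedule are illustrative assumptions, and the API key is assumed to travel in WebSocket headers rather than in this message:

```python
import json
from urllib.parse import urlencode


def stream_input_url(ws_base: str, voice_id: str, model_id: str, output_format: str) -> str:
    """Build the stream-input endpoint URL with query parameters."""
    params = urlencode({"model_id": model_id, "output_format": output_format})
    return f"{ws_base}/v1/text-to-speech/{voice_id}/stream-input?{params}"


def initial_message() -> str:
    """First message of the session: empty text plus configuration."""
    return json.dumps({
        "text": "",  # empty text marks this as the configuration message
        "voice_settings": {"stability": 0.5, "similarity_boost": 0.8},  # example values
        "generation_config": {"chunk_length_schedule": [120, 160, 250, 290]},  # example schedule
    })
```

The configuration sent here applies to the entire session; subsequent messages carry only text.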

Step 4: Streamed Text Input with Text Chunking

Feed text chunks through the text_chunker utility, which buffers input at natural sentence boundaries (periods, commas, question marks, semicolons, dashes, brackets, and spaces). Each buffered chunk is sent as a JSON message over the WebSocket. After each send, the client attempts a non-blocking receive to collect any available audio chunks.

Key considerations:

  • The text_chunker ensures chunks end at natural punctuation boundaries for better synthesis quality
  • Non-blocking receives (with 10ms timeout) allow interleaved audio collection during text streaming
  • Audio chunks arrive as base64-encoded data in JSON messages and are decoded to raw bytes
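A chunker in the spirit described above can be sketched as follows. The exact splitter set and buffering rules in the SDK's text_chunker may differ; this version is illustrative, as is the `chunk_message` helper:

```python
import json

SPLITTERS = (".", ",", "?", "!", ";", ":", "-", "(", ")", "[", "]", "}", " ")


def text_chunker(fragments):
    """Buffer incoming fragments and emit them at natural boundaries."""
    buffer = ""
    for text in fragments:
        if buffer.endswith(SPLITTERS):
            # Buffer already ends at a boundary: flush it.
            yield buffer if buffer.endswith(" ") else buffer + " "
            buffer = text
        elif text.startswith(SPLITTERS):
            # Incoming fragment starts with a boundary: flush through it.
            output = buffer + text[0]
            yield output if output.endswith(" ") else output + " "
            buffer = text[1:]
        else:
            buffer += text
    if buffer:
        yield buffer + " "


def chunk_message(chunk: str) -> str:
    """Each buffered chunk travels as its own JSON message over the socket."""
    return json.dumps({"text": chunk})
```

Each emitted chunk ends at a punctuation mark or space, so the server receives synthesis units that align with natural speech boundaries.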

Step 5: Flush and Collect Remaining Audio

After all text has been sent, send an empty string message to signal the end of input. Then enter a blocking receive loop to collect all remaining audio chunks until the WebSocket connection closes. A normal closure (code 1000) indicates successful completion.

Key considerations:

  • The empty string message triggers generation of any remaining buffered audio on the server
  • The blocking receive loop ensures all audio is collected before the generator completes
  • Non-1000 close codes indicate errors and are raised as ApiError

Execution Diagram

GitHub URL

Workflow Repository