Heuristic: ElevenLabs Python Audio Buffer Sizes
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Audio_IO, Conversational_AI |
| Last Updated | 2026-02-15 12:00 GMT |
Overview
Optimized audio buffer sizes for real-time conversational AI: 4000 samples (250ms) for input, 1000 samples (62.5ms) for output at 16kHz.
Description
The `DefaultAudioInterface` uses asymmetric buffer sizes for microphone input and speaker output streams. Input uses 4000 frames per buffer (250ms at 16kHz) to capture enough speech context for the VAD and ASR systems. Output uses 1000 frames per buffer (62.5ms at 16kHz) for lower latency audio playback, enabling faster agent interruption handling. Both streams use 16-bit PCM mono at 16kHz, which is the required format for the ConvAI WebSocket protocol.
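The sample counts translate directly into durations. A minimal sketch of the arithmetic (the helper function is hypothetical, not part of the SDK):

```python
SAMPLE_RATE_HZ = 16_000  # required by the ConvAI WebSocket protocol


def buffer_duration_ms(frames: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Duration of a buffer of mono samples, in milliseconds."""
    return frames / sample_rate_hz * 1000.0


print(buffer_duration_ms(4000))  # input buffer: 250.0 ms
print(buffer_duration_ms(1000))  # output buffer: 62.5 ms
```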
Usage
Use these buffer sizes when implementing a custom AudioInterface for ElevenLabs Conversational AI. The input buffer should be large enough for meaningful speech chunks (recommended 250ms) while the output buffer should be small enough for responsive interruption handling.
The Insight (Rule of Thumb)
- Input buffer: 4000 samples = 250ms at 16kHz. Provides sufficient audio context for voice activity detection.
- Output buffer: 1000 samples = 62.5ms at 16kHz. Enables fast interruption response by minimizing buffered audio.
- Audio format: 16-bit PCM mono, 16kHz sample rate. This is a hard requirement of the ConvAI protocol.
- Trade-off: Larger input buffers improve recognition accuracy but increase perceived latency. Smaller output buffers improve interruption responsiveness but may cause audio stuttering on slow systems.
- Output queue: Audio output is buffered in a thread-safe queue with a 250ms poll timeout, allowing graceful shutdown detection.
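The bullet points above can be sanity-checked numerically. The byte math below assumes 16-bit (2-byte) mono samples; the helper names are illustrative:

```python
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # 16-bit PCM, mono


def buffer_bytes(frames: int) -> int:
    """Wire size of one buffer of 16-bit mono samples."""
    return frames * BYTES_PER_SAMPLE


def in_flight_ms(queued_chunks: int, frames_per_chunk: int = 1000) -> float:
    """Audio still pending playback: queued chunks plus the one being written."""
    return (queued_chunks + 1) * frames_per_chunk / SAMPLE_RATE_HZ * 1000.0


print(buffer_bytes(4000))  # 8000 bytes per 250ms input chunk
print(in_flight_ms(0))     # best case right after a drain: 62.5 ms
print(in_flight_ms(8))     # 8 chunks still queued: 562.5 ms
```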
Reasoning
The asymmetric buffer design reflects the different latency requirements of input vs output:
Input (4000 samples / 250ms): The server-side VAD and ASR need coherent speech segments. Sending very small chunks (e.g., 10ms) wastes bandwidth and can confuse speech detection. 250ms is the recommended chunk size documented in the `AudioInterface.start()` docstring.
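To illustrate the chunking, here is a hypothetical helper (not from the SDK) that slices a raw PCM byte stream into 250ms chunks of the kind an input callback would receive:

```python
from typing import Iterator


def chunk_pcm(pcm: bytes, frames_per_chunk: int = 4000,
              bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Yield successive chunks of frames_per_chunk 16-bit mono samples."""
    step = frames_per_chunk * bytes_per_sample
    for i in range(0, len(pcm), step):
        yield pcm[i:i + step]


one_second = b"\x00" * (16_000 * 2)  # 1s of 16-bit silence at 16kHz
chunks = list(chunk_pcm(one_second))
print(len(chunks))  # 4 chunks of 250ms each
```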
Output (1000 samples / 62.5ms): When the user interrupts the agent, all buffered audio must be discarded immediately. A smaller output buffer means less audio is "in flight" between the queue and the speaker, resulting in faster perceived interruption. The `interrupt()` method drains the output queue to achieve this.
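The drain pattern described above can be demonstrated with a plain `queue.Queue` (the chunk count here is illustrative):

```python
import queue

output_queue: queue.Queue = queue.Queue()
for _ in range(8):                    # ~500ms of 62.5ms chunks queued up
    output_queue.put(b"\x00" * 2000)  # 1000 samples x 2 bytes each

# Without a drain, all 8 chunks would still reach the speaker after the
# user starts talking. interrupt() discards them without blocking:
try:
    while True:
        output_queue.get(block=False)
except queue.Empty:
    pass

print(output_queue.qsize())  # 0: only the in-flight buffer can still play
```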
Code Evidence
Buffer size constants in `default_audio_interface.py:10-11`:
```python
class DefaultAudioInterface(AudioInterface):
    INPUT_FRAMES_PER_BUFFER = 4000   # 250ms @ 16kHz
    OUTPUT_FRAMES_PER_BUFFER = 1000  # 62.5ms @ 16kHz
```
Recommended chunk size documentation in `conversation.py:86-88`:
"""Starts the audio interface.
...
The audio should be in 16-bit PCM mono format at 16kHz. Recommended
chunk size is 4000 samples (250 milliseconds).
"""
Output queue timeout in `default_audio_interface.py:73-79`:
```python
def _output_thread(self):
    while not self.should_stop.is_set():
        try:
            audio = self.output_queue.get(timeout=0.25)
            self.out_stream.write(audio)
        except queue.Empty:
            pass
```
Interruption handling in `default_audio_interface.py:62-71`:
```python
def interrupt(self):
    # Clear the output queue to stop any audio that is currently playing.
    try:
        while True:
            _ = self.output_queue.get(block=False)
    except queue.Empty:
        pass
```