Implementation:Neuml Txtai Microphone
| Knowledge Sources | |
|---|---|
| Domains | Audio, Voice_Detection |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
The Microphone class captures speech audio from a microphone device using an ensemble voice activity detection (VAD) approach that combines WebRTC VAD, Butterworth band-pass filtering, and energy-based detection.
Description
The Microphone class inherits from Pipeline and captures speech in real time. It combines three voice activity detectors: WebRTC VAD, which classifies frames with Gaussian mixture models; a Butterworth band-pass filter targeting the human voice frequency range (configurable via the voicestart and voiceend parameters); and energy-based detection that measures signal power. Combining the three is intended to reduce the false positives and false negatives that any single method produces on its own. The class also manages the audio capture lifecycle: it detects speech onset, tracks active speech segments, and decides that speech has ended once silence exceeds the configurable pause threshold.
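The ensemble decision can be pictured as a vote across the three detectors. The sketch below is illustrative only. It assumes each detector yields a boolean, that the votes are averaged with equal weight, and that the average is compared against `vadthreshold`. How txtai actually weights or combines the detectors is not stated here. `frame_rms` and `ensemble_vote` are hypothetical helpers, not txtai functions, and the `0.01` energy cutoff is an arbitrary placeholder.

```python
import math


def frame_rms(frame):
    """Root-mean-square amplitude of a frame of float samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def ensemble_vote(webrtc, band, energy, threshold=0.6):
    """Average three boolean detector votes and compare to the threshold.

    Assumption: equal weights. The real combination rule may differ.
    """
    votes = (webrtc, band, energy)
    score = sum(1 for vote in votes if vote) / len(votes)
    return score >= threshold


# One 20 ms frame at 16 kHz holds 320 samples
rate, vadframe = 16000, 20
size = rate * vadframe // 1000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(size)]

# Energy vote from a hypothetical cutoff; the other two votes are stand-in values
energy = frame_rms(tone) >= 0.01
print(ensemble_vote(webrtc=True, band=False, energy=energy))
```

Under these assumptions, the default threshold of 0.6 means two agreeing detectors out of three (a score of about 0.67) are enough to classify a frame as speech, while a single detector (about 0.33) is not.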
Usage
Use the Microphone class when you need to capture spoken audio from a microphone for downstream processing such as speech-to-text transcription, voice commands, or audio analysis. It is particularly well-suited for interactive applications where you need reliable speech endpoint detection without manual start/stop controls. Access it through txtai's pipeline system.
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/pipeline/audio/microphone.py
- Lines: 1-244
Signature
class Microphone(Pipeline):

    def __init__(self, rate=16000, vadmode=3, vadframe=20, vadthreshold=0.6,
                 voicestart=300, voiceend=3400, active=5, pause=8):
        """
        Creates a Microphone pipeline for speech capture.

        Args:
            rate: audio sample rate in Hz (default: 16000)
            vadmode: WebRTC VAD aggressiveness mode 0-3 (default: 3, most aggressive)
            vadframe: VAD frame duration in milliseconds, must be 10, 20, or 30 (default: 20)
            vadthreshold: ensemble VAD threshold for speech detection (default: 0.6)
            voicestart: low frequency cutoff in Hz for band-pass filter (default: 300)
            voiceend: high frequency cutoff in Hz for band-pass filter (default: 3400)
            active: number of consecutive speech frames to confirm speech onset (default: 5)
            pause: number of consecutive silence frames to confirm speech end (default: 8)
        """

    def __call__(self, device=None):
        """
        Captures speech audio from the microphone.

        Args:
            device: audio input device index or None for default device

        Returns:
            tuple of (audio_data, sample_rate) where audio_data is a numpy array
        """

    def listen(self, device):
        """Opens the audio stream and listens for speech segments."""

    def isspeech(self, frame):
        """Determines if an audio frame contains speech using ensemble VAD."""

    def detect(self, frame):
        """Runs WebRTC VAD detection on a single frame."""

    def detectband(self, frame):
        """Applies Butterworth band-pass filter and checks for voice frequencies."""

    def detectenergy(self, frame):
        """Measures frame energy to detect speech presence."""
Import
from txtai.pipeline import Microphone
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| rate | int | No | Audio sample rate in Hz (default: 16000) |
| vadmode | int | No | WebRTC VAD aggressiveness, 0 (least) to 3 (most aggressive), default: 3 |
| vadframe | int | No | VAD frame duration in milliseconds, must be 10, 20, or 30 (default: 20) |
| vadthreshold | float | No | Ensemble threshold for classifying a frame as speech, range 0.0-1.0 (default: 0.6) |
| voicestart | int | No | Low frequency cutoff in Hz for the Butterworth band-pass filter (default: 300) |
| voiceend | int | No | High frequency cutoff in Hz for the Butterworth band-pass filter (default: 3400) |
| active | int | No | Number of consecutive speech frames required to confirm speech onset (default: 5) |
| pause | int | No | Number of consecutive silence frames required to confirm speech end (default: 8) |
| device | int | No (for __call__) | Audio input device index, or None to use the system default microphone |
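The `active` and `pause` counters in the table above can be read as a small state machine over per-frame speech decisions. The following is a minimal sketch of that reading, not txtai's implementation: `first_utterance` is a hypothetical helper, and the choice to exclude the trailing silence from the returned range is an assumption.

```python
def first_utterance(flags, active=5, pause=8):
    """Return (start, end) frame indices of the first utterance, or None.

    flags: per-frame booleans (True = frame classified as speech).
    Onset is confirmed after `active` consecutive speech frames. The
    utterance ends after `pause` consecutive non-speech frames. The end
    index is exclusive.
    """
    speech_run = silence_run = 0
    start = None
    for index, flag in enumerate(flags):
        if start is None:
            # Waiting for onset: count consecutive speech frames
            speech_run = speech_run + 1 if flag else 0
            if speech_run == active:
                start = index - active + 1
        else:
            # Inside an utterance: count consecutive silence frames
            silence_run = 0 if flag else silence_run + 1
            if silence_run == pause:
                # Drop the trailing silence from the returned range
                return start, index - pause + 1
    # Stream ended before enough silence was seen (or speech never started)
    return None if start is None else (start, len(flags))


# 3 silent frames, 10 speech frames, 8 silent frames, then 2 more speech frames
flags = [False] * 3 + [True] * 10 + [False] * 8 + [True] * 2
start, end = first_utterance(flags)

vadframe = 20  # milliseconds per frame
print(f"frames {start}..{end - 1}, {(end - start) * vadframe} ms of speech")
```

With the defaults and 20 ms frames, this reading implies that roughly 100 ms of continuous speech is needed to confirm onset and roughly 160 ms of continuous silence to confirm the end.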
Outputs
| Name | Type | Description |
|---|---|---|
| audio_data | numpy.ndarray | Captured speech audio as a numpy array of audio samples |
| sample_rate | int | Sample rate of the captured audio in Hz (matches the configured rate) |
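To inspect a capture outside Python, the returned pair can be written to a WAV file. The sketch below uses only the standard library and assumes that audio_data holds mono floating-point samples in the range -1.0 to 1.0. The dtype and channel layout are not specified above, so check them before relying on this. `write_wav` is a hypothetical helper, not part of txtai.

```python
import struct
import wave


def write_wav(path, samples, rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    # Clip to the valid range, then scale to signed 16-bit integers
    pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(path, "wb") as output:
        output.setnchannels(1)    # Assumption: mono capture
        output.setsampwidth(2)    # 2 bytes per sample = 16-bit PCM
        output.setframerate(rate)
        output.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
```

After a capture, a call such as `write_wav("speech.wav", audio, rate)` would save the segment, provided the float-range assumption holds.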
Usage Examples
Basic Usage
from txtai.pipeline import Microphone
# Create microphone with default settings
microphone = Microphone()
# Capture speech from the default microphone
# Blocks until speech is detected and completed
audio, rate = microphone()
print(f"Captured {len(audio)} samples at {rate} Hz")
print(f"Duration: {len(audio) / rate:.2f} seconds")
Custom VAD Settings
from txtai.pipeline import Microphone
# Create microphone with more permissive detection settings
microphone = Microphone(
    rate=16000,
    vadmode=2,         # Less aggressive VAD
    vadthreshold=0.5,  # Lower threshold for speech detection
    voicestart=200,    # Wider frequency range
    voiceend=4000,
    active=3,          # Fewer frames needed to confirm speech
    pause=12           # More silence needed to end capture
)
# Capture from a specific audio device
audio, rate = microphone(device=1)
print(f"Captured {len(audio) / rate:.2f} seconds of audio")
Speech-to-Text Pipeline
from txtai.pipeline import Microphone, Transcription
# Create pipelines
microphone = Microphone(rate=16000, vadmode=3)
transcribe = Transcription("openai/whisper-base")
# Capture and transcribe speech
audio, rate = microphone()
# Raw audio data needs its sample rate passed alongside it
text = transcribe(audio, rate)
print(f"You said: {text}")