Implementation:Neuml Txtai Microphone

Knowledge Sources	Neuml_Txtai
Domains	Audio, Voice_Detection
Last Updated	2026-02-09 17:00 GMT

Overview

The Microphone class captures speech audio from a microphone device using an ensemble voice activity detection (VAD) approach that combines WebRTC VAD, Butterworth band-pass filtering, and energy-based detection.

Description

The Microphone class inherits from Pipeline and provides real-time speech capture. It combines three voice activity detection signals: WebRTC VAD, a statistical (Gaussian mixture model) speech detector; a Butterworth band-pass filter targeting the human voice frequency range (configurable via the voicestart and voiceend parameters); and energy-based detection that measures signal power. Combining the three is intended to reduce false positives and false negatives compared with relying on any single method. The class manages the audio capture lifecycle: detecting speech onset, tracking active speech segments, and determining when speech has ended based on configurable pause thresholds.
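
This page does not spell out how the three signals are combined, so the following is only a minimal sketch under stated assumptions: each detector casts one boolean vote per frame, and the frame counts as speech when the fraction of positive votes reaches vadthreshold. The helper names (rms_energy, ensemble_is_speech) and the ENERGY_FLOOR value are hypothetical and not part of txtai, and the WebRTC and band-pass votes are hard-coded placeholders.

```python
import math

# Hypothetical energy floor for float samples in [-1.0, 1.0]; not a txtai value
ENERGY_FLOOR = 0.01

def rms_energy(frame):
    """Root-mean-square amplitude of a frame of float samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def ensemble_is_speech(votes, threshold=0.6):
    """True when the fraction of detectors voting 'speech' meets the threshold."""
    votes = list(votes)
    return bool(votes) and sum(votes) / len(votes) >= threshold

# One 20 ms frame at 16 kHz (320 samples) of a 440 Hz tone, plus a silent frame
rate, vadframe = 16000, 20
size = rate * vadframe // 1000
tone = [0.1 * math.sin(2 * math.pi * 440 * n / rate) for n in range(size)]
silence = [0.0] * size

# The energy vote is computed; the other two votes are placeholders (assumption)
energy_vote = rms_energy(tone) >= ENERGY_FLOOR
webrtc_vote, band_vote = True, False

print(ensemble_is_speech([webrtc_vote, band_vote, energy_vote]))
print(ensemble_is_speech([False, False, rms_energy(silence) >= ENERGY_FLOOR]))
```

With three voters and a threshold of 0.6, at least two detectors must agree before a frame is treated as speech in this sketch. The actual class may weight or gate the signals differently.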

Usage

Use the Microphone class when you need to capture spoken audio from a microphone for downstream processing such as speech-to-text transcription, voice commands, or audio analysis. It is particularly well-suited for interactive applications where you need reliable speech endpoint detection without manual start/stop controls. Access it through txtai's pipeline system.

Code Reference

Source Location

Repository: Neuml_Txtai
File: src/python/txtai/pipeline/audio/microphone.py
Lines: 1-244

Signature

class Microphone(Pipeline):
    def __init__(self, rate=16000, vadmode=3, vadframe=20, vadthreshold=0.6,
                 voicestart=300, voiceend=3400, active=5, pause=8):
        """
        Creates a Microphone pipeline for speech capture.

        Args:
            rate: audio sample rate in Hz (default: 16000)
            vadmode: WebRTC VAD aggressiveness mode 0-3 (default: 3, most aggressive)
            vadframe: VAD frame duration in milliseconds, must be 10/20/30 (default: 20)
            vadthreshold: ensemble VAD threshold for speech detection (default: 0.6)
            voicestart: low frequency cutoff in Hz for band-pass filter (default: 300)
            voiceend: high frequency cutoff in Hz for band-pass filter (default: 3400)
            active: number of consecutive speech frames to confirm speech onset (default: 5)
            pause: number of consecutive silence frames to confirm speech end (default: 8)
        """

    def __call__(self, device=None):
        """
        Captures speech audio from the microphone.

        Args:
            device: audio input device index or None for default device

        Returns:
            tuple of (audio_data, sample_rate) where audio_data is a numpy array
        """

    def listen(self, device):
        """Opens the audio stream and listens for speech segments."""

    def isspeech(self, frame):
        """Determines if an audio frame contains speech using ensemble VAD."""

    def detect(self, frame):
        """Runs WebRTC VAD detection on a single frame."""

    def detectband(self, frame):
        """Applies Butterworth band-pass filter and checks for voice frequencies."""

    def detectenergy(self, frame):
        """Measures frame energy to detect speech presence."""

Import

from txtai.pipeline import Microphone

I/O Contract

Inputs

Name	Type	Required	Description
rate	int	No	Audio sample rate in Hz (default: 16000)
vadmode	int	No	WebRTC VAD aggressiveness, 0 (least) to 3 (most aggressive), default: 3
vadframe	int	No	VAD frame duration in milliseconds, must be 10, 20, or 30 (default: 20)
vadthreshold	float	No	Ensemble threshold for classifying a frame as speech, range 0.0-1.0 (default: 0.6)
voicestart	int	No	Low frequency cutoff in Hz for the Butterworth band-pass filter (default: 300)
voiceend	int	No	High frequency cutoff in Hz for the Butterworth band-pass filter (default: 3400)
active	int	No	Number of consecutive speech frames required to confirm speech onset (default: 5)
pause	int	No	Number of consecutive silence frames required to confirm speech end (default: 8)
device	int	No (for __call__)	Audio input device index, or None to use the system default microphone
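
To illustrate how the active and pause counters described above could drive onset and endpoint detection, here is a minimal sketch that works on a precomputed list of per-frame speech flags. The function name find_utterance and the exact counting rules are assumptions made for illustration; they are not taken from the txtai source.

```python
def find_utterance(flags, active=5, pause=8):
    """
    Returns (start, end) frame indices of the first utterance, or None.

    start is the first frame of the run that confirmed onset;
    end is exclusive and points at the first frame of the closing silence.
    """
    speech_run = 0   # consecutive speech frames while waiting for onset
    silence_run = 0  # consecutive silence frames after onset
    start = None

    for i, is_speech in enumerate(flags):
        if start is None:
            # Waiting for onset: require `active` speech frames in a row
            speech_run = speech_run + 1 if is_speech else 0
            if speech_run >= active:
                start = i - active + 1
        else:
            # Capturing: stop after `pause` silence frames in a row
            silence_run = 0 if is_speech else silence_run + 1
            if silence_run >= pause:
                return start, i - pause + 1

    return None  # no confirmed onset, or speech never ended

# Samples per VAD frame for the default rate and frame duration
rate, vadframe = 16000, 20
samples_per_frame = rate * vadframe // 1000

# 3 silent frames, 6 speech, a 2-frame gap, 4 speech, then 8 silent frames
flags = [0] * 3 + [1] * 6 + [0] * 2 + [1] * 4 + [0] * 8 + [1] * 2
print(samples_per_frame, find_utterance(flags))
```

In this sketch the two-frame gap is shorter than pause, so it stays inside the captured segment. If each counted frame lasts vadframe milliseconds (also an assumption), the defaults correspond to 100 ms of speech to confirm onset and 160 ms of silence to confirm the end.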

Outputs

Name	Type	Description
audio_data	numpy.ndarray	Captured speech audio as a numpy array of audio samples
sample_rate	int	Sample rate of the captured audio in Hz (matches the configured rate)

Usage Examples

Basic Usage

from txtai.pipeline import Microphone

# Create microphone with default settings
microphone = Microphone()

# Capture speech from the default microphone
# Blocks until speech is detected and completed
audio, rate = microphone()
print(f"Captured {len(audio)} samples at {rate} Hz")
print(f"Duration: {len(audio) / rate:.2f} seconds")

Custom VAD Settings

from txtai.pipeline import Microphone

# Create a microphone with more permissive VAD settings
microphone = Microphone(
    rate=16000,
    vadmode=2,           # Less aggressive filtering of non-speech
    vadthreshold=0.5,    # Lower threshold for speech detection
    voicestart=200,      # Wider frequency range
    voiceend=4000,
    active=3,            # Fewer frames needed to confirm speech
    pause=12             # More silence needed to end capture
)

# Capture from a specific audio device
audio, rate = microphone(device=1)
print(f"Captured {len(audio) / rate:.2f} seconds of audio")

Speech-to-Text Pipeline

from txtai.pipeline import Microphone, Transcription

# Create pipelines
microphone = Microphone(rate=16000, vadmode=3)
transcribe = Transcription("openai/whisper-base")

# Capture and transcribe speech
audio, rate = microphone()
# Raw audio data needs its sample rate passed alongside it
text = transcribe(audio, rate)
print(f"You said: {text}")

Related Pages

Principle:Neuml_Txtai_Voice_Capture