
Implementation:Neuml Txtai Microphone

From Leeroopedia


Knowledge Sources
Domains: Audio, Voice_Detection
Last Updated: 2026-02-09 17:00 GMT

Overview

The Microphone class captures speech audio from a microphone device. It decides which audio is speech with an ensemble voice activity detection (VAD) approach that combines WebRTC VAD, Butterworth band-pass filtering, and energy-based detection.

Description

The Microphone class inherits from Pipeline and provides real-time speech capture. It employs a three-part voice activity detection strategy: WebRTC VAD, a statistical (Gaussian mixture model) speech detector; a Butterworth band-pass filter targeting the human voice frequency range (configurable via the voicestart and voiceend parameters); and energy-based detection that measures signal power. Combining the three is intended to reduce the false positives and false negatives that any single method produces on its own. The class manages the audio capture lifecycle: detecting speech onset, tracking active speech segments, and deciding that speech has ended once a configurable pause threshold is reached.
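The onset and end-of-speech logic described above can be illustrated with a small state machine. This is a minimal sketch, not the library's code: `EndpointTracker`, `update`, and `trace` are hypothetical names, and the only details taken from this page are the meanings of `active` (consecutive speech frames that confirm onset) and `pause` (consecutive silence frames that confirm the end).

```python
from dataclasses import dataclass


@dataclass
class EndpointTracker:
    """Hypothetical helper: turns per-frame speech decisions into capture events."""

    active: int = 5        # consecutive speech frames needed to confirm onset
    pause: int = 8         # consecutive silence frames needed to confirm end
    speech_run: int = 0    # current run of speech frames
    silence_run: int = 0   # current run of silence frames
    recording: bool = False

    def update(self, is_speech: bool) -> str:
        """Feed one frame decision; returns 'idle', 'start', 'speaking', or 'end'."""
        if is_speech:
            self.speech_run += 1
            self.silence_run = 0
        else:
            self.silence_run += 1
            self.speech_run = 0

        if not self.recording:
            # Short bursts below the `active` count are treated as noise
            if self.speech_run >= self.active:
                self.recording = True
                return "start"
            return "idle"

        # While recording, only a long enough silence ends the capture
        if self.silence_run >= self.pause:
            self.recording = False
            self.silence_run = 0
            return "end"
        return "speaking"


def trace(decisions, active=5, pause=8):
    """Run a sequence of per-frame decisions through a fresh tracker."""
    tracker = EndpointTracker(active=active, pause=pause)
    return [tracker.update(d) for d in decisions]


# Two stray speech frames, one silent frame, five speech frames, eight silent frames
decisions = [True] * 2 + [False] + [True] * 5 + [False] * 8
events = trace(decisions)
```

In this sketch the two stray speech frames never reach the `active` count, so capture starts only on the fifth frame of the longer run and stops on the eighth silent frame that follows.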

Usage

Use the Microphone class when you need to capture spoken audio from a microphone for downstream processing such as speech-to-text transcription, voice commands, or audio analysis. It is particularly well-suited for interactive applications where you need reliable speech endpoint detection without manual start/stop controls. Access it through txtai's pipeline system.

Code Reference

Source Location

Signature

class Microphone(Pipeline):
    def __init__(self, rate=16000, vadmode=3, vadframe=20, vadthreshold=0.6,
                 voicestart=300, voiceend=3400, active=5, pause=8):
        """
        Creates a Microphone pipeline for speech capture.

        Args:
            rate: audio sample rate in Hz (default: 16000)
            vadmode: WebRTC VAD aggressiveness mode 0-3 (default: 3, most aggressive)
            vadframe: VAD frame duration in milliseconds, must be 10/20/30 (default: 20)
            vadthreshold: ensemble VAD threshold for speech detection (default: 0.6)
            voicestart: low frequency cutoff in Hz for band-pass filter (default: 300)
            voiceend: high frequency cutoff in Hz for band-pass filter (default: 3400)
            active: number of consecutive speech frames to confirm speech onset (default: 5)
            pause: number of consecutive silence frames to confirm speech end (default: 8)
        """

    def __call__(self, device=None):
        """
        Captures speech audio from the microphone.

        Args:
            device: audio input device index or None for default device

        Returns:
            tuple of (audio_data, sample_rate) where audio_data is a numpy array
        """

    def listen(self, device):
        """Opens the audio stream and listens for speech segments."""

    def isspeech(self, frame):
        """Determines if an audio frame contains speech using ensemble VAD."""

    def detect(self, frame):
        """Runs WebRTC VAD detection on a single frame."""

    def detectband(self, frame):
        """Applies Butterworth band-pass filter and checks for voice frequencies."""

    def detectenergy(self, frame):
        """Measures frame energy to detect speech presence."""

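To make the energy-based check concrete, the sketch below computes the root-mean-square (RMS) amplitude of a frame and compares it with a threshold. It is an illustration only: it assumes the frame is raw little-endian 16-bit mono PCM bytes, and `frame_rms`, `has_energy`, `sine_frame`, and the threshold of 500 are hypothetical choices rather than what `detectenergy` actually does.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """RMS amplitude of a little-endian 16-bit mono PCM frame (assumed format)."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def has_energy(frame: bytes, threshold: float = 500.0) -> bool:
    """Hypothetical energy gate; the threshold is an illustrative value."""
    return frame_rms(frame) >= threshold


def sine_frame(freq: float, amplitude: float, rate: int = 16000, ms: int = 20) -> bytes:
    """Builds a synthetic test frame containing a single sine tone."""
    n = rate * ms // 1000
    values = [int(amplitude * math.sin(2 * math.pi * freq * i / rate)) for i in range(n)]
    return struct.pack(f"<{n}h", *values)


quiet = sine_frame(440, 100)    # low-amplitude tone, stays below the gate
loud = sine_frame(440, 8000)    # high-amplitude tone, passes the gate
```

An energy gate like this is cheap but cannot tell speech from any other loud sound, which is one reason to pair it with the other two detectors.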
Import

from txtai.pipeline import Microphone

I/O Contract

Inputs

Name          Type   Required  Description
rate          int    No        Audio sample rate in Hz (default: 16000)
vadmode       int    No        WebRTC VAD aggressiveness, 0 (least) to 3 (most aggressive) (default: 3)
vadframe      int    No        VAD frame duration in milliseconds, must be 10, 20, or 30 (default: 20)
vadthreshold  float  No        Ensemble threshold for classifying a frame as speech, range 0.0-1.0 (default: 0.6)
voicestart    int    No        Low frequency cutoff in Hz for the Butterworth band-pass filter (default: 300)
voiceend      int    No        High frequency cutoff in Hz for the Butterworth band-pass filter (default: 3400)
active        int    No        Number of consecutive speech frames required to confirm speech onset (default: 5)
pause         int    No        Number of consecutive silence frames required to confirm speech end (default: 8)
device        int    No        Argument to __call__: audio input device index, or None to use the system default microphone
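The `rate` and `vadframe` values together fix how many samples each VAD frame holds. The sketch below works through that arithmetic, assuming 16-bit mono PCM. `frame_size` is a hypothetical helper, and the accepted sample rates listed are those of the WebRTC VAD library itself; this page does not say whether Microphone validates them.

```python
VALID_RATES = (8000, 16000, 32000, 48000)   # sample rates WebRTC VAD accepts
VALID_FRAMES_MS = (10, 20, 30)              # frame durations WebRTC VAD accepts


def frame_size(rate: int = 16000, vadframe: int = 20, sample_width: int = 2):
    """Returns (samples, bytes) per VAD frame for 16-bit mono PCM by default."""
    if rate not in VALID_RATES:
        raise ValueError(f"unsupported sample rate: {rate}")
    if vadframe not in VALID_FRAMES_MS:
        raise ValueError(f"unsupported frame duration: {vadframe} ms")
    samples = rate * vadframe // 1000
    return samples, samples * sample_width


default_frame = frame_size()            # (320, 640) with the default settings
narrowband = frame_size(8000, 30)       # (240, 480)
```

With the defaults, each 20 ms frame at 16000 Hz holds 320 samples, or 640 bytes.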

Outputs

Name         Type           Description
audio_data   numpy.ndarray  Captured speech audio as a numpy array of audio samples
sample_rate  int            Sample rate of the captured audio in Hz (matches the configured rate)

Usage Examples

Basic Usage

from txtai.pipeline import Microphone

# Create microphone with default settings
microphone = Microphone()

# Capture speech from the default microphone
# Blocks until speech is detected and completed
audio, rate = microphone()
print(f"Captured {len(audio)} samples at {rate} Hz")
print(f"Duration: {len(audio) / rate:.2f} seconds")

Custom VAD Settings

from txtai.pipeline import Microphone

# Create microphone with more permissive VAD settings (triggers more readily than the defaults)
microphone = Microphone(
    rate=16000,
    vadmode=2,           # Less aggressive VAD
    vadthreshold=0.5,    # Lower threshold for speech detection
    voicestart=200,      # Wider frequency range
    voiceend=4000,
    active=3,            # Fewer frames needed to confirm speech
    pause=12             # More silence needed to end capture
)

# Capture from a specific audio device
audio, rate = microphone(device=1)
print(f"Captured {len(audio) / rate:.2f} seconds of audio")

Speech-to-Text Pipeline

from txtai.pipeline import Microphone, Transcription

# Create pipelines
microphone = Microphone(rate=16000, vadmode=3)
transcribe = Transcription("openai/whisper-base")

# Capture speech, then transcribe it (raw audio needs its sample rate passed along)
audio, rate = microphone()
text = transcribe(audio, rate)
print(f"You said: {text}")
