Implementation:Neuml Txtai Transcription
| Knowledge Sources | |
|---|---|
| Domains | Audio, Speech Recognition, NLP |
| Last Updated | 2026-02-10 01:00 GMT |
Overview
A concrete tool, provided by txtai, for transcribing audio to text.
Description
Transcription is a pipeline that transcribes audio files or raw audio data to text using Hugging Face automatic speech recognition (ASR) models. It extends the HFPipeline base class and wraps the automatic-speech-recognition task. The pipeline accepts several audio input formats: file paths, file-like objects, NumPy arrays (with the sample rate passed separately), and (audio, rate) tuples. Audio is automatically converted to mono and resampled to the model's expected sample rate. Long audio files are handled through chunked processing, which splits the audio into segments of a configurable duration. Two processing modes are available: a standard mode that returns the transcribed text for each input, and a batch mode that returns per-chunk results together with the raw audio data and sample rate of each chunk. Text normalization converts all-uppercase output to capitalized case.
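The preprocessing steps described above (mono conversion, resampling, chunking, and text normalization) can be illustrated with a small NumPy sketch. This is not txtai's code: the helper names `to_mono`, `resample`, `segments`, and `clean` are hypothetical, the input is assumed to have shape (samples, channels), and the linear-interpolation resampler is a deliberately naive stand-in for whatever resampler the library actually uses.

```python
import numpy as np

def to_mono(raw):
    """Average the channels, assuming shape (samples, channels)."""
    return raw.mean(axis=1) if raw.ndim > 1 else raw

def resample(raw, rate, target):
    """Naive linear-interpolation resampler (illustrative stand-in only)."""
    if rate == target:
        return raw
    count = int(round(len(raw) * target / rate))
    source = np.arange(len(raw)) / rate   # timestamps of the input samples
    dest = np.arange(count) / target      # timestamps of the output samples
    return np.interp(dest, source, raw).astype(raw.dtype)

def segments(raw, rate, chunk):
    """Split audio into (data, rate) pieces of at most `chunk` seconds."""
    size = int(rate * chunk)
    return [(raw[i:i + size], rate) for i in range(0, len(raw), size)]

def clean(text):
    """Capitalize text only when the model emitted it in all uppercase."""
    text = text.strip()
    return text.capitalize() if text.isupper() else text

# 25 seconds of silent stereo audio at 44.1 kHz (placeholder data)
stereo = np.zeros((44100 * 25, 2), dtype=np.float32)
mono = to_mono(stereo)
audio = resample(mono, 44100, 16000)
parts = segments(audio, 16000, 10)

print(len(audio), [len(p) for p, _ in parts])  # 400000 [160000, 160000, 80000]
print(clean("HELLO WORLD"))                    # Hello world
```

With a 10-second chunk size, the 25-second input yields two full segments and one shorter trailing segment.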
Usage
Use Transcription when you need to convert audio recordings into text. This is useful for speech-to-text applications, meeting transcription, voice command processing, podcast indexing, or any workflow that requires extracting text from audio content. It pairs naturally with the Microphone pipeline for real-time voice input.
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/pipeline/audio/transcription.py
Signature
class Transcription(HFPipeline):
    def __init__(self, path=None, quantize=False, gpu=True, model=None, **kwargs)
    def __call__(self, audio, rate=None, chunk=10, join=True, **kwargs)
Import
from txtai.pipeline.audio.transcription import Transcription
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | No | Model path or Hugging Face repo id for the ASR model. |
| quantize | bool | No | Enable model quantization. Defaults to False. |
| gpu | bool | No | Use GPU acceleration if available. Defaults to True. |
| model | object | No | Optional pre-loaded model instance. |
| audio | str, tuple, numpy.ndarray, file-like, or list | Yes | Audio input: a file path, (audio_data, rate) tuple, NumPy array, file-like object, or a list of any of these. |
| rate | int | No | Sample rate of the input audio. Only required when audio is a raw NumPy array without an accompanying sample rate. |
| chunk | int | No | Length in seconds of the segments that audio is split into for processing. Defaults to 10. |
| join | bool | No | If True (default), combines all chunk transcriptions into a single text string. If False, returns per-chunk results with raw audio data. |
| kwargs | dict | No | Additional keyword arguments passed to the model's generate method. |
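To make the accepted `audio` forms concrete, the following sketch classifies inputs by where their sample rate comes from. It is a hypothetical caller-side helper, not part of txtai: the names `is_single` and `describe` are invented here, and raising `ValueError` for an array without a rate is this sketch's own choice rather than documented library behavior. Note that a tuple counts as one (audio, rate) input, so a batch should be passed as a list.

```python
import io
import numpy as np

def is_single(audio):
    """True for one audio input (path, tuple, array, file-like), False for a list."""
    return isinstance(audio, (str, tuple, np.ndarray)) or hasattr(audio, "read")

def describe(audio, rate=None):
    """Label each input by the source of its sample rate (illustration only)."""
    items = [audio] if is_single(audio) else list(audio)
    labels = []
    for item in items:
        if isinstance(item, str) or hasattr(item, "read"):
            labels.append("file")    # rate comes from the audio file itself
        elif isinstance(item, tuple):
            labels.append("tuple")   # rate travels with the data
        elif rate is not None:
            labels.append("array")   # rate comes from the rate argument
        else:
            raise ValueError("a raw NumPy array needs an explicit rate")
    return labels

data = np.zeros(16000, dtype=np.float32)
print(describe("audio.wav"))
print(describe([data, (data, 8000), io.BytesIO(b"")], rate=16000))
```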
Outputs
| Name | Type | Description |
|---|---|---|
| result | str | Transcribed text when a single audio input is provided and join=True. |
| results | list of str | List of transcribed text strings when a list of audio inputs is provided and join=True. |
| results | list of list of dict | When join=False, a list of lists of dicts. Each dict contains "text" (str), "raw" (numpy.ndarray), and "rate" (int) keys. |
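Because each per-chunk dict carries its raw samples and sample rate, approximate timestamps can be derived from the chunk lengths. The sketch below is a hypothetical post-processing helper (`with_offsets` is not a txtai function) and assumes the chunks for one audio input are contiguous, non-overlapping, and in order. The sample data is placeholder input that mimics the documented "text", "raw", and "rate" keys.

```python
import numpy as np

def with_offsets(chunks):
    """Add start/end times in seconds to the chunk dicts of one audio input."""
    rows, start = [], 0.0
    for chunk in chunks:
        duration = len(chunk["raw"]) / chunk["rate"]
        rows.append({"text": chunk["text"], "start": start, "end": start + duration})
        start += duration
    return rows

# Placeholder chunks shaped like the join=False output for a single input
sample = [
    {"text": "First segment", "raw": np.zeros(80000, dtype=np.float32), "rate": 16000},
    {"text": "Second segment", "raw": np.zeros(40000, dtype=np.float32), "rate": 16000},
]

rows = with_offsets(sample)
for row in rows:
    print(f'{row["start"]:.1f}-{row["end"]:.1f}s: {row["text"]}')
```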
Usage Examples
import numpy as np
from txtai.pipeline import Transcription
# Create a transcription pipeline
transcribe = Transcription()
# Transcribe an audio file
text = transcribe("audio.wav")
# Transcribe a NumPy array with sample rate (random noise as placeholder data)
audio_data = np.random.randn(16000).astype(np.float32)
text = transcribe(audio_data, rate=16000)
# Transcribe with a tuple of (audio, rate)
text = transcribe((audio_data, 16000))
# Batch transcribe multiple audio files
texts = transcribe(["audio1.wav", "audio2.wav"])
# Get per-chunk results with raw audio data
# Passing a list returns one list of chunk dicts per input
chunks = transcribe(["long_audio.wav"], chunk=5, join=False)
for chunk_list in chunks:
    for chunk in chunk_list:
        print(chunk["text"], chunk["rate"])
# Use a specific model
transcribe_whisper = Transcription(path="openai/whisper-base")
text = transcribe_whisper("audio.wav")