Workflow:Openai Openai python Audio Processing

From Leeroopedia
Knowledge Sources
Domains Audio, Speech_to_Text, Text_to_Speech, API_Integration
Last Updated 2026-02-15 10:00 GMT

Overview

End-to-end process for audio generation and processing using the OpenAI Audio API, covering text-to-speech synthesis, speech-to-text transcription, and audio translation.

Description

This workflow covers the three main audio capabilities provided by the OpenAI Python SDK: text-to-speech (TTS) using models such as tts-1, speech-to-text transcription using Whisper (whisper-1), and translation of spoken audio into English text. The SDK provides both synchronous and asynchronous clients, streaming response support for TTS, and helper classes (Microphone and LocalAudioPlayer) for real-time audio capture and playback. Audio input can be supplied as a path-like object (such as pathlib.Path), an open binary file, raw bytes, or a recording captured directly from the microphone.

Usage

Execute this workflow when you need to convert text to spoken audio, transcribe audio recordings to text, or translate spoken audio from other languages to English text. This is appropriate for voice-enabled applications, accessibility features, content creation tools, or any application that needs to bridge text and audio modalities.

Execution Steps

Step 1: Client Initialization

Create an OpenAI or AsyncOpenAI client instance. The audio APIs are available under client.audio as sub-resources: client.audio.speech for TTS, client.audio.transcriptions for speech-to-text, and client.audio.translations for translation. Install optional dependencies for audio playback (sounddevice) if using the LocalAudioPlayer helper.

Key considerations:

  • Use AsyncOpenAI with LocalAudioPlayer for streaming audio playback
  • Install the openai[voice_helpers] extra, or sounddevice and numpy directly, for the audio capture and playback helpers
  • The Microphone helper requires the sounddevice and numpy packages
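The initialization step can be sketched as follows. This is a minimal illustration, not the SDK's prescribed setup: build_client and audio_resources are hypothetical helper names, the SDK import is deferred so the module loads without the package installed, and the client is assumed to read its key from the OPENAI_API_KEY environment variable.

```python
def build_client(use_async: bool = False):
    """Return an OpenAI client, or an AsyncOpenAI client when use_async is True.

    Hypothetical helper. Both constructors read OPENAI_API_KEY from the
    environment when no api_key argument is passed.
    """
    # Imported lazily so this module can be loaded without the SDK installed.
    from openai import AsyncOpenAI, OpenAI

    return AsyncOpenAI() if use_async else OpenAI()


def audio_resources(client):
    """Map each audio capability to its sub-resource under client.audio."""
    return {
        "tts": client.audio.speech,
        "transcription": client.audio.transcriptions,
        "translation": client.audio.translations,
    }


# Usage (requires the openai package and an API key):
#   client = build_client()
#   resources = audio_resources(client)
#   resources["tts"].create(...)
```

Keeping the lookup in one place makes it easy to pass either the synchronous or the asynchronous client to the later steps.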

Step 2: Text to Speech Generation

Generate spoken audio from text using client.audio.speech.create() or the streaming variant client.audio.speech.with_streaming_response.create(). Specify the model (tts-1 or tts-1-hd), a voice (for example alloy, echo, fable, onyx, nova, or shimmer), and the input text. With the streaming variant, the audio can be written to disk using response.stream_to_file() or consumed chunk by chunk for real-time playback; the non-streaming call returns the complete audio only after synthesis finishes.

Key considerations:

  • Use with_streaming_response for real-time audio playback with minimal latency
  • The pcm response format works best with LocalAudioPlayer for streaming
  • tts-1 is optimized for speed; tts-1-hd for quality
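A minimal sketch of the file-saving path described above, assuming a synchronous client is passed in. The function name synthesize_to_file and its defaults are illustrative choices; the with_streaming_response.create() call and stream_to_file() are the SDK methods named in this step.

```python
from pathlib import Path


def synthesize_to_file(client, text, out_path, model="tts-1", voice="alloy"):
    """Stream TTS audio for `text` into `out_path` and return the path.

    `client` is expected to be an openai.OpenAI instance. The default
    response format of the endpoint is mp3, so out_path would normally
    end in .mp3.
    """
    out_path = Path(out_path)
    # The context manager keeps the HTTP response open while the body
    # is written to disk in chunks.
    with client.audio.speech.with_streaming_response.create(
        model=model,
        voice=voice,
        input=text,
    ) as response:
        response.stream_to_file(out_path)
    return out_path


# Usage (requires the openai package and an API key):
#   from openai import OpenAI
#   synthesize_to_file(OpenAI(), "Hello from the audio API.", "hello.mp3")
```

Switching model to tts-1-hd trades latency for quality, as noted in the considerations above.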

Step 3: Speech to Text Transcription

Transcribe audio files to text using client.audio.transcriptions.create(). Provide the model name (whisper-1) and the audio file as a path-like object, an open binary file, raw bytes, or a (filename, content, mime-type) tuple. The Microphone helper can capture audio directly and produce a file input suitable for the transcription endpoint.

Key considerations:

  • Supported audio formats include flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm
  • The Microphone helper simplifies real-time audio capture
  • Set a timeout on Microphone to control recording duration
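The transcription step might look like the sketch below. The names transcribe and record_and_transcribe are hypothetical; the second function follows the pattern of the SDK's microphone example and assumes the helper is importable from openai.helpers with sounddevice and numpy installed.

```python
from pathlib import Path


def transcribe(client, audio, filename="audio.wav", model="whisper-1"):
    """Transcribe audio given as a path (str or Path) or as raw bytes.

    Raw bytes are wrapped in a (filename, content) tuple so the API can
    infer the audio format from the filename extension.
    """
    if isinstance(audio, (bytes, bytearray)):
        file_arg = (filename, bytes(audio))
    else:
        file_arg = Path(audio)
    result = client.audio.transcriptions.create(model=model, file=file_arg)
    return result.text


async def record_and_transcribe(async_client, seconds=10.0, model="whisper-1"):
    """Record from the default microphone, then transcribe the recording.

    Assumption: Microphone lives in openai.helpers and its timeout
    argument bounds the recording duration in seconds.
    """
    from openai.helpers import Microphone  # lazy: needs sounddevice + numpy

    recording = await Microphone(timeout=seconds).record()
    result = await async_client.audio.transcriptions.create(
        model=model, file=recording
    )
    return result.text


# Usage (requires the openai package and an API key):
#   from openai import OpenAI
#   print(transcribe(OpenAI(), "meeting.mp3"))
```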

Step 4: Audio Translation

Translate non-English audio to English text using client.audio.translations.create(). This endpoint accepts the same audio file formats as transcription and returns English text regardless of the source language. It uses the same whisper-1 model as transcription.

Key considerations:

  • Translation always outputs English text
  • The input audio can be in any supported language
  • Uses the same file format and input patterns as transcription
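Because translation shares its input pattern with transcription, a sketch differs only in the sub-resource it calls. The function name translate_to_english is an illustrative choice, and the client is assumed to be a synchronous openai.OpenAI instance.

```python
from pathlib import Path


def translate_to_english(client, audio_path, model="whisper-1"):
    """Return English text for speech recorded in another language.

    The source language is not passed in; the endpoint always produces
    English output.
    """
    result = client.audio.translations.create(
        model=model,
        file=Path(audio_path),
    )
    return result.text


# Usage (requires the openai package and an API key):
#   from openai import OpenAI
#   print(translate_to_english(OpenAI(), "interview_de.mp3"))
```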

Step 5: Audio Playback (Optional)

For real-time audio playback, use the LocalAudioPlayer helper class which wraps sounddevice for immediate audio output. It accepts streaming responses from the TTS endpoint and plays audio chunks as they arrive, minimizing time-to-first-audio. This is an async-only operation.

Key considerations:

  • Requires the sounddevice and numpy packages
  • Works with the pcm response format from TTS
  • Use await LocalAudioPlayer().play(response) with streaming responses
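The playback step can be sketched as a coroutine that requests pcm audio and hands the open streaming response to a player. The function name speak is hypothetical, and the player is passed in as a parameter (an assumption made here for testability); in practice it would be a LocalAudioPlayer instance from openai.helpers.

```python
import asyncio


async def speak(client, text, player, model="tts-1", voice="alloy"):
    """Stream TTS audio for `text` and play it as chunks arrive.

    `client` is expected to be an openai.AsyncOpenAI instance and
    `player` any object with an async play(response) method, such as
    LocalAudioPlayer.
    """
    async with client.audio.speech.with_streaming_response.create(
        model=model,
        voice=voice,
        input=text,
        response_format="pcm",  # raw samples, as the considerations above suggest
    ) as response:
        # Playback must happen inside the context manager, while the
        # response body is still open.
        await player.play(response)


# Usage (requires openai, sounddevice, numpy, and an API key):
#   from openai import AsyncOpenAI
#   from openai.helpers import LocalAudioPlayer
#   asyncio.run(speak(AsyncOpenAI(), "Hello!", LocalAudioPlayer()))
```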
