Workflow:Openai Openai python Audio Processing

Knowledge Sources	OpenAI Python SDK, OpenAI Audio API Reference, OpenAI Text-to-Speech Guide
Domains	Audio, Speech_to_Text, Text_to_Speech, API_Integration
Last Updated	2026-02-15 10:00 GMT

Overview

End-to-end process for audio generation and processing using the OpenAI Audio API, covering text-to-speech synthesis, speech-to-text transcription, and audio translation.

Description

This workflow covers the three main audio capabilities provided by the OpenAI Python SDK: text-to-speech (TTS) using models like tts-1, speech-to-text transcription using Whisper (whisper-1), and translation of spoken audio into English text. The SDK provides both synchronous and asynchronous clients, streaming response support for TTS, and helper classes (Microphone and LocalAudioPlayer) for real-time audio capture and playback. Audio input can be supplied as a file path, as raw bytes, or as a recording captured directly from the microphone.

Usage

Execute this workflow when you need to convert text to spoken audio, transcribe audio recordings to text, or translate spoken audio from other languages to English text. This is appropriate for voice-enabled applications, accessibility features, content creation tools, or any application that needs to bridge text and audio modalities.

Execution Steps

Step 1: Client Initialization

Create an OpenAI or AsyncOpenAI client instance. The audio APIs are available under client.audio as sub-resources: client.audio.speech for TTS, client.audio.transcriptions for speech-to-text, and client.audio.translations for translation. Install optional dependencies for audio playback (sounddevice) if using the LocalAudioPlayer helper.

Key considerations:

Use AsyncOpenAI with LocalAudioPlayer for streaming audio playback
Install the openai[voice_helpers] extra, or install sounddevice and numpy directly, for audio playback capabilities
The Microphone helper requires the sounddevice and numpy packages
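The setup described in this step can be sketched as follows. This is a minimal illustration, not the only way to structure it: make_clients and audio_resources are hypothetical helper names introduced here, and the sketch assumes the API key is available in the OPENAI_API_KEY environment variable when no key is passed explicitly.

```python
def make_clients(api_key=None):
    """Build a synchronous and an asynchronous client.

    Hypothetical helper. When api_key is None the SDK falls back to the
    OPENAI_API_KEY environment variable.
    """
    # Imported lazily so this module loads even without the SDK installed.
    from openai import AsyncOpenAI, OpenAI

    return OpenAI(api_key=api_key), AsyncOpenAI(api_key=api_key)


def audio_resources(client):
    """Return the three audio sub-resources named in this step."""
    return {
        "speech": client.audio.speech,                  # text-to-speech
        "transcriptions": client.audio.transcriptions,  # speech-to-text
        "translations": client.audio.translations,      # audio to English text
    }
```

Keeping the client as an explicit argument, rather than a module-level global, makes it straightforward to swap the synchronous client for the asynchronous one in later steps.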

Step 2: Text to Speech Generation

Generate spoken audio from text using client.audio.speech.create() or the streaming variant client.audio.speech.with_streaming_response.create(). Specify the model (tts-1 or tts-1-hd), a voice (for example alloy, echo, fable, onyx, nova, or shimmer), and the input text. With the streaming variant, the response can be written to a file using response.stream_to_file() or consumed chunk by chunk for real-time playback.

Key considerations:

Use with_streaming_response for real-time audio playback with minimal latency
The pcm response format works best with LocalAudioPlayer for streaming
tts-1 is optimized for speed; tts-1-hd for quality
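A minimal sketch of the streaming variant writing speech to a file is shown below. The function name synthesize_to_file and its default arguments are assumptions made for illustration; the client is expected to be a synchronous OpenAI instance.

```python
from pathlib import Path


def synthesize_to_file(client, text, out_path, model="tts-1", voice="alloy"):
    """Stream synthesized speech into out_path and return it as a Path.

    Hypothetical helper; `client` is assumed to be a synchronous OpenAI client.
    """
    out_path = Path(out_path)
    # The streaming variant is used as a context manager so the HTTP
    # response is closed once the audio has been written.
    with client.audio.speech.with_streaming_response.create(
        model=model,
        voice=voice,
        input=text,
    ) as response:
        response.stream_to_file(out_path)
    return out_path


# Usage (needs the SDK and an API key):
#   from openai import OpenAI
#   synthesize_to_file(OpenAI(), "Hello from the audio workflow.", "speech.mp3")
```

Passing model="tts-1-hd" trades some latency for higher audio quality, as noted in the considerations above.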

Step 3: Speech to Text Transcription

Transcribe audio files to text using client.audio.transcriptions.create(). Provide the model name (whisper-1) and the audio file (as an open binary file, a path-like object such as pathlib.Path, bytes, or a tuple of filename/content/mime-type). The Microphone helper can capture audio directly and produce a file-like object suitable for the transcription endpoint.

Key considerations:

Supported input formats are flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm
The Microphone helper simplifies real-time audio capture
Set a timeout on Microphone to control recording duration
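Both input paths from this step can be sketched as below. The function names are hypothetical, and the microphone variant assumes an AsyncOpenAI client plus the optional sounddevice and numpy packages; it follows the pattern of awaiting Microphone(timeout=...).record() and passing the result straight to the endpoint.

```python
from pathlib import Path


def transcribe_file(client, audio_path, model="whisper-1"):
    """Upload one audio file and return the transcribed text.

    Hypothetical helper; `client` is assumed to be a synchronous OpenAI client.
    """
    # Binary mode gives the SDK a file object it can upload as-is.
    with Path(audio_path).open("rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model=model,
            file=audio_file,
        )
    return transcription.text


async def transcribe_from_microphone(async_client, seconds=10.0, model="whisper-1"):
    """Record from the default input device, then transcribe the recording.

    Hypothetical helper; `async_client` is assumed to be an AsyncOpenAI client.
    """
    # Lazy import: the helper depends on optional audio packages.
    from openai.helpers import Microphone

    # The timeout bounds how long the recording runs.
    recording = await Microphone(timeout=seconds).record()
    transcription = await async_client.audio.transcriptions.create(
        model=model,
        file=recording,
    )
    return transcription.text
```

The default response format returns an object whose text attribute holds the transcript, which is why both functions return transcription.text.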

Step 4: Audio Translation

Translate non-English audio to English text using client.audio.translations.create(). This endpoint accepts the same audio file formats as transcription and returns English text regardless of the source language. It uses the same whisper-1 model as transcription.

Key considerations:

Translation always outputs English text
The input audio can be in any supported language
Uses the same file format and input patterns as transcription
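Because translation mirrors transcription, a sketch differs only in the sub-resource that is called. The helper name translate_to_english is an assumption for illustration, and the client is expected to be a synchronous OpenAI instance.

```python
from pathlib import Path


def translate_to_english(client, audio_path, model="whisper-1"):
    """Upload non-English audio and return its English translation.

    Hypothetical helper; `client` is assumed to be a synchronous OpenAI client.
    """
    with Path(audio_path).open("rb") as audio_file:
        translation = client.audio.translations.create(
            model=model,
            file=audio_file,
        )
    # The output is English text whatever language was spoken.
    return translation.text


# Usage (needs the SDK and an API key):
#   from openai import OpenAI
#   print(translate_to_english(OpenAI(), "interview_de.mp3"))
```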

Step 5: Audio Playback (Optional)

For real-time audio playback, use the LocalAudioPlayer helper class which wraps sounddevice for immediate audio output. It accepts streaming responses from the TTS endpoint and plays audio chunks as they arrive, minimizing time-to-first-audio. This is an async-only operation.

Key considerations:

Requires the sounddevice and numpy packages (both are installed by the voice_helpers extra)
Works with the pcm response format from TTS
Use await LocalAudioPlayer().play(response) with streaming responses
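The playback pattern above can be sketched as a small coroutine. The names speak and main are hypothetical; the player is passed in as an argument so that any object with an async play() method can be substituted, and the sketch assumes an AsyncOpenAI client and a working audio output device.

```python
import asyncio


async def speak(async_client, player, text, model="tts-1", voice="alloy"):
    """Stream PCM speech from the TTS endpoint into an audio player.

    Hypothetical helper; `player` is assumed to expose `async play(response)`,
    as LocalAudioPlayer does.
    """
    async with async_client.audio.speech.with_streaming_response.create(
        model=model,
        voice=voice,
        input=text,
        response_format="pcm",  # raw samples, suited to chunked playback
    ) as response:
        # Chunks are played as they arrive rather than after the full download.
        await player.play(response)


def main():
    """Example entry point; needs the SDK, an API key, and the audio extras."""
    from openai import AsyncOpenAI
    from openai.helpers import LocalAudioPlayer

    asyncio.run(speak(AsyncOpenAI(), LocalAudioPlayer(), "Hello from the workflow."))
```

Calling main() runs the whole path from text to audible output; nothing is written to disk.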
