Workflow:Openai Openai node Audio Processing
| Knowledge Sources | |
|---|---|
| Domains | Audio, TTS, STT, API_Integration |
| Last Updated | 2026-02-15 12:00 GMT |
Overview
End-to-end process for text-to-speech synthesis, speech-to-text transcription, and audio translation using the OpenAI Audio API through the Node.js SDK.
Description
This workflow covers the OpenAI Audio API's three capabilities: text-to-speech (TTS), transcription (speech-to-text), and translation. TTS converts text into natural-sounding speech using configurable voices and models, returning audio as a streamable response. Transcription converts audio files into text using the Whisper model. Translation converts audio in any supported language into English text. The SDK supports streaming TTS output for real-time playback and provides helper functions for audio recording and playback in Node.js environments.
Usage
Execute this workflow when you need to convert between text and audio. Use TTS for voice interfaces, accessibility features, audiobook generation, or adding voice output to chatbots. Use transcription for processing recorded meetings, voice commands, dictation, or any audio-to-text conversion. Use translation for converting foreign-language audio into English text.
Execution Steps
Step 1: Client Setup
Initialize the OpenAI client. Audio operations use the same client instance as other API calls. To upload in-memory buffers as files, also import the toFile() helper from the SDK.
Key considerations:
- Import toFile from openai for buffer-to-file conversion
- The same client instance handles all audio operations
- Audio helper functions for recording/playback are in openai/helpers/audio
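The setup step can be sketched as follows. This is a minimal illustration, not the SDK's own code: ClientOptions and resolveClientOptions are hypothetical names, and the sketch only validates the API key that would be handed to the client constructor. The SDK-specific lines are shown as comments so the block runs without the package installed.

```typescript
// Hypothetical shape of the options passed to the client constructor.
interface ClientOptions {
  apiKey: string;
}

// Hypothetical helper: fail early if the API key is missing or blank.
function resolveClientOptions(env: Record<string, string | undefined>): ClientOptions {
  const apiKey = env["OPENAI_API_KEY"];
  if (!apiKey || apiKey.trim() === "") {
    throw new Error("OPENAI_API_KEY is not set");
  }
  return { apiKey: apiKey.trim() };
}

// With the SDK installed, the result would be used roughly like this:
//   import OpenAI, { toFile } from "openai";
//   const client = new OpenAI(resolveClientOptions(process.env));
```

The explicit check is a design choice for clearer error messages; the SDK can also pick up OPENAI_API_KEY from the environment when no key is passed.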
Step 2: Text-to-Speech Generation
Generate spoken audio from text using client.audio.speech.create(). Specify the model, voice, and input text. The response contains the audio data as a streamable body that can be saved to a file or piped to an audio player.
Key considerations:
- Select a model (e.g., tts-1 for speed, tts-1-hd for quality)
- Choose a voice (e.g., alloy, echo, fable, onyx, nova, shimmer)
- The response body is a readable stream for efficient handling
- Read the audio into memory with response.arrayBuffer() when in-memory use is needed
- For large audio outputs, pipe the response body directly to a file instead of buffering it in memory
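The in-memory path described above can be sketched as follows. SpeechClient is a hypothetical structural stand-in that mirrors the shape of the client.audio.speech.create() call, so the sketch runs without the SDK or network access; buildSpeechParams and synthesizeSpeech are illustrative names, and the default voice and model are assumptions taken from the examples in this step.

```typescript
// Request fields named in this step: model, voice, and input text.
interface SpeechParams {
  model: string;
  voice: string;
  input: string;
}

// Hypothetical stand-in for the part of the SDK client used here.
interface SpeechClient {
  audio: {
    speech: {
      create(params: SpeechParams): Promise<{ arrayBuffer(): Promise<ArrayBuffer> }>;
    };
  };
}

// Build the request and reject empty input before calling the API.
function buildSpeechParams(text: string, voice = "alloy", model = "tts-1"): SpeechParams {
  const input = text.trim();
  if (input.length === 0) {
    throw new Error("TTS input text must not be empty");
  }
  return { model, voice, input };
}

// Request speech and return the audio bytes in memory.
async function synthesizeSpeech(client: SpeechClient, text: string): Promise<Uint8Array> {
  const response = await client.audio.speech.create(buildSpeechParams(text));
  return new Uint8Array(await response.arrayBuffer());
}
```

With the real SDK, the returned bytes could then be written out with Node's fs.promises.writeFile("speech.mp3", bytes); the .mp3 name assumes the default response format.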
Step 3: Audio Transcription
Convert audio to text using client.audio.transcriptions.create(). Upload the audio file and specify the model. The SDK accepts various file input types including ReadStream, File objects, and buffers wrapped with toFile().
Key considerations:
- The model is typically whisper-1
- Audio file can be provided as a stream, File, or via toFile() helper
- Supports multiple audio formats, including mp3, mp4, mpeg, mpga, m4a, wav, and webm
- The optional language parameter (an ISO-639-1 code such as en) hints at the source language
- Returns transcribed text in transcription.text
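A minimal sketch of the transcription call, with a local pre-check on the file extension. TranscriptionClient is a hypothetical stand-in that mirrors the shape of client.audio.transcriptions.create(); hasSupportedExtension and transcribe are illustrative names, and the allow-list simply copies the formats listed in this step.

```typescript
// Formats listed in this step; the API may accept others, so extend as needed.
const SUPPORTED_EXTENSIONS = ["mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"];

// Hypothetical helper: check a file name's extension before uploading.
function hasSupportedExtension(fileName: string): boolean {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) return false;
  return SUPPORTED_EXTENSIONS.indexOf(fileName.slice(dot + 1).toLowerCase()) !== -1;
}

interface TranscriptionParams<F> {
  file: F;
  model: string;
  language?: string;
}

// Hypothetical stand-in for the part of the SDK client used here.
interface TranscriptionClient<F> {
  audio: {
    transcriptions: {
      create(params: TranscriptionParams<F>): Promise<{ text: string }>;
    };
  };
}

// Upload the file with whisper-1 and return the transcribed text.
async function transcribe<F>(
  client: TranscriptionClient<F>,
  file: F,
  language?: string,
): Promise<string> {
  const params: TranscriptionParams<F> = { file, model: "whisper-1" };
  if (language !== undefined) params.language = language;
  const transcription = await client.audio.transcriptions.create(params);
  return transcription.text;
}
```

With the real SDK, file would typically be fs.createReadStream("meeting.mp3") or the result of await toFile(buffer, "meeting.mp3"), where the file name is only an example.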
Step 4: Audio Translation
Translate audio in any supported language to English text using client.audio.translations.create(). This combines speech recognition with translation in a single API call.
Key considerations:
- Translation always outputs English text regardless of source language
- Same file input options as transcription
- Uses the same Whisper model as transcription (whisper-1)
- Useful for multilingual applications and content processing
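The difference between transcription and translation can be sketched as a small dispatcher: the same file and model go to either endpoint, depending on whether English output is wanted. AudioTextClient is a hypothetical stand-in that mirrors the shape of the two create() calls, and audioToText is an illustrative name rather than an SDK function.

```typescript
interface AudioRequest<F> {
  file: F;
  model: string;
}

// Hypothetical stand-in for the two SDK endpoints used here.
interface AudioTextClient<F> {
  audio: {
    transcriptions: { create(params: AudioRequest<F>): Promise<{ text: string }> };
    translations: { create(params: AudioRequest<F>): Promise<{ text: string }> };
  };
}

// Keep the source language (transcription) or force English output (translation).
async function audioToText<F>(
  client: AudioTextClient<F>,
  file: F,
  wantEnglish: boolean,
): Promise<string> {
  const request: AudioRequest<F> = { file, model: "whisper-1" };
  const result = wantEnglish
    ? await client.audio.translations.create(request)
    : await client.audio.transcriptions.create(request);
  return result.text;
}
```

Passing the client in as a parameter keeps the routing logic testable with a fake client, which is the only reason for the stand-in interface.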