Workflow:Openai Openai node Audio Processing

Knowledge Sources	OpenAI Node SDK, Audio API Reference, Text-to-Speech Guide, Speech-to-Text Guide
Domains	Audio, TTS, STT, API_Integration
Last Updated	2026-02-15 12:00 GMT

Overview

End-to-end process for text-to-speech synthesis, speech-to-text transcription, and audio translation using the OpenAI Audio API through the Node.js SDK.

Description

This workflow covers the OpenAI Audio API's three capabilities: text-to-speech (TTS), transcription (speech-to-text), and translation. TTS converts text into natural-sounding speech using configurable voices and models, returning audio as a streamable response. Transcription converts audio files into text using the Whisper model. Translation converts audio in any supported language into English text. The SDK supports streaming TTS output for real-time playback and provides helper functions for audio recording and playback in Node.js environments.

Usage

Execute this workflow when you need to convert between text and audio. Use TTS for voice interfaces, accessibility features, audiobook generation, or adding voice output to chatbots. Use transcription for processing recorded meetings, voice commands, dictation, or any audio-to-text conversion. Use translation for converting foreign-language audio into English text.

Execution Steps

Step 1: Client Setup

Initialize the OpenAI client. Audio operations use the same client instance as other API calls, so no separate audio client is needed. To upload in-memory buffers, also import the toFile() helper from the openai package; it wraps a buffer as an uploadable file.

Key considerations:

Import toFile from openai for buffer-to-file conversion
The same client instance handles all audio operations
Audio helper functions for recording/playback are in openai/helpers/audio
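
As a minimal sketch of the setup, the helper below assembles the options object handed to the client constructor. The name buildClientOptions and the chosen retry and timeout values are hypothetical; the option names (apiKey, maxRetries, timeout) are assumed to match the SDK's client options and should be checked against the installed SDK version. The commented lines at the end show the imports this step describes.

```typescript
// Hypothetical helper: collects the options handed to `new OpenAI(...)`.
// The option names (apiKey, maxRetries, timeout) are assumed to match the
// SDK's client options; the values are illustrative, not recommendations.
interface ClientOptions {
  apiKey: string;
  maxRetries: number;
  timeout: number; // milliseconds
}

function buildClientOptions(
  env: Record<string, string | undefined>,
): ClientOptions {
  const apiKey = env["OPENAI_API_KEY"];
  if (!apiKey) {
    // Fail early with a clear message when the key is missing.
    throw new Error("OPENAI_API_KEY is not set");
  }
  return { apiKey, maxRetries: 2, timeout: 60_000 };
}

// Intended use (requires the `openai` package to be installed):
//
//   import OpenAI, { toFile } from "openai";
//   const client = new OpenAI(buildClientOptions(process.env));
//
// The same `client` is then reused for speech, transcription, and translation.
```

Keeping the option-building logic separate from the constructor call means the configuration can be checked without network access or a real API key.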

Step 2: Text-to-Speech Generation

Generate spoken audio from text using client.audio.speech.create(). Specify the model, voice, and input text. The response contains the audio data as a streamable body that can be saved to a file or piped to an audio player.

Key considerations:

Select a model (e.g., tts-1 for speed, tts-1-hd for quality)
Choose a voice (e.g., alloy, echo, fable, onyx, nova, shimmer)
The response body is a readable stream for efficient handling
Call response.arrayBuffer() to buffer the full audio in memory, which suits short clips
Pipe the response body to a file instead of buffering when the audio output is large
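
The speech call described above can be sketched as follows. The function name synthesizeSpeech and the SpeechClient interface are hypothetical: the interface describes only the part of the client this sketch touches, on the assumption that a real client from the openai package is structurally compatible with it. This variant buffers the audio in memory rather than streaming it.

```typescript
// Minimal structural view of the client: only what this sketch touches.
// Assumption: a real `OpenAI` instance is structurally compatible with it.
interface SpeechClient {
  audio: {
    speech: {
      create(params: {
        model: string;
        voice: string;
        input: string;
      }): Promise<{ arrayBuffer(): Promise<ArrayBuffer> }>;
    };
  };
}

// Hypothetical wrapper around client.audio.speech.create().
async function synthesizeSpeech(
  client: SpeechClient,
  input: string,
  voice = "alloy",
): Promise<Uint8Array> {
  const response = await client.audio.speech.create({
    model: "tts-1", // tts-1-hd is the quality-oriented alternative
    voice,
    input,
  });
  // Buffer the whole clip in memory; reasonable for short inputs.
  return new Uint8Array(await response.arrayBuffer());
}

// Intended use (requires the `openai` package to be installed):
//
//   import OpenAI from "openai";
//   import { writeFile } from "node:fs/promises";
//   const bytes = await synthesizeSpeech(new OpenAI(), "Hello from the Audio API.");
//   await writeFile("speech.mp3", bytes);
```

Passing the client in as a parameter lets the wrapper run against a stub in tests, so no audio is generated and no API key is needed.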

Step 3: Audio Transcription

Convert audio to text using client.audio.transcriptions.create(). Upload the audio file and specify the model. The SDK accepts various file input types including ReadStream, File objects, and buffers wrapped with toFile().

Key considerations:

The model is typically whisper-1
Audio file can be provided as a stream, File, or via toFile() helper
Supports multiple audio formats (mp3, mp4, mpeg, mpga, m4a, wav, webm)
Optional language parameter (an ISO-639-1 code such as en) hints the source language
With the default JSON response format, the transcribed text is returned in transcription.text
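
A minimal sketch of the transcription call, under the same assumptions as before: transcribe, hasSupportedExtension, and the TranscriptionClient interface are hypothetical names, and the interface covers only the fields used here. The extension check is a cheap local guard based on the format list above, not something the SDK performs.

```typescript
// Formats taken from the list above; used for a local check before upload.
const SUPPORTED_EXTENSIONS = ["mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"];

// Minimal structural view of the client: only what this sketch touches.
// Assumption: a real `OpenAI` instance is structurally compatible with it.
interface TranscriptionClient {
  audio: {
    transcriptions: {
      create(params: {
        file: unknown;
        model: string;
        language?: string;
      }): Promise<{ text: string }>;
    };
  };
}

function hasSupportedExtension(filename: string): boolean {
  const ext = filename.split(".").pop()?.toLowerCase() ?? "";
  return SUPPORTED_EXTENSIONS.includes(ext);
}

// Hypothetical wrapper around client.audio.transcriptions.create().
async function transcribe(
  client: TranscriptionClient,
  file: unknown,
  filename: string,
  language?: string,
): Promise<string> {
  if (!hasSupportedExtension(filename)) {
    throw new Error(`Unsupported audio format: ${filename}`);
  }
  const transcription = await client.audio.transcriptions.create({
    file,
    model: "whisper-1",
    // Only send `language` when the caller supplied a hint.
    ...(language ? { language } : {}),
  });
  return transcription.text;
}

// Intended use (requires the `openai` package; `buffer` is a placeholder):
//
//   import fs from "node:fs";
//   import OpenAI, { toFile } from "openai";
//   const client = new OpenAI();
//   await transcribe(client, fs.createReadStream("meeting.mp3"), "meeting.mp3", "en");
//   // In-memory audio: wrap the buffer so the upload carries a filename.
//   await transcribe(client, await toFile(buffer, "memo.wav"), "memo.wav");
```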

Step 4: Audio Translation

Translate audio in any supported language to English text using client.audio.translations.create(). This combines speech recognition with translation in a single API call.

Key considerations:

Translation always outputs English text regardless of source language
Accepts the same file input options as transcription
Uses the same Whisper model (whisper-1) as transcription
Useful for multilingual applications and content processing
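
The translation call has the same shape as transcription, as the sketch below illustrates. The names translateToEnglish, translateAll, and TranslationClient are hypothetical, and the interface again assumes that a real client from the openai package is structurally compatible. The batch helper processes files one at a time, which is a simplification chosen for clarity rather than throughput.

```typescript
// Minimal structural view of the client: only what this sketch touches.
// Assumption: a real `OpenAI` instance is structurally compatible with it.
interface TranslationClient {
  audio: {
    translations: {
      create(params: { file: unknown; model: string }): Promise<{ text: string }>;
    };
  };
}

// Hypothetical wrapper around client.audio.translations.create().
// The output is English text regardless of the language spoken in the audio.
async function translateToEnglish(
  client: TranslationClient,
  file: unknown,
): Promise<string> {
  const translation = await client.audio.translations.create({
    file,
    model: "whisper-1",
  });
  return translation.text;
}

// Translate several recordings sequentially, keyed by a caller-chosen name.
async function translateAll(
  client: TranslationClient,
  items: Array<{ name: string; file: unknown }>,
): Promise<Map<string, string>> {
  const results = new Map<string, string>();
  for (const item of items) {
    results.set(item.name, await translateToEnglish(client, item.file));
  }
  return results;
}

// Intended use (requires the `openai` package to be installed):
//
//   import fs from "node:fs";
//   import OpenAI from "openai";
//   const english = await translateToEnglish(
//     new OpenAI(),
//     fs.createReadStream("interview_de.mp3"),
//   );
```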

GitHub URL

Workflow Repository