Workflow:Openai Openai node Audio Processing

From Leeroopedia
Knowledge Sources
Domains: Audio, TTS, STT, API_Integration
Last Updated: 2026-02-15 12:00 GMT

Overview

End-to-end process for text-to-speech synthesis, speech-to-text transcription, and audio translation using the OpenAI Audio API through the Node.js SDK.

Description

This workflow covers the OpenAI Audio API's three capabilities: text-to-speech (TTS), transcription (speech-to-text), and translation. TTS converts text into natural-sounding speech using configurable voices and models, returning audio as a streamable response. Transcription converts audio files into text using the Whisper model. Translation converts audio in any supported language into English text. The SDK supports streaming TTS output for real-time playback and provides helper functions for audio recording and playback in Node.js environments.

Usage

Execute this workflow when you need to convert between text and audio. Use TTS for voice interfaces, accessibility features, audiobook generation, or adding voice output to chatbots. Use transcription for processing recorded meetings, voice commands, dictation, or any audio-to-text conversion. Use translation for converting foreign-language audio into English text.

Execution Steps

Step 1: Client Setup

Initialize the OpenAI client. Audio operations use the same client instance as other API calls. To upload in-memory buffers as files, also import the toFile() helper from the SDK.

Key considerations:

  • Import toFile from openai for buffer-to-file conversion
  • The same client instance handles all audio operations
  • Audio helper functions for recording/playback are in openai/helpers/audio
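The setup step can be preceded by a small configuration check, sketched below. The helper name buildClientOptions and the retry and timeout values are assumptions made for illustration and are not part of the SDK; only the option names (apiKey, maxRetries, timeout) correspond to options the client constructor accepts.

```typescript
// Hypothetical helper (not part of the SDK): gathers the options that would be
// passed to the client constructor and fails fast when the API key is missing.
interface AudioClientOptions {
  apiKey: string;
  maxRetries: number; // illustrative value, tune for your workload
  timeout: number; // request timeout in milliseconds, illustrative value
}

export function buildClientOptions(
  env: Record<string, string | undefined>,
): AudioClientOptions {
  // Trim to catch keys that were pasted with surrounding whitespace.
  const apiKey = env.OPENAI_API_KEY?.trim();
  if (!apiKey) {
    throw new Error('OPENAI_API_KEY is not set');
  }
  return { apiKey, maxRetries: 2, timeout: 60000 };
}
```

With the SDK installed, the result would typically be used as new OpenAI(buildClientOptions(process.env)) after import OpenAI, { toFile } from 'openai'. The client also reads OPENAI_API_KEY from the environment on its own when no apiKey is passed, so the explicit check mainly gives an earlier and clearer error.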

Step 2: Text-to-Speech Generation

Generate spoken audio from text using client.audio.speech.create(). Specify the model, voice, and input text. The response contains the audio data as a streamable body that can be saved to a file or piped to an audio player.

Key considerations:

  • Select a model (e.g., tts-1 for speed, tts-1-hd for quality)
  • Choose a voice (e.g., alloy, echo, fable, onyx, nova, shimmer)
  • The response body is a readable stream for efficient handling
  • Convert to buffer with response.arrayBuffer() for in-memory use
  • Stream directly to file with pipe for large audio outputs
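The in-memory path described above can be sketched as follows. The SpeechClient interface is a locally declared, narrowed view of the client (an assumption for illustration, so that a stub can replace the real client in tests), and synthesizeSpeech is a hypothetical helper name. The call itself mirrors client.audio.speech.create() with the model, voice, and input parameters named in this step.

```typescript
// Narrowed structural view of the client used by this sketch. Declared locally
// (an assumption for illustration) so a stub can stand in for the real client.
interface SpeechClient {
  audio: {
    speech: {
      create(params: { model: string; voice: string; input: string }): Promise<{
        arrayBuffer(): Promise<ArrayBuffer>;
      }>;
    };
  };
}

// Hypothetical helper: requests speech for `text` and returns the audio bytes.
export async function synthesizeSpeech(
  client: SpeechClient,
  text: string,
  voice: string = 'alloy',
): Promise<Uint8Array> {
  if (text.trim().length === 0) {
    throw new Error('input text must not be empty');
  }
  const response = await client.audio.speech.create({
    model: 'tts-1', // tts-1-hd trades speed for quality
    voice,
    input: text,
  });
  // Buffers the whole response in memory, which suits short clips.
  return new Uint8Array(await response.arrayBuffer());
}
```

The returned bytes could then be written to disk, for example with fs.promises.writeFile('speech.mp3', bytes), where the file name is a placeholder. For long inputs, streaming the response body to a file avoids holding the full audio in memory.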

Step 3: Audio Transcription

Convert audio to text using client.audio.transcriptions.create(). Upload the audio file and specify the model. The SDK accepts various file input types including ReadStream, File objects, and buffers wrapped with toFile().

Key considerations:

  • The model is typically whisper-1
  • Audio file can be provided as a stream, File, or via toFile() helper
  • Supports multiple audio formats (mp3, mp4, mpeg, mpga, m4a, wav, webm)
  • The optional language parameter (an ISO-639-1 code such as en or de) hints at the source language
  • Returns transcribed text in transcription.text
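A minimal sketch of the transcription call follows. The TranscriptionClient interface is a simplified local stand-in for the real client type (an assumption for illustration; the SDK's own types are richer), and transcribe is a hypothetical helper name. The file argument is left as unknown here because it may be a read stream, a File, or the result of toFile().

```typescript
// Simplified local view of the client (an assumption for illustration).
interface TranscriptionClient {
  audio: {
    transcriptions: {
      create(params: {
        file: unknown;
        model: string;
        language?: string;
      }): Promise<{ text: string }>;
    };
  };
}

// Hypothetical helper: sends one audio file and returns the transcribed text.
export async function transcribe(
  client: TranscriptionClient,
  file: unknown,
  language?: string,
): Promise<string> {
  const transcription = await client.audio.transcriptions.create({
    file,
    model: 'whisper-1',
    // Only include the hint when the caller supplied one.
    ...(language ? { language } : {}),
  });
  return transcription.text;
}
```

In an application, file might be fs.createReadStream('meeting.mp3') or await toFile(buffer, 'meeting.mp3'), where the file name is a placeholder and buffer holds audio already in memory.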

Step 4: Audio Translation

Translate audio in any supported language to English text using client.audio.translations.create(). This combines speech recognition with translation in a single API call.

Key considerations:

  • Translation always outputs English text regardless of source language
  • Same file input options as transcription
  • Uses the same Whisper model (whisper-1) as transcription
  • Useful for multilingual applications and content processing
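The translation step can be sketched for a batch of files as shown below. The TranslationClient interface is a simplified local stand-in for the real client type, and translateAllToEnglish is a hypothetical helper name; both are assumptions for illustration. Requests are issued one at a time to keep the sketch simple, which is a design choice here and not a requirement of the API.

```typescript
// Simplified local view of the client (an assumption for illustration).
interface TranslationClient {
  audio: {
    translations: {
      create(params: { file: unknown; model: string }): Promise<{ text: string }>;
    };
  };
}

// Hypothetical helper: translates each audio file to English text, in order.
export async function translateAllToEnglish(
  client: TranslationClient,
  files: unknown[],
): Promise<string[]> {
  const results: string[] = [];
  for (const file of files) {
    // Sequential calls preserve input order and avoid bursts of requests.
    const translation = await client.audio.translations.create({
      file,
      model: 'whisper-1',
    });
    results.push(translation.text);
  }
  return results;
}
```

No target language is passed because the endpoint always produces English text. Each entry in files can be any of the input types accepted for transcription.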
