Implementation:Togethercomputer Together python Audio Speech Types

From Leeroopedia
Knowledge Sources
Domains: Audio, Type_System
Last Updated: 2026-02-15 16:00 GMT

Overview

Concrete type definitions for audio speech, transcription, and translation APIs provided by the Together Python SDK.

Description

This module defines the request, response, and enum types for the text-to-speech (TTS) and speech-to-text (STT) audio APIs. AudioSpeechRequest holds the TTS request parameters, and AudioSpeechStreamResponse carries the audio response and provides a stream_to_file() helper. AudioTranscriptionRequest and AudioTranslationRequest hold the STT request parameters. Transcription results come in two forms: a simple one (AudioTranscriptionResponse, text only) and a verbose one (AudioTranscriptionVerboseResponse), which can also include timestamped segments, words, and speaker diarization segments. VoiceListResponse lists the available voices. The module also defines enums for audio formats, languages, and encodings.

Usage

Import these types when you need to type-hint audio-related data structures, configure audio request parameters, or process audio response objects.
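
As a rough illustration of configuring request parameters, the sketch below mirrors the AudioSpeechRequest fields and defaults listed in the Signature section using a plain dataclass, so it runs without the SDK installed. SpeechParams and to_kwargs are hypothetical names introduced here, not part of the Together SDK, and response_encoding is left out because its string values are not shown on this page.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class SpeechParams:
    # Hypothetical stand-in for AudioSpeechRequest (the real type is a
    # Pydantic model). Field names and defaults follow the Signature section.
    model: str                      # required
    input: str                      # required
    voice: Optional[str] = None     # optional, no default voice
    response_format: str = "mp3"    # AudioResponseFormat.MP3
    language: str = "en"            # AudioLanguage.EN
    sample_rate: int = 44100
    stream: bool = False


def to_kwargs(params: SpeechParams) -> Dict[str, Any]:
    """Hypothetical helper: drop unset (None) fields before building a call."""
    return {key: value for key, value in asdict(params).items() if value is not None}


kwargs = to_kwargs(SpeechParams(model="cartesia/sonic", input="Hello, world!"))
print(sorted(kwargs))
```

Only model and input have to be supplied; every other field falls back to its default. Whether each resulting key is accepted by a given client call is an assumption to check against the installed SDK version.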

Code Reference

Source Location

Signature

class AudioResponseFormat(str, Enum):
    MP3 = "mp3"
    WAV = "wav"
    RAW = "raw"

class AudioLanguage(str, Enum):
    EN = "en"
    DE = "de"
    FR = "fr"
    ES = "es"
    # ... 15 languages total

class AudioSpeechRequest(BaseModel):
    model: str
    input: str
    voice: str | None = None
    response_format: AudioResponseFormat = AudioResponseFormat.MP3
    language: AudioLanguage = AudioLanguage.EN
    response_encoding: AudioResponseEncoding = AudioResponseEncoding.PCM_F32LE
    sample_rate: int = 44100
    stream: bool = False

class AudioSpeechStreamResponse(BaseModel):
    response: TogetherResponse | Iterator[TogetherResponse]
    def stream_to_file(self, file_path: str, response_format=None) -> None: ...

class AudioTranscriptionResponse(BaseModel):
    text: str

class AudioTranscriptionVerboseResponse(BaseModel):
    text: str
    segments: Optional[List[AudioTranscriptionSegment]] = None
    words: Optional[List[AudioTranscriptionWord]] = None
    speaker_segments: Optional[List[AudioSpeakerSegment]] = None

class VoiceListResponse(BaseModel):
    data: List[ModelVoices]
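
The enums above subclass both str and Enum, which means a member compares equal to its plain string value and can be looked up from that value. The sketch below re-declares AudioResponseFormat locally so it runs without the SDK; parse_format is a hypothetical helper added for illustration, not an SDK function.

```python
from enum import Enum


class AudioResponseFormat(str, Enum):
    # Local copy of the enum from the signature listing above.
    MP3 = "mp3"
    WAV = "wav"
    RAW = "raw"


def parse_format(value: str) -> AudioResponseFormat:
    """Hypothetical helper: turn user input into an enum member."""
    try:
        # Lookup by value; raises ValueError for unknown formats.
        return AudioResponseFormat(value.strip().lower())
    except ValueError:
        allowed = ", ".join(member.value for member in AudioResponseFormat)
        raise ValueError(
            f"unsupported format {value!r}; expected one of: {allowed}"
        ) from None


fmt = parse_format(" WAV ")
print(fmt is AudioResponseFormat.WAV)  # True
print(fmt == "wav")                    # True, because of the str mixin
```

Because of the str mixin, a plain string such as "wav" and the member AudioResponseFormat.WAV compare equal, which is presumably why the usage examples on this page can pass string literals.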

Import

from together.types.audio_speech import (
    AudioResponseFormat, AudioLanguage, AudioSpeechRequest,
    AudioSpeechStreamResponse, AudioTranscriptionResponse,
    AudioTranscriptionVerboseResponse, VoiceListResponse,
)

I/O Contract

Inputs

Name | Type | Required | Description
(constructed from API response/request dicts) | Dict/params | Yes | Audio API parameters and response data

Outputs

Name | Type | Description
AudioSpeechStreamResponse | Pydantic Model | Audio data with stream_to_file() helper
AudioTranscriptionResponse | Pydantic Model | Simple text transcription
AudioTranscriptionVerboseResponse | Pydantic Model | Detailed transcription with timestamps and speaker info
VoiceListResponse | Pydantic Model | Available models and their voices
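
Since a transcription call may return either the simple or the verbose model, and the verbose model's segments field is optional, calling code usually has to handle three cases. The following is a minimal sketch using dataclass stand-ins so it runs without the SDK. Segment, SimpleTranscript, VerboseTranscript, and format_transcript are hypothetical names; the start, end, and text fields are assumed from the usage example on this page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    # Stand-in for AudioTranscriptionSegment (assumed fields).
    start: float
    end: float
    text: str


@dataclass
class SimpleTranscript:
    # Stand-in for AudioTranscriptionResponse.
    text: str


@dataclass
class VerboseTranscript:
    # Stand-in for AudioTranscriptionVerboseResponse; segments may be None.
    text: str
    segments: Optional[List[Segment]] = None


def format_transcript(transcript) -> List[str]:
    """Return display lines, using timestamps only when segments exist."""
    segments = getattr(transcript, "segments", None)
    if not segments:
        # Simple response, or verbose response without segment data.
        return [transcript.text]
    return [f"[{s.start:.2f}-{s.end:.2f}] {s.text}" for s in segments]


verbose = VerboseTranscript(
    text="hi there",
    segments=[Segment(0.0, 0.5, "hi"), Segment(0.5, 1.0, "there")],
)
for line in format_transcript(verbose):
    print(line)
```

Using getattr with a None default covers the simple model, which has no segments attribute at all, as well as a verbose model whose segments field was left unset.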

Usage Examples

from together import Together

client = Together()

# TTS - AudioSpeechStreamResponse with stream_to_file
audio = client.audio.speech.create(
    model="cartesia/sonic",
    input="Hello, world!",
    voice="laidback woman",
)
audio.stream_to_file("output.wav")

# Transcription - returns AudioTranscriptionResponse or verbose variant
transcript = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="verbose_json",
)
print(transcript.text)
# segments is optional and may be None, so fall back to an empty list
for seg in getattr(transcript, "segments", None) or []:
    print(f"[{seg.start}-{seg.end}] {seg.text}")
