Implementation:Togethercomputer Together python Audio Speech Types

Knowledge Sources	Together Python
Domains	Audio, Type_System
Last Updated	2026-02-15 16:00 GMT

Overview

Concrete type definitions for audio speech, transcription, and translation APIs provided by the Together Python SDK.

Description

This module defines the request, response, and enum types for the audio APIs. AudioSpeechRequest holds the text-to-speech (TTS) request parameters, and AudioSpeechStreamResponse wraps the returned audio and provides a stream_to_file() helper. AudioTranscriptionRequest and AudioTranslationRequest hold the speech-to-text (STT) request parameters. Transcription results come in two shapes: AudioTranscriptionResponse carries only the text, while AudioTranscriptionVerboseResponse adds optional segments, word-level entries, and speaker segments (timestamps and speaker diarization). VoiceListResponse lists the available voices. The module also defines enums for audio formats, languages, and encodings.

Usage

Import these types when you need to type-hint audio-related data structures, configure audio request parameters, or process audio response objects.

Code Reference

Source Location

Repository: Together Python
File: src/together/types/audio_speech.py
Lines: 1-311

Signature

class AudioResponseFormat(str, Enum):
    MP3 = "mp3"
    WAV = "wav"
    RAW = "raw"

class AudioLanguage(str, Enum):
    EN = "en"; DE = "de"; FR = "fr"; ES = "es"
    # ... 15 languages total

class AudioSpeechRequest(BaseModel):
    model: str
    input: str
    voice: str | None = None
    response_format: AudioResponseFormat = AudioResponseFormat.MP3
    language: AudioLanguage = AudioLanguage.EN
    response_encoding: AudioResponseEncoding = AudioResponseEncoding.PCM_F32LE
    sample_rate: int = 44100
    stream: bool = False

class AudioSpeechStreamResponse(BaseModel):
    response: TogetherResponse | Iterator[TogetherResponse]
    def stream_to_file(self, file_path: str, response_format=None) -> None: ...

class AudioTranscriptionResponse(BaseModel):
    text: str

class AudioTranscriptionVerboseResponse(BaseModel):
    text: str
    segments: Optional[List[AudioTranscriptionSegment]] = None
    words: Optional[List[AudioTranscriptionWord]] = None
    speaker_segments: Optional[List[AudioSpeakerSegment]] = None

class VoiceListResponse(BaseModel):
    data: List[ModelVoices]
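
The enums in the signature mix in str, so their members behave like plain strings as well as enum values. The sketch below re-declares AudioResponseFormat with only the standard library to show that behaviour without requiring the SDK. The parse_format helper is hypothetical and is not part of the Together SDK.

```python
from enum import Enum


class AudioResponseFormat(str, Enum):
    # Re-declared here for illustration; mirrors the three members listed above.
    MP3 = "mp3"
    WAV = "wav"
    RAW = "raw"


def parse_format(value: str) -> AudioResponseFormat:
    """Hypothetical helper: normalise user input to an enum member.

    Raises ValueError if the value is not one of the declared formats.
    """
    return AudioResponseFormat(value.strip().lower())


fmt = parse_format(" WAV ")
print(fmt is AudioResponseFormat.WAV)  # lookup by value returns the member itself
print(fmt == "wav")                    # the str mixin allows comparison with plain strings
print(fmt.value)                       # the underlying string value
```

Because members compare equal to their string values, code that receives either a raw string or an enum member can treat both the same way.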

Import

from together.types.audio_speech import (
    AudioResponseFormat, AudioLanguage, AudioSpeechRequest,
    AudioSpeechStreamResponse, AudioTranscriptionResponse,
    AudioTranscriptionVerboseResponse, VoiceListResponse,
)

I/O Contract

Inputs

Name	Type	Required	Description
(constructed from API response/request dicts)	Dict/params	Yes	Audio API parameters and response data

Outputs

Name	Type	Description
AudioSpeechStreamResponse	Pydantic Model	Audio data with stream_to_file() helper
AudioTranscriptionResponse	Pydantic Model	Simple text transcription
AudioTranscriptionVerboseResponse	Pydantic Model	Detailed transcription with timestamps and speaker info
VoiceListResponse	Pydantic Model	Available models and their voices
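
Since segments, words, and speaker_segments are all Optional on the verbose response, consumer code has to tolerate None. The following is a minimal sketch of that pattern. Segment and VerboseResponse are stand-in dataclasses rather than the SDK's Pydantic models, and only the start, end, and text fields are assumed on a segment. format_segments is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    # Stand-in for AudioTranscriptionSegment; only start/end/text are assumed.
    start: float
    end: float
    text: str


@dataclass
class VerboseResponse:
    # Stand-in for AudioTranscriptionVerboseResponse.
    text: str
    segments: Optional[List[Segment]] = None


def format_segments(response) -> List[str]:
    """Return one '[start-end] text' line per segment, or [] if there are none.

    Works for both the simple response (no 'segments' attribute) and the
    verbose response (attribute present but possibly None).
    """
    segments = getattr(response, "segments", None) or []
    return [f"[{s.start:.2f}-{s.end:.2f}] {s.text}" for s in segments]


without_segments = VerboseResponse(text="hello world")
with_segments = VerboseResponse(
    text="hello world",
    segments=[Segment(0.0, 0.5, "hello"), Segment(0.5, 1.0, "world")],
)
print(format_segments(without_segments))
print(format_segments(with_segments))
```

Using getattr with a None default covers the simple and verbose shapes with the same code path.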

Usage Examples

from together import Together

client = Together()

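# TTS: create() returns an AudioSpeechStreamResponse
audio = client.audio.speech.create(
    model="cartesia/sonic",
    input="Hello, world!",
    voice="laidback woman",
    response_format="wav",  # request WAV explicitly so it matches the file extension
)
# Write the returned audio to disk
audio.stream_to_file("output.wav")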

# Transcription: returns AudioTranscriptionResponse or the verbose variant
transcript = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="verbose_json",
)
print(transcript.text)
# segments is Optional: it may be missing (simple response) or None
for seg in getattr(transcript, "segments", None) or []:
    print(f"[{seg.start}-{seg.end}] {seg.text}")
