Implementation:Togethercomputer Together python Audio Speech Types
| Knowledge Sources | |
|---|---|
| Domains | Audio, Type_System |
| Last Updated | 2026-02-15 16:00 GMT |
Overview
Concrete type definitions for audio speech, transcription, and translation APIs provided by the Together Python SDK.
Description
This module defines the type system for the SDK's audio APIs: text-to-speech (TTS), speech-to-text (STT) transcription, and translation. Key types include: AudioSpeechRequest (TTS request parameters); AudioSpeechStreamResponse (audio response with a stream_to_file() helper); AudioTranscriptionRequest and AudioTranslationRequest (STT request parameters); AudioTranscriptionResponse (text only) and AudioTranscriptionVerboseResponse (text plus optional segments, words, and speaker segments, which carry timestamps and speaker diarization); and VoiceListResponse (available models and their voices). The module also defines enums for audio formats, languages, and encodings.
Usage
Import these types when you need to type-hint audio-related data structures, configure audio request parameters, or process audio response objects.
Code Reference
Source Location
- Repository: Together Python
- File: src/together/types/audio_speech.py
- Lines: 1-311
Signature
class AudioResponseFormat(str, Enum):
    MP3 = "mp3"
    WAV = "wav"
    RAW = "raw"

class AudioLanguage(str, Enum):
    EN = "en"; DE = "de"; FR = "fr"; ES = "es"
    # ... 15 languages total

class AudioSpeechRequest(BaseModel):
    model: str
    input: str
    voice: str | None = None
    response_format: AudioResponseFormat = AudioResponseFormat.MP3
    language: AudioLanguage = AudioLanguage.EN
    response_encoding: AudioResponseEncoding = AudioResponseEncoding.PCM_F32LE
    sample_rate: int = 44100
    stream: bool = False

class AudioSpeechStreamResponse(BaseModel):
    response: TogetherResponse | Iterator[TogetherResponse]

    def stream_to_file(self, file_path: str, response_format=None) -> None: ...

class AudioTranscriptionResponse(BaseModel):
    text: str

class AudioTranscriptionVerboseResponse(BaseModel):
    text: str
    segments: Optional[List[AudioTranscriptionSegment]] = None
    words: Optional[List[AudioTranscriptionWord]] = None
    speaker_segments: Optional[List[AudioSpeakerSegment]] = None

class VoiceListResponse(BaseModel):
    data: List[ModelVoices]
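The enums in the signature mix in str, so a member compares equal to its plain string value and serializes as that string. The sketch below illustrates this with a local stand-in copy of AudioResponseFormat (it does not import the SDK); parse_format is a hypothetical helper introduced only for this example.

```python
import json
from enum import Enum


class AudioResponseFormat(str, Enum):
    """Local stand-in mirroring the enum shown in the signature."""
    MP3 = "mp3"
    WAV = "wav"
    RAW = "raw"


def parse_format(value: str) -> AudioResponseFormat:
    """Hypothetical helper: normalize user input to an enum member.

    Raises ValueError if the value is not a known format.
    """
    return AudioResponseFormat(value.strip().lower())


fmt = parse_format(" WAV ")

# Lookup by value returns the canonical member.
print(fmt is AudioResponseFormat.WAV)

# Because of the str mixin, the member equals its raw string value.
print(fmt == "wav")

# json.dumps treats the member as an ordinary string.
print(json.dumps({"response_format": fmt}))
```

One consequence of this design is that callers can usually pass either the enum member or the bare string where a format is expected, while unknown strings fail fast at construction time.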
Import
from together.types.audio_speech import (
    AudioResponseFormat, AudioLanguage, AudioSpeechRequest,
    AudioSpeechStreamResponse, AudioTranscriptionResponse,
    AudioTranscriptionVerboseResponse, VoiceListResponse,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (constructed from API response/request dicts) | Dict/params | Yes | Audio API parameters and response data |
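As a rough illustration of building a request object from a parameter dict, the sketch below uses a plain dataclass as a stand-in for AudioSpeechRequest. It is not the SDK's Pydantic model: SpeechRequestSketch and build_request are hypothetical names, only a subset of the fields is reproduced, and the defaults are copied from the signature in this page.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class SpeechRequestSketch:
    """Stand-in for AudioSpeechRequest (subset of fields, not the SDK class)."""
    model: str
    input: str
    voice: Optional[str] = None
    response_format: str = "mp3"   # default taken from the signature
    sample_rate: int = 44100       # default taken from the signature
    stream: bool = False


def build_request(params: Dict[str, Any]) -> SpeechRequestSketch:
    """Hypothetical helper: check required keys, then apply defaults."""
    missing = [key for key in ("model", "input") if key not in params]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    # Unknown keys raise TypeError from the dataclass constructor.
    return SpeechRequestSketch(**params)


req = build_request({"model": "cartesia/sonic", "input": "Hello, world!"})
print(asdict(req))
```

Omitted optional parameters fall back to their declared defaults, so only model and input need to be supplied.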
Outputs
| Name | Type | Description |
|---|---|---|
| AudioSpeechStreamResponse | Pydantic Model | Audio data with stream_to_file() helper |
| AudioTranscriptionResponse | Pydantic Model | Simple text transcription |
| AudioTranscriptionVerboseResponse | Pydantic Model | Detailed transcription with timestamps and speaker info |
| VoiceListResponse | Pydantic Model | Available models and their voices |
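Because a transcription call can yield either the simple or the verbose response, and the verbose fields are Optional, consuming code may want to handle both shapes in one place. The following is a minimal sketch using dataclass stand-ins rather than the SDK models; Segment, SimpleResponse, VerboseResponse, and format_transcript are hypothetical names, and the segment fields (start, end, text) are assumed from the usage example in this page.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Segment:
    """Stand-in for a transcription segment (fields assumed)."""
    start: float
    end: float
    text: str


@dataclass
class SimpleResponse:
    """Stand-in for AudioTranscriptionResponse."""
    text: str


@dataclass
class VerboseResponse:
    """Stand-in for AudioTranscriptionVerboseResponse (subset of fields)."""
    text: str
    segments: Optional[List[Segment]] = None


def format_transcript(resp: Union[SimpleResponse, VerboseResponse]) -> List[str]:
    """Return timestamped lines when segments exist, else the plain text."""
    # getattr covers the simple response, which has no segments attribute;
    # the falsy check covers a verbose response whose segments is None or empty.
    segments = getattr(resp, "segments", None)
    if not segments:
        return [resp.text]
    return [f"[{s.start:.2f}-{s.end:.2f}] {s.text}" for s in segments]


verbose = VerboseResponse(text="hello", segments=[Segment(0.0, 1.5, "hello")])
print(format_transcript(SimpleResponse(text="hello")))
print(format_transcript(verbose))
```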
Usage Examples
from together import Together

client = Together()

# TTS: create() returns an AudioSpeechStreamResponse; write it to disk
audio = client.audio.speech.create(
    model="cartesia/sonic",
    input="Hello, world!",
    voice="laidback woman",
)
audio.stream_to_file("output.wav")

# Transcription: returns AudioTranscriptionResponse or the verbose variant
transcript = client.audio.transcriptions.create(
    file="audio.mp3",
    model="openai/whisper-large-v3",
    response_format="verbose_json",
)
print(transcript.text)

# segments is Optional: it may be missing (simple response) or None
for seg in getattr(transcript, "segments", None) or []:
    print(f"[{seg.start}-{seg.end}] {seg.text}")
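The internals of stream_to_file() are not shown in this page. As a conceptual stand-in only, the sketch below shows the general idea of writing audio incrementally from an iterator of byte chunks instead of buffering the whole payload in memory. write_chunks is a hypothetical function and the chunk contents are dummy bytes, not real audio.

```python
import os
import tempfile
from typing import Iterable


def write_chunks(chunks: Iterable[bytes], file_path: str) -> int:
    """Write byte chunks to file_path in order and return the total byte count."""
    total = 0
    with open(file_path, "wb") as fh:
        for chunk in chunks:
            fh.write(chunk)
            total += len(chunk)
    return total


# Dummy data standing in for streamed audio chunks.
fake_stream = (bytes([i]) * 4 for i in range(3))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.raw")
    written = write_chunks(fake_stream, path)
    print(written, os.path.getsize(path))
```

Writing chunk by chunk keeps memory use bounded by the chunk size, which matters most when stream=True and the audio arrives progressively.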