Principle:Togethercomputer Together python Chat Completion Response

Attribute	Value
Principle Name	Chat_Completion_Response
Overview	Pattern for processing and extracting data from chat completion API responses.
Domain	NLP, API_Client, Inference
Repository	togethercomputer/together-python
Last Updated	2026-02-15 16:00 GMT

Description

Chat completion response handling covers parsing both non-streaming (complete response) and streaming (chunk-by-chunk) responses from the chat completion API.

Non-Streaming Responses

A non-streaming response is returned as a single ChatCompletionResponse object containing the full generated text, metadata, and token usage statistics. The response follows the OpenAI-compatible format:

id -- A unique request identifier.
object -- The object type (always "chat.completion").
created -- Unix timestamp of when the response was created.
model -- The model that generated the response.
choices -- An array of generated completions, each containing:
- index -- The choice index (0-based).
- message -- The generated ChatCompletionMessage with role and content.
- finish_reason -- Why generation stopped: "stop" (natural end or stop sequence), "length" (max_tokens reached), "eos" (end-of-sequence token), "tool_calls" (model invoked a function), or "error".
- logprobs -- Token-level log probabilities (when requested).
- seed -- The random seed used for generation.
usage -- Token count statistics: prompt_tokens, completion_tokens, total_tokens.
prompt -- The processed prompt (when echo is enabled).
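
The fields above can be read with plain attribute access, as in response.choices[0].message.content. The following is a minimal sketch under stated assumptions: summarize_response is a hypothetical helper (not part of together-python), and the SimpleNamespace object is a stand-in with the same shape as the fields listed above, so the example runs without an API call.

```python
from types import SimpleNamespace


def summarize_response(response):
    """Collect commonly used fields of a non-streaming response into a dict.

    Hypothetical helper for illustration; not part of together-python.
    """
    if not response.choices:
        raise ValueError("response contains no choices")
    choice = response.choices[0]
    usage = response.usage
    return {
        "id": response.id,
        "model": response.model,
        "text": choice.message.content,
        "finish_reason": choice.finish_reason,
        # usage is treated as optional here (an assumption of this sketch)
        "total_tokens": usage.total_tokens if usage is not None else None,
    }


# Stand-in object mirroring the documented fields (all values are made up).
fake_response = SimpleNamespace(
    id="req-123",
    object="chat.completion",
    created=1700000000,
    model="example/model",
    choices=[
        SimpleNamespace(
            index=0,
            message=SimpleNamespace(role="assistant", content="Hello!"),
            finish_reason="stop",
            logprobs=None,
            seed=42,
        )
    ],
    usage=SimpleNamespace(prompt_tokens=5, completion_tokens=2, total_tokens=7),
    prompt=None,
)

summary = summarize_response(fake_response)
print(summary["text"], summary["finish_reason"], summary["total_tokens"])
```

With a real client, the same helper would be applied to the object returned by the chat completion call instead of fake_response.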

Streaming Responses

When stream=True, the API returns an iterator of ChatCompletionChunk objects delivered via Server-Sent Events. Each chunk contains a partial update:

choices -- Each choice contains a delta with incremental content (typically one or a few tokens per chunk).
finish_reason -- Set on the final chunk to indicate why generation stopped; None on intermediate chunks.
usage -- Token usage data (may be included on the final chunk).

The consumer iterates over the stream, concatenating delta.content values to reconstruct the full response.
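
The accumulation loop described above can be sketched as follows. This is an illustration, not the library's own code: collect_stream and fake_chunk are hypothetical names, and the fake chunks only imitate the choices / delta / finish_reason / usage shape described in this section. The sketch reads finish_reason from the first choice, which is an assumption about where the value is populated.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Concatenate delta.content across chunks.

    Returns (text, finish_reason, usage). Hypothetical helper for illustration.
    """
    parts = []
    finish_reason = None
    usage = None
    for chunk in chunks:
        # Usage, when present, typically arrives on the final chunk.
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        if not chunk.choices:
            continue
        choice = chunk.choices[0]
        delta = choice.delta
        # delta.content may be None (for example on the first or last chunk).
        if delta is not None and delta.content:
            parts.append(delta.content)
        if choice.finish_reason:
            finish_reason = choice.finish_reason
    return "".join(parts), finish_reason, usage


def fake_chunk(content, finish_reason=None, usage=None):
    """Build a stand-in chunk with a single choice (test data only)."""
    choice = SimpleNamespace(
        index=0,
        delta=SimpleNamespace(content=content),
        finish_reason=finish_reason,
    )
    return SimpleNamespace(choices=[choice], usage=usage)


stream = iter([
    fake_chunk(None),
    fake_chunk("Hel"),
    fake_chunk("lo"),
    fake_chunk(None, finish_reason="stop", usage=SimpleNamespace(total_tokens=7)),
])

text, reason, usage = collect_stream(stream)
print(text, reason, usage.total_tokens)
```

For real-time display, the append step can be paired with a print(delta.content, end="", flush=True) call so that text appears as it arrives.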

Usage

Use response handling after making any chat completion request to extract generated text, tool calls, token usage, and finish reasons.

When to use:

Extracting the generated text from response.choices[0].message.content
Checking finish_reason to determine if generation was truncated or completed naturally
Reading token usage for billing and monitoring
Processing tool calls from the assistant's response
Iterating over streaming chunks for real-time display
Handling multiple choices when n > 1

Patterns to check:

Always verify choices is non-empty before accessing choices[0]
For tool calls, check finish_reason == "tool_calls" and iterate over message.tool_calls
For streaming, handle the case where delta.content is None (common on the first and last chunks)
Monitor usage.total_tokens to track API consumption
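
The checks above can be combined into one small dispatch function. This is a sketch under assumptions: handle_choice is a hypothetical helper, and each tool call is assumed to carry function.name plus a JSON-encoded function.arguments string, as in the OpenAI-compatible format. The stand-in objects replace a real API response.

```python
import json
from types import SimpleNamespace


def handle_choice(response):
    """Return ("tool_calls", [(name, args), ...]) or ("text", content).

    Hypothetical helper for illustration; not part of together-python.
    """
    # Guard against an empty choices list before indexing.
    if not response.choices:
        raise ValueError("response contains no choices")
    choice = response.choices[0]
    tool_calls = getattr(choice.message, "tool_calls", None) or []
    if choice.finish_reason == "tool_calls" and tool_calls:
        calls = []
        for call in tool_calls:
            # Assumption: arguments is a JSON string; treat missing as empty.
            args = json.loads(call.function.arguments or "{}")
            calls.append((call.function.name, args))
        return "tool_calls", calls
    return "text", choice.message.content or ""


# Stand-in response in which the model requested one function call.
tool_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            index=0,
            finish_reason="tool_calls",
            message=SimpleNamespace(
                role="assistant",
                content=None,
                tool_calls=[
                    SimpleNamespace(
                        id="call_0",
                        type="function",
                        function=SimpleNamespace(
                            name="get_weather",
                            arguments='{"city": "Paris"}',
                        ),
                    )
                ],
            ),
        )
    ]
)

kind, payload = handle_choice(tool_response)
print(kind, payload)
```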

Theoretical Basis

API responses follow the OpenAI-compatible format, which has become the de facto standard for chat completion APIs. This format provides:

Choices array -- Supports multiple independent completions per request (controlled by the n parameter), each with its own finish reason and content.
Usage statistics -- Token counts enable cost tracking and context window management. The prompt_tokens count reveals the tokenized size of the input, while completion_tokens tracks the generated output.
Finish reasons -- Semantic labels for generation termination conditions allow the application to distinguish between natural completion, truncation, and tool invocation.
Streaming deltas -- The chunk-based streaming format uses delta objects instead of complete message objects, containing only the incremental content added since the previous chunk. This reduces the data sent per chunk and enables progressive rendering.
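
As an illustration of how usage statistics and finish reasons support monitoring, the sketch below sums token counts over several responses and counts choices that stopped with "length". The name tally_usage and the stand-in data are hypothetical; only the field names come from the response format described above.

```python
from types import SimpleNamespace


def tally_usage(responses):
    """Sum token counts and count truncated choices across responses.

    Hypothetical helper for illustration; not part of together-python.
    """
    totals = {"prompt_tokens": 0, "completion_tokens": 0, "truncated_choices": 0}
    for response in responses:
        usage = response.usage
        if usage is not None:
            totals["prompt_tokens"] += usage.prompt_tokens
            totals["completion_tokens"] += usage.completion_tokens
        # Each choice carries its own finish reason (relevant when n > 1).
        totals["truncated_choices"] += sum(
            1 for choice in response.choices if choice.finish_reason == "length"
        )
    return totals


def fake_response(prompt_tokens, completion_tokens, finish_reasons):
    """Build a stand-in response with one choice per finish reason (test data)."""
    return SimpleNamespace(
        usage=SimpleNamespace(
            prompt_tokens=prompt_tokens,
            completion_tokens=completion_tokens,
            total_tokens=prompt_tokens + completion_tokens,
        ),
        choices=[
            SimpleNamespace(index=i, finish_reason=reason)
            for i, reason in enumerate(finish_reasons)
        ],
    )


totals = tally_usage([
    fake_response(10, 20, ["stop", "length"]),
    fake_response(5, 8, ["stop"]),
])
print(totals)
```

A nonzero truncated_choices count suggests that max_tokens may be too low for the workload.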

Knowledge Sources

Source	Type	URI
Together AI Chat Completions Response	Doc	Together AI Chat Completions Reference
Together AI Streaming	Doc	Together AI Chat Overview
