Principle:Togethercomputer Together python Chat Completion Response
| Attribute | Value |
|---|---|
| Principle Name | Chat_Completion_Response |
| Overview | Pattern for processing and extracting data from chat completion API responses. |
| Domain | NLP, API_Client, Inference |
| Repository | togethercomputer/together-python |
| Last Updated | 2026-02-15 16:00 GMT |
Description
Chat completion response handling covers parsing both non-streaming (complete response) and streaming (chunk-by-chunk) responses from the chat completion API.
Non-Streaming Responses
A non-streaming response is returned as a single ChatCompletionResponse object containing the full generated text, metadata, and token usage statistics. The response follows the OpenAI-compatible format:
- id -- A unique request identifier.
- object -- The object type (always "chat.completion").
- created -- Unix timestamp of when the response was created.
- model -- The model that generated the response.
- choices -- An array of generated completions, each containing:
  - index -- The choice index (0-based).
  - message -- The generated ChatCompletionMessage with role and content.
  - finish_reason -- Why generation stopped: "stop" (natural end or stop sequence), "length" (max_tokens reached), "eos" (end-of-sequence token), "tool_calls" (model invoked a function), or "error".
  - logprobs -- Token-level log probabilities (when requested).
  - seed -- The random seed used for generation.
- usage -- Token count statistics: prompt_tokens, completion_tokens, total_tokens.
- prompt -- The processed prompt (when echo is enabled).
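The fields listed above can be read with plain attribute access. The following is a minimal sketch, assuming the response object exposes choices, usage, and the nested attributes exactly as described; summarize_response is a hypothetical helper, and the SimpleNamespace object is only a stand-in that mimics the shape of a ChatCompletionResponse so the example runs without an API key.

```python
from types import SimpleNamespace


def summarize_response(response):
    """Collect the commonly used fields of a non-streaming response.

    Hypothetical helper: it relies only on the attribute names listed above.
    """
    if not response.choices:
        raise ValueError("response contains no choices")
    choice = response.choices[0]
    usage = response.usage
    return {
        "text": choice.message.content,
        "finish_reason": choice.finish_reason,
        # "length" means generation stopped because max_tokens was reached.
        "truncated": choice.finish_reason == "length",
        "total_tokens": usage.total_tokens if usage is not None else None,
    }


# Stand-in for a real response (assumption: same attribute layout).
fake_response = SimpleNamespace(
    id="req-123",
    object="chat.completion",
    model="example-model",
    choices=[
        SimpleNamespace(
            index=0,
            message=SimpleNamespace(role="assistant", content="Hello!"),
            finish_reason="stop",
        )
    ],
    usage=SimpleNamespace(prompt_tokens=5, completion_tokens=2, total_tokens=7),
)

summary = summarize_response(fake_response)
print(summary)
```

With a real client, the same helper would be applied to the object returned by the chat completion call, provided it follows the layout above.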
Streaming Responses
When stream=True, the API returns an iterator of ChatCompletionChunk objects delivered via Server-Sent Events. Each chunk contains a partial update:
- choices -- Each choice contains a delta with incremental content (typically one or a few tokens per chunk).
- finish_reason -- Set on the final chunk to indicate why generation stopped; None on intermediate chunks.
- usage -- Token usage data (may be included on the final chunk).
The consumer iterates over the stream, concatenating delta.content values to reconstruct the full response.
Usage
Use response handling after making any chat completion request to extract generated text, tool calls, token usage, and finish reasons.
When to use:
- Extracting the generated text from response.choices[0].message.content
- Checking finish_reason to determine if generation was truncated or completed naturally
- Reading token usage for billing and monitoring
- Processing tool calls from the assistant's response
- Iterating over streaming chunks for real-time display
- Handling multiple choices when n > 1
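For the multiple-choice case, one possible approach is to key each completion by its index rather than reading only choices[0]. The sketch below assumes each choice carries index, message.content, and finish_reason; choice_texts is a hypothetical helper and the response is a stand-in object.

```python
from types import SimpleNamespace


def choice_texts(response):
    """Map each choice index to (content, finish_reason). Hypothetical helper."""
    return {
        choice.index: (choice.message.content, choice.finish_reason)
        for choice in response.choices
    }


# Stand-in for a response to a request made with n=2 (assumed layout).
multi_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            index=0,
            message=SimpleNamespace(role="assistant", content="First draft"),
            finish_reason="stop",
        ),
        SimpleNamespace(
            index=1,
            message=SimpleNamespace(role="assistant", content="Second dra"),
            finish_reason="length",
        ),
    ]
)

by_index = choice_texts(multi_response)
# Keep only the completions that were not cut off by max_tokens.
complete = {i: text for i, (text, reason) in by_index.items() if reason != "length"}
print(complete)
```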
Patterns to check:
- Always verify choices is non-empty before accessing choices[0]
- For tool calls, check finish_reason == "tool_calls" and iterate over message.tool_calls
- For streaming, handle the case where delta.content is None (common on the first and last chunks)
- Monitor usage.total_tokens to track API consumption
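The tool-call check can be sketched as below. It assumes that each entry in message.tool_calls exposes function.name and a JSON-encoded function.arguments string, as in the OpenAI-compatible format; that inner layout is an assumption not stated on this page, and extract_tool_calls is a hypothetical helper operating on stand-in objects.

```python
import json
from types import SimpleNamespace


def extract_tool_calls(response):
    """Return a list of (function_name, parsed_arguments) pairs.

    Hypothetical helper; returns an empty list when there is nothing to run.
    """
    if not response.choices:  # guard before touching choices[0]
        return []
    choice = response.choices[0]
    if choice.finish_reason != "tool_calls":
        return []
    calls = []
    for call in choice.message.tool_calls or []:
        # Assumption: arguments arrive as a JSON string.
        calls.append((call.function.name, json.loads(call.function.arguments)))
    return calls


# Stand-in response in which the model asked for one function call.
tool_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            index=0,
            finish_reason="tool_calls",
            message=SimpleNamespace(
                role="assistant",
                content=None,
                tool_calls=[
                    SimpleNamespace(
                        function=SimpleNamespace(
                            name="get_weather",  # hypothetical tool name
                            arguments='{"city": "Paris"}',
                        )
                    )
                ],
            ),
        )
    ]
)

tool_calls = extract_tool_calls(tool_response)
print(tool_calls)
```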
Theoretical Basis
API responses follow the OpenAI-compatible format, which has become a de facto standard for chat completion APIs. This format provides:
- Choices array -- Supports multiple independent completions per request (controlled by the n parameter), each with its own finish reason and content.
- Usage statistics -- Token counts enable cost tracking and context window management. The prompt_tokens count reveals the tokenized size of the input, while completion_tokens tracks the generated output.
- Finish reasons -- Semantic labels for generation termination conditions allow the application to distinguish between natural completion, truncation, and tool invocation.
- Streaming deltas -- The chunk-based streaming format uses delta objects instead of complete message objects, containing only the incremental content added since the previous chunk. This minimizes bandwidth and enables progressive rendering.
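Cost tracking from the usage statistics can be illustrated with a running total across several responses. This is a minimal sketch: UsageTotals is a hypothetical class, and the usage objects are stand-ins that assume only the prompt_tokens and completion_tokens fields described above.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class UsageTotals:
    """Running token totals across responses. Hypothetical helper."""

    prompt_tokens: int = 0
    completion_tokens: int = 0

    @property
    def total_tokens(self):
        return self.prompt_tokens + self.completion_tokens

    def add(self, usage):
        """Add one response's usage; ignore responses without usage data."""
        if usage is None:
            return
        self.prompt_tokens += usage.prompt_tokens
        self.completion_tokens += usage.completion_tokens


totals = UsageTotals()
totals.add(SimpleNamespace(prompt_tokens=10, completion_tokens=4))
totals.add(None)  # e.g. a response that carried no usage data
totals.add(SimpleNamespace(prompt_tokens=3, completion_tokens=2))
print(totals.prompt_tokens, totals.completion_tokens, totals.total_tokens)
```

Keeping prompt and completion counts separate leaves room for pricing schemes that charge the two differently.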
Knowledge Sources
| Source | Type | URI |
|---|---|---|
| Together AI Chat Completions Response | Doc | Together AI Chat Completions Reference |
| Together AI Streaming | Doc | Together AI Chat Overview |