Principle:Togethercomputer Together python Chat Completion Response

From Leeroopedia
Revision as of 18:04, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Togethercomputer_Together_python_Chat_Completion_Response.md)
Principle Name: Chat_Completion_Response
Overview: Pattern for processing and extracting data from chat completion API responses.
Domain: NLP, API_Client, Inference
Repository: togethercomputer/together-python
Last Updated: 2026-02-15 16:00 GMT

Description

Chat completion response handling covers parsing both non-streaming (complete response) and streaming (chunk-by-chunk) responses from the chat completion API.

Non-Streaming Responses

A non-streaming response is returned as a single ChatCompletionResponse object containing the full generated text, metadata, and token usage statistics. The response follows the OpenAI-compatible format:

  • id -- A unique request identifier.
  • object -- The object type (always "chat.completion").
  • created -- Unix timestamp of when the response was created.
  • model -- The model that generated the response.
  • choices -- An array of generated completions, each containing:
    • index -- The choice index (0-based).
    • message -- The generated ChatCompletionMessage with role and content.
    • finish_reason -- Why generation stopped: "stop" (natural end or stop sequence), "length" (max_tokens reached), "eos" (end-of-sequence token), "tool_calls" (model invoked a function), or "error".
    • logprobs -- Token-level log probabilities (when requested).
    • seed -- The random seed used for generation.
  • usage -- Token count statistics: prompt_tokens, completion_tokens, total_tokens.
  • prompt -- The processed prompt (when echo is enabled).
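The field layout above can be read with plain attribute access. The following is a minimal sketch, not the library's own code: `summarize_response` is a hypothetical helper, and the `SimpleNamespace` object is a stand-in that only mimics the documented shape so the example runs without a network call or API key.

```python
from types import SimpleNamespace


def summarize_response(response):
    """Collect the commonly used fields of a non-streaming response (hypothetical helper)."""
    if not response.choices:
        raise ValueError("response contains no choices")
    choice = response.choices[0]
    usage = response.usage
    return {
        "id": response.id,
        "model": response.model,
        "text": choice.message.content,
        "finish_reason": choice.finish_reason,
        # "length" means generation stopped because max_tokens was reached
        "truncated": choice.finish_reason == "length",
        "total_tokens": usage.total_tokens if usage is not None else None,
    }


# Stand-in object with the documented shape; all values are invented for illustration.
example = SimpleNamespace(
    id="req-123",
    object="chat.completion",
    created=1700000000,
    model="example-model",
    choices=[
        SimpleNamespace(
            index=0,
            message=SimpleNamespace(role="assistant", content="Hello!"),
            finish_reason="stop",
            logprobs=None,
            seed=None,
        )
    ],
    usage=SimpleNamespace(prompt_tokens=5, completion_tokens=2, total_tokens=7),
)

summary = summarize_response(example)
print(summary["text"], summary["finish_reason"], summary["total_tokens"])
```

With a real client, the same helper would presumably be handed the object returned by the chat completion call, provided it exposes these attributes.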

Streaming Responses

When stream=True, the API returns an iterator of ChatCompletionChunk objects delivered via Server-Sent Events. Each chunk contains a partial update:

  • choices -- Each choice contains a delta with incremental content (typically one or a few tokens per chunk).
  • finish_reason -- Set on the final chunk to indicate why generation stopped; None on intermediate chunks.
  • usage -- Token usage data (may be included on the final chunk).

The consumer iterates over the stream, concatenating delta.content values to reconstruct the full response.
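The concatenation step can be sketched as follows. This is an illustrative helper under stated assumptions: `collect_stream` is a hypothetical name, `finish_reason` is read from each choice as in the OpenAI-compatible layout, and the chunks are `SimpleNamespace` stand-ins rather than real `ChatCompletionChunk` objects.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Rebuild the full text of the first choice from an iterable of chunks."""
    parts = []
    finish_reason = None
    usage = None
    for chunk in chunks:
        # usage may only appear on the final chunk, if at all
        chunk_usage = getattr(chunk, "usage", None)
        if chunk_usage is not None:
            usage = chunk_usage
        if not chunk.choices:
            continue
        choice = chunk.choices[0]
        delta = choice.delta
        # delta.content can be None, for example on the first or last chunk
        if delta is not None and delta.content is not None:
            parts.append(delta.content)
        if choice.finish_reason is not None:
            finish_reason = choice.finish_reason
    return "".join(parts), finish_reason, usage


def _chunk(content, finish_reason=None, usage=None):
    """Build a stand-in chunk; the values are invented for illustration."""
    delta = SimpleNamespace(role="assistant", content=content)
    choice = SimpleNamespace(index=0, delta=delta, finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice], usage=usage)


fake_stream = [
    _chunk(None),
    _chunk("Hel"),
    _chunk("lo"),
    _chunk(None, finish_reason="stop",
           usage=SimpleNamespace(prompt_tokens=4, completion_tokens=2, total_tokens=6)),
]

text, reason, final_usage = collect_stream(fake_stream)
print(text, reason, final_usage.total_tokens)
```

For real-time display, the `parts.append` line could be paired with a `print(delta.content, end="", flush=True)` so tokens appear as they arrive.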

Usage

Use response handling after making any chat completion request to extract generated text, tool calls, token usage, and finish reasons.

When to use:

  • Extracting the generated text from response.choices[0].message.content
  • Checking finish_reason to determine if generation was truncated or completed naturally
  • Reading token usage for billing and monitoring
  • Processing tool calls from the assistant's response
  • Iterating over streaming chunks for real-time display
  • Handling multiple choices when n > 1
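When `n > 1`, each entry in `choices` carries its own `index`, content, and finish reason. A minimal sketch of gathering them, assuming the documented shape (`texts_by_index` and `truncated_indexes` are hypothetical helpers, and the response is a stand-in object):

```python
from types import SimpleNamespace


def texts_by_index(response):
    """Map each choice index to its generated text."""
    return {choice.index: choice.message.content for choice in response.choices}


def truncated_indexes(response):
    """Return the indexes of choices cut off by the max_tokens limit."""
    return [c.index for c in response.choices if c.finish_reason == "length"]


# Stand-in response with two choices; the values are invented for illustration.
multi = SimpleNamespace(
    choices=[
        SimpleNamespace(index=0, finish_reason="stop",
                        message=SimpleNamespace(role="assistant", content="First answer")),
        SimpleNamespace(index=1, finish_reason="length",
                        message=SimpleNamespace(role="assistant", content="Second ans")),
    ]
)

print(texts_by_index(multi))
print(truncated_indexes(multi))
```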

Patterns to check:

  • Always verify choices is non-empty before accessing choices[0]
  • For tool calls, check finish_reason == "tool_calls" and iterate over message.tool_calls
  • For streaming, handle the case where delta.content is None (common on the first and last chunks)
  • Monitor usage.total_tokens to track API consumption
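The tool-call check above can be sketched as below. It assumes the OpenAI-compatible layout in which each entry of `message.tool_calls` has a `function` with a `name` and a JSON-encoded `arguments` string; `extract_tool_calls` is a hypothetical helper and `get_weather` is an invented tool name.

```python
import json
from types import SimpleNamespace


def extract_tool_calls(response):
    """Return (name, arguments) pairs when the model stopped to call tools."""
    # Guard against an empty choices list before indexing
    if not response.choices:
        return []
    choice = response.choices[0]
    if choice.finish_reason != "tool_calls":
        return []
    calls = []
    for call in choice.message.tool_calls or []:
        # arguments is assumed to arrive as a JSON string
        calls.append((call.function.name, json.loads(call.function.arguments)))
    return calls


# Stand-in response; the tool name and arguments are invented for illustration.
tool_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            index=0,
            finish_reason="tool_calls",
            message=SimpleNamespace(
                role="assistant",
                content=None,
                tool_calls=[
                    SimpleNamespace(
                        id="call_1",
                        function=SimpleNamespace(
                            name="get_weather",
                            arguments='{"city": "Paris"}',
                        ),
                    )
                ],
            ),
        )
    ]
)

for name, arguments in extract_tool_calls(tool_response):
    print(name, arguments)
```

In production code, `json.loads` may deserve a `try`/`except json.JSONDecodeError`, since a model can emit malformed arguments.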

Theoretical Basis

API responses follow the OpenAI-compatible format, which has become the de facto standard for chat completion APIs. This format provides:

  • Choices array -- Supports multiple independent completions per request (controlled by the n parameter), each with its own finish reason and content.
  • Usage statistics -- Token counts enable cost tracking and context window management. The prompt_tokens count reveals the tokenized size of the input, while completion_tokens tracks the generated output.
  • Finish reasons -- Semantic labels for generation termination conditions allow the application to distinguish between natural completion, truncation, and tool invocation.
  • Streaming deltas -- The chunk-based streaming format uses delta objects instead of complete message objects, containing only the incremental content added since the previous chunk. This minimizes bandwidth and enables progressive rendering.
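As an illustration of how the usage statistics support cost tracking and context window management, here is a rough sketch. `UsageTracker` and `remaining_context` are hypothetical helpers, and the context window size is a caller-supplied assumption, not a value taken from the API.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class UsageTracker:
    """Running totals of token usage across several responses."""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    requests: int = 0

    def add(self, usage):
        # usage can be absent (for example on intermediate streaming chunks)
        if usage is None:
            return
        self.prompt_tokens += usage.prompt_tokens
        self.completion_tokens += usage.completion_tokens
        self.requests += 1

    @property
    def total_tokens(self):
        return self.prompt_tokens + self.completion_tokens


def remaining_context(usage, context_window):
    """Estimate tokens left if this exchange is replayed in the next prompt."""
    return max(context_window - usage.total_tokens, 0)


tracker = UsageTracker()
# Stand-in usage objects; the numbers are invented for illustration.
tracker.add(SimpleNamespace(prompt_tokens=120, completion_tokens=30, total_tokens=150))
tracker.add(SimpleNamespace(prompt_tokens=200, completion_tokens=50, total_tokens=250))
tracker.add(None)

print(tracker.requests, tracker.total_tokens)
```

The estimate in `remaining_context` is deliberately coarse: it treats the whole exchange as part of the next prompt and ignores any template tokens the server may add.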

Knowledge Sources

Source Type URI
Together AI Chat Completions Response Doc Together AI Chat Completions Reference
Together AI Streaming Doc Together AI Chat Overview
