Workflow: Cohere Python SDK Streaming Chat
| Field | Value |
|---|---|
| Domains | LLMs, Text_Generation, Streaming, API_Client |
| Last Updated | 2026-02-15 14:00 GMT |
Overview
End-to-end process for streaming chat responses from Cohere models via Server-Sent Events, enabling real-time incremental text delivery to the user.
Description
This workflow demonstrates how to use the Cohere Python SDK's streaming chat endpoint (chat_stream) to receive model responses as they are generated, token by token. The streaming pipeline converts the raw HTTP SSE response into a sequence of typed event objects (message-start, content-delta, tool-call-start, citation-start, message-end, etc.) that can be consumed incrementally. This approach reduces time-to-first-token for interactive applications.
Usage
Execute this workflow when building interactive chat interfaces, CLI tools, or any application where displaying partial responses as they arrive improves user experience. Streaming is essential for long-form generation where waiting for the complete response would introduce unacceptable latency.
Execution Steps
Step 1: Initialize the Streaming Client
Create a ClientV2 instance as in the standard chat workflow. The same client supports both streaming and non-streaming endpoints. Since SDK v5.0.0, the stream=True parameter on chat() is no longer supported; use chat_stream() instead.
Key considerations:
- The same ClientV2 instance serves both chat() and chat_stream()
- The deprecated chat(stream=True) pattern raises a ValueError with migration guidance
- Context manager usage ensures the underlying httpx connection is properly closed
Step 2: Configure the Chat Request
Prepare the model name, message history, and optional parameters (temperature, max_tokens, tools, documents, response_format). The streaming endpoint accepts the same parameters as the non-streaming chat endpoint.
Key considerations:
- All parameters from the non-streaming chat endpoint are supported
- The thinking parameter enables extended reasoning with streaming thought blocks
- The documents parameter enables RAG with citation generation during streaming
- Tool definitions trigger tool-call events in the stream
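Since the streaming endpoint takes the same parameters as non-streaming chat, the request can be prepared as a plain keyword-argument dict. The model name and message content below are placeholders, not values mandated by the SDK:

```python
# Parameters mirror the non-streaming chat endpoint; only the call site
# (chat_stream vs. chat) differs.
request = {
    "model": "command-r-plus",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Summarize Server-Sent Events in one sentence."}
    ],
    "temperature": 0.3,
    "max_tokens": 256,
    # Optional extras: "tools": [...], "documents": [...],
    # "response_format": {...}
}
```

The dict would then be expanded into the call, e.g. `co.chat_stream(**request)`.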
Step 3: Call the Streaming Endpoint
Invoke chat_stream() which returns an iterator of V2ChatStreamResponse events. Under the hood, the raw client sends the HTTP request and wraps the response in an SSE decoder that parses the event stream line by line.
Key considerations:
- The method returns a typing.Iterator[V2ChatStreamResponse] (sync) or AsyncIterator (async)
- The SSE decoder handles the text/event-stream content type
- Each SSE line is parsed into a ServerSentEvent model with id, event, data, and retry fields
- JSON data payloads are deserialized into discriminated union event types
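A thin sketch of the synchronous invocation. The wrapper assumes only that the client object exposes `chat_stream(**params)` returning an iterator of typed events, as described above; the SSE decoding happens inside the SDK and is not reimplemented here.

```python
from typing import Any, Iterator

def stream_chat(co: Any, **params: Any) -> Iterator[Any]:
    """Yield typed stream events from the client's chat_stream().

    `co` is assumed to be a ClientV2 (or anything exposing a compatible
    chat_stream method); each yielded item is one decoded SSE event.
    """
    for event in co.chat_stream(**params):
        yield event
```

With a real client this would be driven as `for event in stream_chat(co, model=..., messages=...)`.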
Step 4: Consume Streaming Events
Iterate over the event stream and handle each event type. The primary event types are: message-start (initial metadata), content-delta (incremental text), tool-call-start/delta/end (function calling), citation-start/end (RAG citations), and message-end (final metadata with usage stats).
Key considerations:
- Filter events by the type field to handle specific event types
- Content deltas contain incremental text at event.delta.message.content.text
- The message-end event carries the finish_reason and usage statistics
- Tool call events arrive in start/delta/end triplets with accumulated arguments
- Citation events reference document sources used in the generated text
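The consumption loop above can be sketched with stub events standing in for real SDK objects. The attribute paths (`event.type`, `event.delta.message.content.text`, `event.delta.finish_reason`) follow the field names described in this step; treat them as assumptions to verify against the installed SDK version.

```python
def handle_events(events):
    """Dispatch on event.type: print text deltas, collect final metadata."""
    final = {}
    for event in events:
        if event.type == "content-delta":
            # Incremental text is assumed to live at
            # event.delta.message.content.text
            print(event.delta.message.content.text, end="", flush=True)
        elif event.type == "message-end":
            # Final metadata such as the finish reason arrives here
            final["finish_reason"] = event.delta.finish_reason
    return final
```

Tool-call and citation events would be handled with additional `elif` branches keyed on their `type` values.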
Step 5: Aggregate and Finalize
After the stream ends, compile the accumulated content deltas into the final response text. Extract usage metadata from the message-end event for billing and monitoring purposes.
Key considerations:
- Concatenate all content-delta text fields for the complete response
- The message-end event provides aggregated token counts and billing units
- Handle the finish_reason to detect truncation vs. natural completion
- For tool calls, parse the accumulated JSON arguments from delta events
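The aggregation step can be sketched as a pure function over the event sequence. As in the previous sketch, the event attribute paths are assumptions based on the field names described above, exercised here with stub objects rather than live API events.

```python
def aggregate(events):
    """Accumulate content deltas; pull finish_reason and usage from message-end."""
    parts = []
    finish_reason = None
    usage = None
    for event in events:
        if event.type == "content-delta":
            parts.append(event.delta.message.content.text)
        elif event.type == "message-end":
            finish_reason = event.delta.finish_reason
            usage = event.delta.usage  # token counts / billed units
    return "".join(parts), finish_reason, usage
```

Checking `finish_reason` after the loop distinguishes natural completion from truncation (e.g. a max-token stop).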