Workflow: Cohere Python SDK Streaming Chat
| Field | Value |
|---|---|
| Domains | LLMs, Text_Generation, Streaming, API_Client |
| Last Updated | 2026-02-15 14:00 GMT |
Overview
End-to-end process for streaming chat responses from Cohere models via Server-Sent Events, enabling real-time incremental text delivery to the user.
Description
This workflow demonstrates how to use the Cohere Python SDK's streaming chat endpoint (chat_stream) to receive model responses as they are generated, token by token. The streaming pipeline converts the raw HTTP SSE response into a sequence of typed event objects (message-start, content-delta, tool-call-start, citation-start, message-end, etc.) that can be consumed incrementally. This approach reduces time-to-first-token for interactive applications.
Usage
Execute this workflow when building interactive chat interfaces, CLI tools, or any application where displaying partial responses as they arrive improves user experience. Streaming is essential for long-form generation where waiting for the complete response would introduce unacceptable latency.
Execution Steps
Step 1: Initialize the Streaming Client
Create a ClientV2 instance as in the standard chat workflow. The same client supports both streaming and non-streaming endpoints. Since SDK v5.0.0, the stream=True parameter on chat() is no longer supported; use chat_stream() instead.
Key considerations:
- The same ClientV2 instance serves both chat() and chat_stream()
- The deprecated chat(stream=True) pattern raises a ValueError with migration guidance
- Context manager usage ensures the underlying httpx connection is properly closed
Step 2: Configure the Chat Request
Prepare the model name, message history, and optional parameters (temperature, max_tokens, tools, documents, response_format). The streaming endpoint accepts the same parameters as the non-streaming chat endpoint.
Key considerations:
- All parameters from the non-streaming chat endpoint are supported
- The thinking parameter enables extended reasoning with streaming thought blocks
- The documents parameter enables RAG with citation generation during streaming
- Tool definitions trigger tool-call events in the stream
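Since the streaming endpoint takes the same parameters as non-streaming chat, the request can be prepared as a plain keyword-argument dict. The model name and message content below are placeholders, not values mandated by the SDK:

```python
# Parameters mirror the non-streaming chat endpoint; only the call site
# (chat_stream vs. chat) differs.
request = {
    "model": "command-r-plus",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Summarize Server-Sent Events in one sentence."}
    ],
    "temperature": 0.3,
    "max_tokens": 256,
    # Optional extras: "tools": [...], "documents": [...],
    # "response_format": {...}
}
```

The dict would then be expanded into the call, e.g. `co.chat_stream(**request)`.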
Step 3: Call the Streaming Endpoint
Invoke chat_stream() which returns an iterator of V2ChatStreamResponse events. Under the hood, the raw client sends the HTTP request and wraps the response in an SSE decoder that parses the event stream line by line.
Key considerations:
- The method returns a typing.Iterator[V2ChatStreamResponse] (sync) or AsyncIterator (async)
- The SSE decoder handles the text/event-stream content type
- Each SSE line is parsed into a ServerSentEvent model with id, event, data, and retry fields
- JSON data payloads are deserialized into discriminated union event types
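A thin sketch of the synchronous invocation. The wrapper assumes only that the client object exposes `chat_stream(**params)` returning an iterator of typed events, as described above; the SSE decoding happens inside the SDK and is not reimplemented here.

```python
from typing import Any, Iterator

def stream_chat(co: Any, **params: Any) -> Iterator[Any]:
    """Yield typed stream events from the client's chat_stream().

    `co` is assumed to be a ClientV2 (or anything exposing a compatible
    chat_stream method); each yielded item is one decoded SSE event.
    """
    for event in co.chat_stream(**params):
        yield event
```

With a real client this would be driven as `for event in stream_chat(co, model=..., messages=...)`.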
Step 4: Consume Streaming Events
Iterate over the event stream and handle each event type. The primary event types are: message-start (initial metadata), content-delta (incremental text), tool-call-start/delta/end (function calling), citation-start/end (RAG citations), and message-end (final metadata with usage stats).
Key considerations:
- Filter events by the type field to handle specific event types
- Content deltas contain incremental text at event.delta.message.content.text
- The message-end event carries the finish_reason and usage statistics
- Tool call events arrive in start/delta/end triplets with accumulated arguments
- Citation events reference document sources used in the generated text
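The consumption loop above can be sketched with stub events standing in for real SDK objects. The attribute paths (`event.type`, `event.delta.message.content.text`, `event.delta.finish_reason`) follow the field names described in this step; treat them as assumptions to verify against the installed SDK version.

```python
def handle_events(events):
    """Dispatch on event.type: print text deltas, collect final metadata."""
    final = {}
    for event in events:
        if event.type == "content-delta":
            # Incremental text is assumed to live at
            # event.delta.message.content.text
            print(event.delta.message.content.text, end="", flush=True)
        elif event.type == "message-end":
            # Final metadata such as the finish reason arrives here
            final["finish_reason"] = event.delta.finish_reason
    return final
```

Tool-call and citation events would be handled with additional `elif` branches keyed on their `type` values.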
Step 5: Aggregate and Finalize
After the stream ends, compile the accumulated content deltas into the final response text. Extract usage metadata from the message-end event for billing and monitoring purposes.
Key considerations:
- Concatenate all content-delta text fields for the complete response
- The message-end event provides aggregated token counts and billing units
- Handle the finish_reason to detect truncation vs. natural completion
- For tool calls, parse the accumulated JSON arguments from delta events
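The aggregation step can be sketched as a pure function over the event sequence. As in the previous sketch, the event attribute paths are assumptions based on the field names described above, exercised here with stub objects rather than live API events.

```python
def aggregate(events):
    """Accumulate content deltas; pull finish_reason and usage from message-end."""
    parts = []
    finish_reason = None
    usage = None
    for event in events:
        if event.type == "content-delta":
            parts.append(event.delta.message.content.text)
        elif event.type == "message-end":
            finish_reason = event.delta.finish_reason
            usage = event.delta.usage  # token counts / billed units
    return "".join(parts), finish_reason, usage
```

Checking `finish_reason` after the loop distinguishes natural completion from truncation (e.g. a max-token stop).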