Workflow:Langchain ai Langchain Streaming Responses

From Leeroopedia
Knowledge Sources
Domains LLMs, Streaming, Chat_Models
Last Updated 2026-02-11 15:00 GMT

Overview

End-to-end process for streaming chat model responses token-by-token through LangChain's synchronous and asynchronous streaming interfaces.

Description

This workflow describes how to receive incremental response chunks from chat models in real time, rather than waiting for the complete response. LangChain provides streaming through the stream() and astream() methods on any chat model, which yield AIMessageChunk objects as the provider generates tokens. Each chunk carries incremental content, partial tool calls, and optional usage metadata. Internally, provider implementations produce ChatGenerationChunk objects that wrap these message chunks. Streaming is essential for responsive user interfaces, long-running generations, and applications that need to process output before the model finishes.

Usage

Execute this workflow when building interactive applications that need to display model output as it is generated (chatbots, code assistants, writing tools), when processing long responses where waiting for completion is unacceptable, or when implementing server-sent events (SSE) for web applications. Streaming is also useful for early termination: you can stop consuming the stream as soon as the output meets your stopping criteria.
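As an illustration of the SSE use case, the following is a minimal sketch of wrapping incremental text in SSE frames. It is not part of LangChain: the helper name to_sse, the JSON payload shape, and the closing "end" event are assumptions made for this example.

```python
import json
from typing import Iterable, Iterator


def to_sse(text_chunks: Iterable[str]) -> Iterator[str]:
    """Wrap incremental text pieces in Server-Sent Events frames."""
    for text in text_chunks:
        if not text:
            # Streams often contain empty pieces; there is nothing to send.
            continue
        # JSON-encoding escapes newlines, so each frame keeps a single
        # "data:" line regardless of what the model produced.
        payload = json.dumps({"text": text})
        yield f"data: {payload}\n\n"
    # Hypothetical end-of-stream event so the client knows when to stop.
    yield "event: end\ndata: {}\n\n"


frames = list(to_sse(["Hel", "", "lo\nworld"]))
```

With a real model, the input could be a generator such as (chunk.content for chunk in model.stream(messages)), assuming the provider returns string content rather than a list of content blocks.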

Execution Steps

Step 1: Initialize the Chat Model with Streaming Support

Create a chat model instance. Every LangChain chat model exposes stream() and astream(), so no special configuration is needed to enable streaming: it is selected at invocation time by calling stream() instead of invoke(). Some models accept a stream_usage parameter that includes token usage metadata in streaming chunks.

Key considerations:

  • Most actively maintained provider integrations implement native streaming; a model without a streaming implementation falls back to returning the full response as a single chunk
  • The streaming=True constructor parameter can make invoke() use streaming internally
  • stream_usage=True includes token usage metadata in streamed chunks (not all providers support this)

Step 2: Prepare the Input Messages

Construct the input prompt as for any chat model invocation. The input format is identical whether streaming or not: a string, list of messages, or PromptValue. The input is converted to normalized messages before being sent to the provider.

Key considerations:

  • Input preparation is identical to non-streaming invocation
  • System messages, tool definitions, and conversation history work the same way
  • The streaming decision happens at invocation time, not at model creation

Step 3: Call the Streaming Method

Invoke stream() (synchronous) or astream() (asynchronous) on the chat model. These methods call the provider-specific _stream() or _astream() implementation, which opens a streaming connection to the provider API and yields chunks as they arrive. Internally each chunk is a ChatGenerationChunk wrapping an AIMessageChunk; the public methods unwrap it and yield the AIMessageChunk to the caller.

Key considerations:

  • stream() returns a Python iterator of AIMessageChunk objects for use in for-loops
  • astream() returns an async iterator of AIMessageChunk objects for use in async for-loops
  • The underlying HTTP connection remains open for the duration of streaming
  • Callbacks (on_llm_new_token) fire for each chunk
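The two consumption patterns can be sketched as follows. To keep the example runnable without credentials, FakeChunk and FakeStreamingModel are hypothetical stand-ins that only mimic the stream()/astream() call shape and the content attribute; they are not LangChain classes.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Iterator


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a message chunk; only `content` is modeled."""
    content: str


class FakeStreamingModel:
    """Hypothetical stand-in with the same stream()/astream() call shape."""

    def __init__(self, pieces: list[str]) -> None:
        self.pieces = pieces

    def stream(self, prompt: str) -> Iterator[FakeChunk]:
        for piece in self.pieces:
            yield FakeChunk(piece)

    async def astream(self, prompt: str) -> AsyncIterator[FakeChunk]:
        for piece in self.pieces:
            await asyncio.sleep(0)  # hand control back, as a network read would
            yield FakeChunk(piece)


def read_sync(model, prompt: str) -> str:
    """Consume a synchronous stream with a plain for-loop."""
    parts = []
    for chunk in model.stream(prompt):
        parts.append(chunk.content)
    return "".join(parts)


async def read_async(model, prompt: str) -> str:
    """Consume an asynchronous stream with an async for-loop."""
    parts = []
    async for chunk in model.astream(prompt):
        parts.append(chunk.content)
    return "".join(parts)


model = FakeStreamingModel(["Stream", "ing ", "works"])
sync_text = read_sync(model, "hi")
async_text = asyncio.run(read_async(model, "hi"))
```

Because read_sync and read_async rely only on the method names and the content attribute, the same functions should work when a real chat model is passed in place of the stand-in.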

Step 4: Process Incremental Chunks

Iterate over the yielded chunks and process each one. Each AIMessageChunk contains a content field with the incremental text (typically one or a few tokens), and may include partial tool_call_chunks if the model is generating a tool call. Chunks can be concatenated using the + operator to build up the complete response progressively.

Key considerations:

  • Content chunks may be empty strings (especially the first and last chunks)
  • Tool call chunks contain incremental JSON fragments that must be accumulated
  • The + operator on AIMessageChunk handles proper aggregation of content, tool calls, and metadata
  • response_metadata is typically only populated on the final chunk
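To make the accumulation described above concrete, here is a simplified sketch of merging text and tool call fragments by hand. Plain dicts stand in for message chunks, and the accumulate helper is hypothetical; the fragment keys (index, id, name, args) follow the tool_call_chunks shape, but this is not LangChain's own merge logic.

```python
import json
from typing import Any, Iterable


def accumulate(chunks: Iterable[dict[str, Any]]) -> tuple[str, list[dict[str, Any]]]:
    """Merge chunk-like dicts into full text plus parsed tool calls."""
    text_parts: list[str] = []
    calls: dict[int, dict[str, Any]] = {}  # partial tool calls keyed by index
    for chunk in chunks:
        text_parts.append(chunk.get("content") or "")
        for frag in chunk.get("tool_call_chunks", []):
            call = calls.setdefault(
                frag["index"], {"name": "", "id": None, "args": ""}
            )
            # Name and id usually arrive once; args arrive as JSON fragments.
            call["name"] += frag.get("name") or ""
            call["id"] = call["id"] or frag.get("id")
            call["args"] += frag.get("args") or ""
    tool_calls = []
    for index in sorted(calls):
        call = calls[index]
        # The argument JSON is only parseable once every fragment has arrived.
        parsed = json.loads(call["args"] or "{}")
        tool_calls.append({"name": call["name"], "id": call["id"], "args": parsed})
    return "".join(text_parts), tool_calls


sample = [
    {"content": "Checking. "},
    {"content": "", "tool_call_chunks": [
        {"index": 0, "id": "call_1", "name": "get_weather", "args": ""}]},
    {"content": "", "tool_call_chunks": [
        {"index": 0, "id": None, "name": None, "args": '{"city": '}]},
    {"content": "", "tool_call_chunks": [
        {"index": 0, "id": None, "name": None, "args": '"Paris"}'}]},
]
text, tool_calls = accumulate(sample)
```

In practice the + operator on AIMessageChunk does this merging for you; the manual version is mainly useful for seeing why a single fragment of tool call arguments cannot be parsed on its own.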

Step 5: Accumulate the Final Response

After all chunks have been yielded, the accumulated result is equivalent to what invoke() would have returned. If you need the complete AIMessage, concatenate all chunks. LangChain's internal streaming accumulation handles this automatically when invoke() routes through the streaming path (via _should_stream()). Usage metadata (input tokens, output tokens) may appear on the final chunk or be distributed across chunks depending on the provider.

Key considerations:

  • The last chunk typically contains the complete usage metadata
  • Chunk concatenation preserves all tool calls and response metadata
  • If using streaming for invoke() internally, the framework handles accumulation transparently
  • stream_usage=True distributes usage info across chunks for some providers
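Since usage metadata may arrive on one chunk or several, a consumer that tracks tokens manually can sum whatever it sees. The sketch below assumes chunk-like dicts with an optional usage_metadata entry holding input_tokens, output_tokens, and total_tokens; the total_usage helper is hypothetical.

```python
from typing import Any, Iterable, Optional


def total_usage(chunks: Iterable[dict[str, Any]]) -> Optional[dict[str, int]]:
    """Sum usage counters over every chunk that carries them."""
    totals: dict[str, int] = {}
    for chunk in chunks:
        usage = chunk.get("usage_metadata")
        if not usage:
            continue  # most chunks carry no usage information
        for key in ("input_tokens", "output_tokens", "total_tokens"):
            totals[key] = totals.get(key, 0) + usage.get(key, 0)
    # None signals that the provider reported no usage at all.
    return totals or None


# Usage reported once, on the final chunk.
final_only = [
    {"content": "Hi"},
    {"content": "", "usage_metadata": {
        "input_tokens": 12, "output_tokens": 5, "total_tokens": 17}},
]
# Usage split between the first and last chunks.
distributed = [
    {"content": "", "usage_metadata": {
        "input_tokens": 12, "output_tokens": 0, "total_tokens": 12}},
    {"content": "Hi"},
    {"content": "", "usage_metadata": {
        "input_tokens": 0, "output_tokens": 5, "total_tokens": 5}},
]
usage_a = total_usage(final_only)
usage_b = total_usage(distributed)
```

Summing gives the same totals in both layouts, which is why the caller does not need to know how a given provider reports usage.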

Step 6: Handle Streaming Events (Advanced)

For complex chains and agents, use astream_events() to receive a stream of typed events from every component in the chain. Events include on_chat_model_stream (chunk from model), on_chain_start/end (chain lifecycle), on_tool_start/end (tool execution), and more. Each event includes the event type, name, data (the chunk or result), and run ID for tracing.

Key considerations:

  • astream_events() provides visibility into multi-step chain execution
  • Event version "v2" is the current recommended version
  • Events are useful for progress indicators in complex agent workflows
  • Run IDs allow correlating events across nested chain components
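A consumer of such an event stream typically dispatches on the event type. The following sketch uses a hypothetical fake_events generator that mimics the event dict shape (event, name, run_id, data). In a real stream, data["chunk"] would be a message chunk object read via its content attribute, not a dict.

```python
import asyncio
from typing import Any, AsyncIterator


async def fake_events() -> AsyncIterator[dict[str, Any]]:
    """Hypothetical event source shaped like a typed event stream."""
    yield {"event": "on_chain_start", "name": "agent", "run_id": "r1", "data": {}}
    yield {"event": "on_chat_model_stream", "name": "model", "run_id": "r2",
           "data": {"chunk": {"content": "Hel"}}}
    yield {"event": "on_tool_start", "name": "search", "run_id": "r3", "data": {}}
    yield {"event": "on_tool_end", "name": "search", "run_id": "r3", "data": {}}
    yield {"event": "on_chat_model_stream", "name": "model", "run_id": "r2",
           "data": {"chunk": {"content": "lo"}}}
    yield {"event": "on_chain_end", "name": "agent", "run_id": "r1", "data": {}}


async def collect(events: AsyncIterator[dict[str, Any]]):
    """Split a typed event stream into model text and tool lifecycle events."""
    text_parts: list[str] = []
    tool_log: list[tuple[str, str]] = []
    async for event in events:
        kind = event["event"]
        if kind == "on_chat_model_stream":
            # Stand-in chunks are dicts here; see the note above.
            text_parts.append(event["data"]["chunk"]["content"])
        elif kind in ("on_tool_start", "on_tool_end"):
            tool_log.append((kind, event["name"]))
    return "".join(text_parts), tool_log


streamed_text, tool_log = asyncio.run(collect(fake_events()))
```

The tool_log list is the kind of data a progress indicator could render, while streamed_text is what would be shown to the user as it accumulates.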
