Principle:OpenGVLab InternVL Streamlit Chat Interface

Knowledge Sources	OpenGVLab_InternVL
Domains	Web Application, Multimodal Chat, Streamlit UI, Visual Grounding
Last Updated	2026-02-07 14:00 GMT

Overview

Streamlit Chat Interface is the principle of building rich multimodal chat applications using Streamlit's reactive framework, integrating model inference with post-processing capabilities such as bounding box visualization and image generation.

Description

This principle covers the design of production-quality chat demos for multimodal models using Streamlit, emphasizing five patterns:

Session state management: Streamlit's session_state stores the full conversation history including text messages, PIL images, and metadata across reruns. Each message carries its role ('user' or 'assistant'), content, and optional image attachments, enabling multi-turn conversations with persistent context.
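As a minimal sketch of this pattern, the helpers below keep the conversation in a "messages" list inside a dict-like state object. The function names and the exact message keys ("role", "content", "images") are assumptions for illustration and are not taken from the InternVL source. Only membership tests and item access are used, which st.session_state supports, so the same helpers can be exercised with a plain dict.

```python
from typing import Any, List, MutableMapping, Optional


def init_chat_state(state: MutableMapping[str, Any]) -> None:
    """Create the message list once; later reruns leave it untouched."""
    if "messages" not in state:
        state["messages"] = []


def append_message(
    state: MutableMapping[str, Any],
    role: str,
    content: str,
    images: Optional[List[Any]] = None,
) -> None:
    """Append one turn to the history (hypothetical message layout)."""
    if role not in ("user", "assistant"):
        raise ValueError(f"unexpected role: {role!r}")
    state["messages"].append(
        {"role": role, "content": content, "images": list(images or [])}
    )


# Simulate two script runs against the same state object.
state: dict = {}
init_chat_state(state)
append_message(state, "user", "What is in this picture?", images=["<PIL image>"])
init_chat_state(state)  # second run: the history is preserved
append_message(state, "assistant", "A dog on a beach.")
```

In a Streamlit script, st.session_state would be passed in place of the plain dict, so the history survives each rerun.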

Multi-image conversation flow: Images are uploaded via file_uploader, converted to PIL images, hashed for deduplication and logging, and attached to user messages. Tile budget management in the model worker gives each historical image minimal resolution (1 tile) while the current turn's images share the full tile budget, trading image detail in older turns for a shorter context.
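The tile budget rule can be illustrated with a short sketch. The function name allocate_tiles is hypothetical, and the even split of the budget across the current turn's images (with a floor of one tile) is an assumption: the description above only states that those images share the budget.

```python
from typing import List


def allocate_tiles(images_per_turn: List[int], max_tiles: int) -> List[List[int]]:
    """Return a tile count for every image in every turn.

    images_per_turn[i] is the number of images attached in turn i;
    the last entry is the current turn.
    """
    if not images_per_turn:
        return []
    # Historical images: one tile each, regardless of the budget.
    allocation = [[1] * n for n in images_per_turn[:-1]]
    # Current turn: split the budget evenly (assumed), at least one tile each.
    current = images_per_turn[-1]
    per_image = max(1, max_tiles // current) if current else 0
    allocation.append([per_image] * current)
    return allocation


plan = allocate_tiles([1, 2, 3], max_tiles=12)
print(plan)  # [[1], [1, 1], [4, 4, 4]]
```

With a budget of 12 tiles, the three images in the current turn get 4 tiles each, while the three older images together consume only 3 tiles.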

Response post-processing pipeline: Model outputs are parsed for structured markup:
- Visual grounding: <ref>category</ref><box>x1,y1,x2,y2</box> tags are parsed via regex, coordinates are scaled from 0-1000 normalized space to pixel coordinates, and colored bounding boxes with labels are drawn on the last uploaded image
- Image generation: drawing-instruction code blocks trigger calls to an external Stable Diffusion worker, with the generated image displayed in the chat
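The visual grounding step can be sketched as follows. The regex and the parse_boxes helper are illustrative assumptions rather than the demo's actual code. They accept the tag layout described above (four comma-separated integers inside the box tag) and scale each coordinate from the 0-1000 range to pixels using integer arithmetic.

```python
import re
from typing import List, Tuple

# Matches <ref>label</ref><box>x1,y1,x2,y2</box>, allowing optional whitespace.
BOX_PATTERN = re.compile(
    r"<ref>([^<]*)</ref>\s*<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>"
)


def parse_boxes(
    text: str, width: int, height: int
) -> List[Tuple[str, int, int, int, int]]:
    """Extract (label, x1, y1, x2, y2) tuples in pixel coordinates."""
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        label = match.group(1).strip()
        x1, y1, x2, y2 = (int(v) for v in match.groups()[1:])
        boxes.append(
            (
                label,
                x1 * width // 1000,   # x values scale with image width
                y1 * height // 1000,  # y values scale with image height
                x2 * width // 1000,
                y2 * height // 1000,
            )
        )
    return boxes


reply = "Here it is: <ref>dog</ref><box>100,200,500,800</box>"
boxes = parse_boxes(reply, width=640, height=480)
print(boxes)  # [('dog', 64, 96, 320, 384)]
```

The resulting tuples could then be handed to a drawing routine, for example Pillow's ImageDraw.Draw(image).rectangle(...), to render the labeled boxes on the last uploaded image.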

Bilingual internationalization: The entire UI (labels, captions, error messages, example prompts) is duplicated for English and Chinese, controlled by a language selector. This demonstrates the pattern of maintaining parallel UI text without a full i18n framework.
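One lightweight way to keep parallel UI text is a nested dictionary keyed by language code, sketched below. All keys and strings here are hypothetical examples rather than the demo's actual labels. The missing_keys helper is an added safeguard, not something the source describes. It reports any key that exists in one language but not the other, which is the main risk of duplicating text by hand.

```python
from typing import Dict, List

# Hypothetical UI strings; the real demo's labels may differ.
UI_TEXT: Dict[str, Dict[str, str]] = {
    "en": {
        "upload": "Upload images",
        "send": "Send",
        "error_network": "Network error, please try again.",
    },
    "zh": {
        "upload": "上传图片",
        "send": "发送",
        "error_network": "网络错误，请重试。",
    },
}


def t(lang: str, key: str) -> str:
    """Look up a UI string for the selected language."""
    return UI_TEXT[lang][key]


def missing_keys() -> Dict[str, List[str]]:
    """Return, per language, the keys defined elsewhere but missing here."""
    all_keys = set().union(*(table.keys() for table in UI_TEXT.values()))
    gaps = {lang: sorted(all_keys - table.keys()) for lang, table in UI_TEXT.items()}
    return {lang: keys for lang, keys in gaps.items() if keys}


print(t("zh", "send"))  # 发送
print(missing_keys())   # {}
```

The language selector then only needs to store the chosen code ("en" or "zh") and pass it to t() wherever a label is rendered.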

Gallery-based onboarding: Pre-loaded example images with associated prompts allow users to immediately experience model capabilities without uploading their own images, reducing friction for first-time users.

Usage

Apply this principle when building interactive demonstration applications for multimodal models that need rich UI features beyond simple text chat, including image display, bounding box visualization, multi-image support, and streaming responses.

Theoretical Basis

Streamlit's dataflow programming model re-executes the entire script on each user interaction, with session_state providing persistence between runs. This simplifies state management but requires careful attention to rerun behavior (e.g., using st.rerun() strategically, and changing the file uploader's key to clear its selection). The controller-worker separation follows the same distributed serving pattern as the Gradio-based demo, enabling model-agnostic frontends.
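The uploader-key technique can be sketched with two small helpers. The names uploader_widget_key and reset_uploader and the "uploader_key" state entry are assumptions for illustration. The idea is that Streamlit identifies a widget by its key, so giving the uploader a new key on the next run yields a fresh, empty widget.

```python
from typing import Any, MutableMapping


def uploader_widget_key(state: MutableMapping[str, Any]) -> str:
    """Return the key to use for the file uploader on this run."""
    if "uploader_key" not in state:
        state["uploader_key"] = 0
    return f"uploader_{state['uploader_key']}"


def reset_uploader(state: MutableMapping[str, Any]) -> None:
    """Bump the counter so the next run builds a new, empty uploader."""
    if "uploader_key" not in state:
        state["uploader_key"] = 0
    state["uploader_key"] += 1


# Simulated with a plain dict standing in for st.session_state.
state: dict = {}
first = uploader_widget_key(state)   # "uploader_0"
reset_uploader(state)                # e.g. after the user sends a message
second = uploader_widget_key(state)  # "uploader_1"
```

In a Streamlit script, the key would be passed as st.file_uploader(label, key=uploader_widget_key(st.session_state)), and reset_uploader(st.session_state) followed by st.rerun() would clear the selection once a message has been sent.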

Related Pages

Implementation:OpenGVLab_InternVL_Streamlit_Chat_App
