Implementation:Lm sys FastChat Build Single Model Vision UI
| Knowledge Sources | |
|---|---|
| Domains | Web_UI, Model_Evaluation |
| Last Updated | 2026-02-07 06:00 GMT |
Overview
Constructs a single-model vision chat UI with multimodal image upload support for evaluating vision-language models.
Description
The build_single_vision_language_model_ui function creates a Gradio UI component for interacting with a single vision-language model. Users upload images through a MultimodalTextbox input and ask questions about them. The UI includes an image display panel that is shown or hidden depending on whether an image has been uploaded; set_visible_image and set_invisible_image control that visibility.
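The show/hide decision for the image panel can be sketched without Gradio. This is a minimal illustration, not the module's code: it assumes the multimodal textbox value is a dict with "text" and "files" keys, and the function name image_panel_update and its plain-dict return value are hypothetical stand-ins for whatever update object the real callbacks return.

```python
from typing import Any, Dict, List


def image_panel_update(textbox_value: Dict[str, Any]) -> Dict[str, bool]:
    """Return a visibility update for the image panel (hypothetical helper).

    Assumes the textbox value looks like {"text": str, "files": list}.
    """
    files: List[Any] = textbox_value.get("files") or []
    # The panel is visible only when at least one file has been uploaded.
    return {"visible": len(files) > 0}


if __name__ == "__main__":
    print(image_panel_update({"text": "What is this?", "files": ["cat.jpg"]}))
    print(image_panel_update({"text": "No image yet", "files": []}))
```

In a Gradio app the returned value would typically be an update for the column that holds the image component rather than a plain dict.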
Several utility functions handle images. add_image extracts image files from the multimodal textbox input. convert_images_to_conversation_format transforms uploaded images into the internal Image class format expected by the conversation system, supporting both file paths and base64-encoded data via the ImageFormat and Image classes from fastchat.serve.vision.image. The _prepare_text_with_image function assembles the final prompt by combining text and image references into the conversation state, and it also handles CSAM (child safety) flagging.
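The conversion step can be pictured as mapping each upload to a small record that notes how the image is stored. The sketch below is an assumption-laden illustration: SketchImageFormat, SketchImage, and to_conversation_images are hypothetical names that only mimic the idea of the ImageFormat and Image classes, and they do not reproduce the real classes' fields or methods.

```python
import base64
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Union


class SketchImageFormat(Enum):
    """Hypothetical stand-in for the storage formats an image may use."""
    LOCAL_FILEPATH = auto()
    BASE64 = auto()


@dataclass
class SketchImage:
    """Hypothetical stand-in for the internal image record."""
    fmt: SketchImageFormat
    payload: str  # a file path, or base64 text, depending on fmt


def to_conversation_images(items: List[Union[str, bytes]]) -> List[SketchImage]:
    """Convert uploads (paths or raw bytes) into SketchImage records."""
    converted: List[SketchImage] = []
    for item in items:
        if isinstance(item, bytes):
            # Raw bytes are stored as ASCII base64 text.
            encoded = base64.b64encode(item).decode("ascii")
            converted.append(SketchImage(SketchImageFormat.BASE64, encoded))
        else:
            # Strings are treated as local file paths and kept as-is.
            converted.append(SketchImage(SketchImageFormat.LOCAL_FILEPATH, item))
    return converted


if __name__ == "__main__":
    for image in to_conversation_images(["/tmp/cat.jpg", b"abc"]):
        print(image.fmt.name, image.payload)
```

Keeping the format tag next to the payload lets downstream code decide whether it needs to read a file or decode base64 text.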
The moderate_input function performs two layers of content moderation: text moderation via moderation_filter and image moderation via image_moderation_filter. It returns the corresponding warning message (TEXT_MODERATION_MSG or IMAGE_MODERATION_MSG) when content is flagged. The add_text function orchestrates the full input pipeline: validating character limits, running moderation, converting images, and appending the user turn to the conversation state. Optional VQA sample loading is supported through get_vqa_sample, which selects random visual question-answering examples for demonstration.
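The pipeline described above can be sketched as two small functions. This is a simplified illustration under stated assumptions: the filters are passed in as plain callables that return True when content is flagged, the warning strings are placeholders, the character limit is enforced by truncation, and the names moderate and add_user_turn are hypothetical rather than the module's actual signatures.

```python
from typing import Callable, List, Optional, Tuple

# Placeholder messages; the real constants live in fastchat.constants.
TEXT_WARNING = "text flagged (placeholder message)"
IMAGE_WARNING = "image flagged (placeholder message)"

Turn = Tuple[str, str, List[bytes]]


def moderate(
    text: str,
    images: List[bytes],
    text_flagged: Callable[[str], bool],
    image_flagged: Callable[[bytes], bool],
) -> Optional[str]:
    """Return a warning message if text or any image is flagged, else None."""
    if text_flagged(text):
        return TEXT_WARNING
    if any(image_flagged(image) for image in images):
        return IMAGE_WARNING
    return None


def add_user_turn(
    history: List[Turn],
    text: str,
    images: List[bytes],
    char_limit: int,
    text_flagged: Callable[[str], bool],
    image_flagged: Callable[[bytes], bool],
) -> Tuple[List[Turn], Optional[str]]:
    """Enforce the length limit, moderate, then append the user turn."""
    text = text[:char_limit]  # assumption: over-long input is truncated
    warning = moderate(text, images, text_flagged, image_flagged)
    if warning is not None:
        # Flagged input leaves the history unchanged.
        return history, warning
    return history + [("user", text, list(images))], None


if __name__ == "__main__":
    never = lambda _: False
    print(add_user_turn([], "Describe the scene.", [b"img"], 100, never, never))
```

Returning the warning alongside the unchanged history keeps the caller in charge of how the message is displayed.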
Usage
Use this module when building a single-model vision chat tab in the Chatbot Arena interface. It serves as the foundation for the multimodal direct chat experience and also exports key utility functions (set_visible_image, set_invisible_image, add_image, moderate_input, _prepare_text_with_image, convert_images_to_conversation_format) that are reused by the vision arena modules (both anonymous and named).
Code Reference
Source Location
- Repository: Lm_sys_FastChat
- File: fastchat/serve/gradio_block_arena_vision.py
- Lines: 1-511
Signature
def build_single_vision_language_model_ui(
    context: Context, add_promotion_links=False, random_questions=None
):
    """
    Build a single vision-language model chat UI.

    Args:
        context: Global Context object containing model lists and configuration.
        add_promotion_links: Whether to display blog/paper/social promotion links.
        random_questions: Optional list of VQA sample questions for the random button.

    Returns:
        list: [state, model_selector] Gradio State and Dropdown components.
    """
Import
from fastchat.serve.gradio_block_arena_vision import build_single_vision_language_model_ui
Key Functions
| Function | Line | Description |
|---|---|---|
| build_single_vision_language_model_ui | 298 | Main entry point; constructs the single-model vision chat Gradio tab |
| get_vqa_sample | 70 | Selects a random VQA sample with question text and image path |
| set_visible_image | 77 | Shows the image display column when an image is uploaded |
| set_invisible_image | 89 | Hides the image display column |
| add_image | 93 | Extracts image files from multimodal textbox input |
| vote_last_response | 101 | Logs user vote (upvote/downvote/flag) for single-model evaluation |
| upvote_last_response | 115 | Records an upvote for the model response |
| downvote_last_response | 122 | Records a downvote for the model response |
| flag_last_response | 129 | Flags the model response for review |
| regenerate | 136 | Clears last assistant message and regenerates a new response |
| clear_history | 146 | Resets conversation state, chatbot display, and image panel |
| report_csam_image | 165 | Reports detected CSAM content for safety compliance |
| _prepare_text_with_image | 169 | Assembles prompt by combining text and image references into state |
| convert_images_to_conversation_format | 181 | Converts uploaded images to internal Image class format |
| moderate_input | 194 | Performs text and image content moderation checks |
| add_text | 219 | Full input pipeline: validation, moderation, image conversion, state update |
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| context | Context | Yes | Global state object from fastchat.serve.gradio_global_state containing model lists and configuration |
| add_promotion_links | bool | No | Whether to display promotional links in the notice markdown (default: False) |
| random_questions | list | No | Optional list of VQA sample dicts with "question" and "path" keys for the random example button |
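The shape of random_questions can be illustrated with a small picker over dicts that carry "question" and "path" keys. This is a hedged sketch: pick_vqa_sample is a hypothetical name, the standard library random module is used here for simplicity, and the real get_vqa_sample may select and return samples differently.

```python
import random
from typing import Dict, List, Optional, Tuple


def pick_vqa_sample(
    samples: List[Dict[str, str]], rng: Optional[random.Random] = None
) -> Tuple[str, str]:
    """Pick one sample at random and return (question, image_path)."""
    if not samples:
        raise ValueError("samples must contain at least one entry")
    rng = rng or random.Random()
    sample = rng.choice(samples)
    return sample["question"], sample["path"]


if __name__ == "__main__":
    demo_samples = [
        {"question": "What is in this image?", "path": "/data/samples/cat.jpg"},
        {"question": "Describe the scene.", "path": "/data/samples/street.jpg"},
    ]
    print(pick_vqa_sample(demo_samples))
```

Passing an explicit random.Random instance makes the selection reproducible in tests.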
Outputs
| Name | Type | Description |
|---|---|---|
| returns | list | List of [state, model_selector] containing a single Gradio State and a Dropdown component |
Dependencies
Internal Imports
from fastchat.constants import (
    TEXT_MODERATION_MSG, IMAGE_MODERATION_MSG, MODERATION_MSG,
    CONVERSATION_LIMIT_MSG, INPUT_CHAR_LEN_LIMIT,
    CONVERSATION_TURN_LIMIT, SURVEY_LINK,
)
from fastchat.model.model_adapter import get_conversation_template
from fastchat.serve.gradio_global_state import Context
from fastchat.serve.gradio_web_server import (
    get_model_description_md, acknowledgment_md, bot_response,
    get_ip, disable_btn, State, get_conv_log_filename, get_remote_logger,
)
from fastchat.serve.vision.image import ImageFormat, Image
from fastchat.utils import build_logger, moderation_filter, image_moderation_filter
External Imports
import json
import os
import time
from typing import List, Union
import gradio as gr
from gradio.data_classes import FileData
import numpy as np
Usage Examples
# Building the single vision-language model tab
import gradio as gr
from fastchat.serve.gradio_global_state import Context
from fastchat.serve.gradio_block_arena_vision import (
    build_single_vision_language_model_ui,
)

context = Context()
context.text_models = ["llava-v1.5-7b", "llava-v1.5-13b"]

# Each VQA sample pairs a question with the path of its image
vqa_samples = [
    {"question": "What is in this image?", "path": "/data/samples/cat.jpg"},
    {"question": "Describe the scene.", "path": "/data/samples/street.jpg"},
]

with gr.Blocks() as demo:
    with gr.Tab("Vision Direct Chat"):
        # Returns [state, model_selector]
        state_and_selector = build_single_vision_language_model_ui(
            context,
            add_promotion_links=True,
            random_questions=vqa_samples,
        )