Implementation:Lm sys FastChat Build Single Model Vision UI

From Leeroopedia


Knowledge Sources
Domains: Web_UI, Model_Evaluation
Last Updated: 2026-02-07 06:00 GMT

Overview

Constructs a single-model vision chat UI with multimodal image upload support for evaluating vision-language models.

Description

The build_single_vision_language_model_ui function creates a Gradio UI component that enables users to interact with a single vision-language model. Users can upload images via a MultimodalTextbox input and ask questions about them. The UI includes an image display panel that dynamically shows or hides based on whether an image has been uploaded, controlled by set_visible_image and set_invisible_image.
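
The show/hide behavior described above can be sketched without Gradio. This is a hypothetical illustration, not the module's code: `extract_first_image` and `panel_update` are invented names, and the sketch assumes the textbox value is a dict with a `files` list, which is the shape a Gradio MultimodalTextbox value typically has.

```python
from typing import Optional


def extract_first_image(textbox_value: dict) -> Optional[str]:
    """Return the first uploaded file path, or None if nothing was uploaded."""
    files = textbox_value.get("files") or []
    return files[0] if files else None


def panel_update(textbox_value: dict) -> dict:
    """Describe the desired image-panel state as a plain dict (hypothetical shape)."""
    image = extract_first_image(textbox_value)
    # The panel is visible only when an image is present.
    return {"visible": image is not None, "value": image}
```

In a real Gradio app, the equivalent step would return component updates (for example via `gr.update(visible=...)`) rather than a plain dict.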

The module handles images through several utility functions. add_image extracts image files from the multimodal textbox input. convert_images_to_conversation_format transforms uploaded images into the internal Image class format expected by the conversation system; it supports both file paths and base64-encoded data via the ImageFormat and Image classes from fastchat.serve.vision.image. The _prepare_text_with_image function assembles the final prompt by combining text and image references into the conversation state. It also accounts for CSAM (child sexual abuse material) flagging as a child-safety measure.
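
To make the path-versus-base64 distinction concrete, here is a minimal sketch of one way such a conversion could look. It does not reproduce the real ImageFormat or Image classes: `SketchImageFormat`, `SketchImage`, and `to_conversation_images` are hypothetical names, and treating `str` as a path and `bytes` as raw image data is an assumption of this sketch.

```python
import base64
import os
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Union


class SketchImageFormat(Enum):
    PATH = auto()    # image referenced by a file path
    BASE64 = auto()  # image carried inline as base64 text


@dataclass
class SketchImage:
    source_format: SketchImageFormat
    payload: str   # the path or the base64 text, depending on source_format
    filetype: str  # lowercase extension, or "unknown"


def to_conversation_images(items: List[Union[str, bytes]]) -> List[SketchImage]:
    """Wrap each upload: str is treated as a path, bytes as raw image data."""
    converted = []
    for item in items:
        if isinstance(item, bytes):
            encoded = base64.b64encode(item).decode("ascii")
            converted.append(SketchImage(SketchImageFormat.BASE64, encoded, "unknown"))
        else:
            ext = os.path.splitext(item)[1].lstrip(".").lower() or "unknown"
            converted.append(SketchImage(SketchImageFormat.PATH, item, ext))
    return converted
```

Tagging each image with its source format lets downstream code decide later whether it needs to read a file or can use the inline data directly.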

The moderate_input function runs two content-moderation checks: text moderation via moderation_filter and image moderation via image_moderation_filter. It returns the matching warning message (TEXT_MODERATION_MSG or IMAGE_MODERATION_MSG). The add_text function orchestrates the full input pipeline: validating character limits, running moderation, converting images, and appending the user turn to the conversation state. Optional VQA sample loading is supported through get_vqa_sample, which selects a random visual question-answering example for demonstration.
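
The pipeline order described above (length check, moderation, then state update) can be sketched as follows. This is an illustrative outline under stated assumptions, not the FastChat implementation: the constants are placeholders for those in fastchat.constants, the moderation callables are supplied by the caller, and enforcing the character limit by truncation is a choice made for this sketch.

```python
from typing import Callable, List, Optional, Tuple

# Placeholder values for illustration only.
SKETCH_CHAR_LIMIT = 20
SKETCH_TEXT_MSG = "text flagged"
SKETCH_IMAGE_MSG = "image flagged"

Turn = Tuple[str, str, List[str]]


def moderate(
    text: str,
    images: List[str],
    text_flagged: Callable[[str], bool],
    image_flagged: Callable[[str], bool],
) -> Optional[str]:
    """Return a warning message if either check trips, else None."""
    if text_flagged(text):
        return SKETCH_TEXT_MSG
    if any(image_flagged(img) for img in images):
        return SKETCH_IMAGE_MSG
    return None


def add_user_turn(
    history: List[Turn],
    text: str,
    images: List[str],
    text_flagged: Callable[[str], bool],
    image_flagged: Callable[[str], bool],
) -> Tuple[List[Turn], Optional[str]]:
    """Apply the length limit, moderate, then append the user turn."""
    text = text[:SKETCH_CHAR_LIMIT]  # assumption: limit enforced by truncation
    warning = moderate(text, images, text_flagged, image_flagged)
    if warning is not None:
        return history, warning  # history is left unchanged on a flag
    return history + [("user", text, list(images))], None
```

Returning the warning alongside the unchanged history keeps the moderation decision separate from how the UI chooses to display it.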

Usage

Use this module when building a single-model vision chat tab in the Chatbot Arena interface. It serves as the foundation for the multimodal direct chat experience and also exports key utility functions (set_visible_image, set_invisible_image, add_image, moderate_input, _prepare_text_with_image, convert_images_to_conversation_format) that are reused by the vision arena modules (both anonymous and named).

Code Reference

Source Location

Signature

def build_single_vision_language_model_ui(
    context: Context, add_promotion_links=False, random_questions=None
):
    """
    Build a single vision-language model chat UI.

    Args:
        context: Global Context object containing model lists and configuration.
        add_promotion_links: Whether to display blog/paper/social promotion links.
        random_questions: Optional list of VQA sample questions for the random button.

    Returns:
        list: [state, model_selector] Gradio State and Dropdown components.
    """

Import

from fastchat.serve.gradio_block_arena_vision import build_single_vision_language_model_ui

Key Functions

| Function | Line | Description |
|---|---|---|
| build_single_vision_language_model_ui | 298 | Main entry point; constructs the single-model vision chat Gradio tab |
| get_vqa_sample | 70 | Selects a random VQA sample with question text and image path |
| set_visible_image | 77 | Shows the image display column when an image is uploaded |
| set_invisible_image | 89 | Hides the image display column |
| add_image | 93 | Extracts image files from multimodal textbox input |
| vote_last_response | 101 | Logs user vote (upvote/downvote/flag) for single-model evaluation |
| upvote_last_response | 115 | Records an upvote for the model response |
| downvote_last_response | 122 | Records a downvote for the model response |
| flag_last_response | 129 | Flags the model response for review |
| regenerate | 136 | Clears last assistant message and regenerates a new response |
| clear_history | 146 | Resets conversation state, chatbot display, and image panel |
| report_csam_image | 165 | Reports detected CSAM content for safety compliance |
| _prepare_text_with_image | 169 | Assembles prompt by combining text and image references into state |
| convert_images_to_conversation_format | 181 | Converts uploaded images to internal Image class format |
| moderate_input | 194 | Performs text and image content moderation checks |
| add_text | 219 | Full input pipeline: validation, moderation, image conversion, state update |

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| context | Context | Yes | Global state object from fastchat.serve.gradio_global_state containing model lists and configuration |
| add_promotion_links | bool | No | Whether to display promotional links in the notice markdown (default: False) |
| random_questions | list | No | Optional list of VQA sample dicts with "question" and "path" keys for the random example button |
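
As an illustration of the random_questions shape, the sketch below picks one sample dict and returns its question and image path. `pick_random_sample` is a hypothetical helper written for this page; it is not the signature or body of get_vqa_sample.

```python
import random
from typing import Dict, List, Optional, Tuple


def pick_random_sample(
    samples: List[Dict[str, str]],
    rng: Optional[random.Random] = None,
) -> Tuple[str, str]:
    """Pick one sample dict and return its (question, path) pair."""
    chooser = rng if rng is not None else random
    sample = chooser.choice(samples)
    return sample["question"], sample["path"]
```

Passing a seeded `random.Random` instance makes the selection reproducible, which can help when testing the random example button.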

Outputs

| Name | Type | Description |
|---|---|---|
| returns | list | List of [state, model_selector] containing a single Gradio State and a Dropdown component |

Dependencies

Internal Imports

from fastchat.constants import (
    TEXT_MODERATION_MSG, IMAGE_MODERATION_MSG, MODERATION_MSG,
    CONVERSATION_LIMIT_MSG, INPUT_CHAR_LEN_LIMIT,
    CONVERSATION_TURN_LIMIT, SURVEY_LINK,
)
from fastchat.model.model_adapter import get_conversation_template
from fastchat.serve.gradio_global_state import Context
from fastchat.serve.gradio_web_server import (
    get_model_description_md, acknowledgment_md, bot_response,
    get_ip, disable_btn, State, get_conv_log_filename, get_remote_logger,
)
from fastchat.serve.vision.image import ImageFormat, Image
from fastchat.utils import build_logger, moderation_filter, image_moderation_filter

External Imports

import json
import os
import time
from typing import List, Union
import gradio as gr
from gradio.data_classes import FileData
import numpy as np

Usage Examples

# Building the single vision-language model tab
import gradio as gr
from fastchat.serve.gradio_global_state import Context
from fastchat.serve.gradio_block_arena_vision import (
    build_single_vision_language_model_ui,
)

# Global context holding the model names offered in the selector
context = Context()
context.text_models = ["llava-v1.5-7b", "llava-v1.5-13b"]

# Each sample provides the "question" and "path" keys used by the random example button
vqa_samples = [
    {"question": "What is in this image?", "path": "/data/samples/cat.jpg"},
    {"question": "Describe the scene.", "path": "/data/samples/street.jpg"},
]

with gr.Blocks() as demo:
    with gr.Tab("Vision Direct Chat"):
        # Returns [state, model_selector] for wiring into the surrounding app
        state_and_selector = build_single_vision_language_model_ui(
            context,
            add_promotion_links=True,
            random_questions=vqa_samples,
        )
