Implementation:OpenGVLab InternVL Streamlit Model Worker

Knowledge Sources	OpenGVLab_InternVL
Domains	Model Serving, Inference, Dynamic Resolution, FastAPI
Last Updated	2026-02-07 14:00 GMT

Overview

This module implements a FastAPI-based model worker that loads InternVL models for streaming inference, with dynamic image preprocessing, multi-GPU device mapping, and controller registration for distributed deployment.

Description

The model_worker.py file is the core inference server component for the Streamlit demo, implementing the ModelWorker class and associated utility functions:

Image Preprocessing:

build_transform: Creates an ImageNet-normalized transform pipeline (RGB conversion, bicubic resize to input_size, ToTensor, Normalize)
find_closest_aspect_ratio: Finds the optimal tile grid layout matching the input image's aspect ratio from all valid (i,j) combinations where i*j is within [min_num, max_num]
dynamic_preprocess: Splits images into aspect-ratio-aware tiles of image_size x image_size pixels, with optional thumbnail addition for multi-tile inputs

Model Loading:

split_model: Computes per-GPU device maps for multi-GPU deployment across the InternVL model family (from 1B to 78B parameters), reserving GPU 0 for the vision model with a configurable vit_alpha factor
ModelWorker.__init__: Loads models via AutoModel with optional 8-bit quantization and multi-GPU device mapping, initializes tokenizer with special token handling

Inference:

generate_stream: Processes multi-turn conversations with base64-decoded images, manages tile allocation (history images get 1 tile, current images share the budget), uses TextIteratorStreamer for streaming token generation in a separate thread
generate_stream_gate: Error-handling wrapper with model reload capability for CUDA and value errors

Infrastructure:

Controller registration and periodic heartbeat sending in a background thread
Concurrency control via asyncio Semaphore (configurable limit)
FastAPI endpoints: /worker_generate_stream and /worker_get_status

Usage

Use this module to deploy InternVL models as inference workers that integrate with the controller/worker serving architecture. It is the backend that processes requests from the Streamlit chat app.

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: streamlit_demo/model_worker.py
Lines: 1-449

Signature

class ModelWorker:
    def __init__(self, controller_addr, worker_addr, worker_id,
                 model_path, model_name, load_8bit, device,
                 context_len=8192): ...
    def generate_stream(self, params) -> Generator[bytes]: ...
    def generate_stream_gate(self, params) -> Generator[bytes]: ...
    def register_to_controller(self) -> None: ...
    def send_heart_beat(self) -> None: ...
    def reload_model(self) -> None: ...

def build_transform(input_size) -> T.Compose
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size) -> tuple
def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False) -> list[Image]
def split_model(model_name, vit_alpha=0.5) -> dict
def heart_beat_worker(controller) -> None

Import

from model_worker import ModelWorker, build_transform, dynamic_preprocess, split_model

I/O Contract

Inputs

Name	Type	Required	Description
--model-path	str	Yes	Path to the InternVL model checkpoint
--model-name	str	No	Override model name (derived from path if not specified)
--controller-address	str	No	Controller URL (default: "http://localhost:21001")
--worker-address	str	No	This worker's address (default: "http://localhost:21002")
--device	str	No	Device: "cuda" or "auto" for multi-GPU (default: "cuda")
--load-8bit	bool	No	Enable 8-bit quantized loading
--limit-model-concurrency	int	No	Max concurrent requests (default: 5)
params.prompt	list[dict]	Yes	Conversation messages with role, content, and optional base64 images
params.max_input_tiles	int	Yes	Maximum number of image tiles for dynamic resolution
params.temperature	float	Yes	Sampling temperature
params.top_p	float	Yes	Top-p sampling parameter
params.max_new_tokens	int	Yes	Maximum tokens to generate

Outputs

Name	Type	Description
Streaming response	bytes (JSON + \0 delimiter)	JSON-encoded chunks with 'text' and 'error_code' fields
Worker status	dict	Model names, speed, and queue length

Usage Examples

Basic Usage

# Launch a model worker
# python streamlit_demo/model_worker.py \
#     --model-path OpenGVLab/InternVL2-8B \
#     --controller-address http://localhost:21001 \
#     --worker-address http://localhost:21002 \
#     --port 21002 \
#     --device auto

# Multi-GPU deployment for large models
# python streamlit_demo/model_worker.py \
#     --model-path OpenGVLab/InternVL2-78B \
#     --device auto --load-8bit

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment