Implementation:OpenGVLab InternVL Streamlit Model Worker
| Knowledge Sources | |
|---|---|
| Domains | Model Serving, Inference, Dynamic Resolution, FastAPI |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This module implements a FastAPI-based model worker that loads InternVL models for streaming inference, with dynamic image preprocessing, multi-GPU device mapping, and controller registration for distributed deployment.
Description
The model_worker.py file is the core inference server component for the Streamlit demo, implementing the ModelWorker class and associated utility functions:
Image Preprocessing:
- build_transform: Creates an ImageNet-normalized transform pipeline (RGB conversion, bicubic resize to input_size, ToTensor, Normalize)
- find_closest_aspect_ratio: Finds the optimal tile grid layout matching the input image's aspect ratio from all valid (i,j) combinations where i*j is within [min_num, max_num]
- dynamic_preprocess: Splits images into aspect-ratio-aware tiles of image_size x image_size pixels, with optional thumbnail addition for multi-tile inputs
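The grid selection described above can be sketched in plain Python. This is a simplified illustration, not the file's implementation: the helper names `candidate_grids`, `closest_grid`, and `tile_boxes` are hypothetical, and the sketch omits any area-based tie-breaking the real find_closest_aspect_ratio may perform with its width, height, and image_size arguments.

```python
def candidate_grids(min_num, max_num):
    """All (cols, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {
        (i, j)
        for i in range(1, max_num + 1)
        for j in range(1, max_num + 1)
        if min_num <= i * j <= max_num
    }
    # Sort by tile count first so smaller grids win ties deterministically.
    return sorted(grids, key=lambda g: (g[0] * g[1], g))


def closest_grid(width, height, min_num=1, max_num=6):
    """Pick the grid whose cols/rows ratio is closest to width/height."""
    aspect_ratio = width / height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidate_grids(min_num, max_num):
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
    return best


def tile_boxes(grid, image_size=448):
    """Crop boxes (left, top, right, bottom) for each tile, row by row."""
    cols, rows = grid
    return [
        (c * image_size, r * image_size, (c + 1) * image_size, (r + 1) * image_size)
        for r in range(rows)
        for c in range(cols)
    ]


grid = closest_grid(800, 400)   # a 2:1 landscape image
boxes = tile_boxes(grid)        # boxes to crop after resizing to the grid size
```

In this sketch the image would first be resized to `cols * image_size` by `rows * image_size` pixels and then cropped with each box; a thumbnail of the whole image could be appended when more than one tile is produced.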
Model Loading:
- split_model: Computes per-GPU device maps for multi-GPU deployment across the InternVL model family (from 1B to 78B parameters). GPU 0 hosts the vision model, so it receives fewer language-model layers; the configurable vit_alpha factor controls how much of GPU 0 is set aside
- ModelWorker.__init__: Loads models via AutoModel with optional 8-bit quantization and multi-GPU device mapping, initializes tokenizer with special token handling
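The layer split can be illustrated with a small, dependency-free sketch. It assumes that GPU 0 counts as `1 - vit_alpha` of a GPU for language-model layers; the function name `split_layers`, its `num_layers` and `world_size` parameters, and the module key names are illustrative assumptions rather than the file's actual signature, which takes a model name and reads the GPU count itself.

```python
import math


def split_layers(num_layers, world_size, vit_alpha=0.5):
    """Assign language-model layers to GPUs, leaving room on GPU 0 for the ViT.

    Assumption: GPU 0 is treated as (1 - vit_alpha) of a GPU for LLM layers.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    per_gpu = math.ceil(num_layers / (world_size - vit_alpha))
    budget = [per_gpu] * world_size
    budget[0] = math.ceil(per_gpu * (1 - vit_alpha))

    device_map = {}
    layer = 0
    for gpu, count in enumerate(budget):
        for _ in range(count):
            if layer >= num_layers:
                break
            # Hypothetical module path; real checkpoints may name layers differently.
            device_map[f"language_model.model.layers.{layer}"] = gpu
            layer += 1
    # The vision tower stays on the first GPU.
    device_map["vision_model"] = 0
    return device_map


device_map = split_layers(num_layers=32, world_size=2)
```

With 32 layers on two GPUs and `vit_alpha=0.5`, GPU 0 receives 11 layers and GPU 1 receives the remaining 21, so the first device keeps headroom for the vision model.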
Inference:
- generate_stream: Processes multi-turn conversations whose images arrive base64-encoded, allocates tiles (each history image gets 1 tile; images in the current turn share the budget), and streams tokens through TextIteratorStreamer while generation runs in a separate thread
- generate_stream_gate: Error-handling wrapper around generate_stream that catches CUDA and value errors and can reload the model
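The tile allocation rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `allocate_tiles` is hypothetical, "current" is taken to mean the last turn in the conversation, and the budget is assumed to be divided evenly with integer division and a floor of one tile per image.

```python
def allocate_tiles(images_per_turn, max_input_tiles):
    """Return a per-image tile budget, in conversation order.

    images_per_turn: number of images attached to each turn, oldest first.
    max_input_tiles: total tile budget for the images in the latest turn.
    """
    budgets = []
    last_turn = len(images_per_turn) - 1
    for turn, n_images in enumerate(images_per_turn):
        if n_images == 0:
            continue
        if turn < last_turn:
            # History images are kept cheap: one tile each.
            budgets.extend([1] * n_images)
        else:
            # Current images split the budget, never dropping below one tile.
            per_image = max(1, max_input_tiles // n_images)
            budgets.extend([per_image] * n_images)
    return budgets


# One image in the first turn, none in the second, two in the current turn.
budgets = allocate_tiles([1, 0, 2], max_input_tiles=6)
```

Each entry would then serve as the max_num argument when dynamic_preprocess tiles the corresponding image.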
Infrastructure:
- Controller registration and periodic heartbeat sending in a background thread
- Concurrency control via asyncio Semaphore (configurable limit)
- FastAPI endpoints: /worker_generate_stream and /worker_get_status
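The semaphore-based concurrency limit can be demonstrated without FastAPI. The sketch below is an assumption-labeled illustration: `run_limited` and `handle` are hypothetical names, and the `asyncio.sleep` call stands in for model inference.

```python
import asyncio


async def run_limited(num_requests, limit):
    """Run simulated requests while allowing at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def handle(request_id):
        nonlocal active, peak
        async with semaphore:
            # Only `limit` coroutines can be inside this block at once.
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # placeholder for streaming generation
            active -= 1
        return request_id

    results = await asyncio.gather(*(handle(i) for i in range(num_requests)))
    return results, peak


results, peak = asyncio.run(run_limited(num_requests=8, limit=3))
```

Here `peak` records the highest number of requests that ran simultaneously, which never exceeds the configured limit. Requests beyond the limit wait at the semaphore rather than being rejected.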
Usage
Use this module to deploy InternVL models as inference workers that integrate with the controller/worker serving architecture. It is the backend that processes requests from the Streamlit chat app.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: streamlit_demo/model_worker.py
- Lines: 1-449
Signature
class ModelWorker:
    def __init__(self, controller_addr, worker_addr, worker_id,
                 model_path, model_name, load_8bit, device,
                 context_len=8192): ...
    def generate_stream(self, params) -> Generator[bytes]: ...
    def generate_stream_gate(self, params) -> Generator[bytes]: ...
    def register_to_controller(self) -> None: ...
    def send_heart_beat(self) -> None: ...
    def reload_model(self) -> None: ...

def build_transform(input_size) -> T.Compose: ...
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size) -> tuple: ...
def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False) -> list[Image]: ...
def split_model(model_name, vit_alpha=0.5) -> dict: ...
def heart_beat_worker(controller) -> None: ...
Import
from model_worker import ModelWorker, build_transform, dynamic_preprocess, split_model
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --model-path | str | Yes | Path to the InternVL model checkpoint |
| --model-name | str | No | Override model name (derived from path if not specified) |
| --controller-address | str | No | Controller URL (default: "http://localhost:21001") |
| --worker-address | str | No | This worker's address (default: "http://localhost:21002") |
| --device | str | No | Device: "cuda" or "auto" for multi-GPU (default: "cuda") |
| --load-8bit | bool | No | Enable 8-bit quantized loading |
| --limit-model-concurrency | int | No | Max concurrent requests (default: 5) |
| params.prompt | list[dict] | Yes | Conversation messages with role, content, and optional base64 images |
| params.max_input_tiles | int | Yes | Maximum number of image tiles for dynamic resolution |
| params.temperature | float | Yes | Sampling temperature |
| params.top_p | float | Yes | Top-p sampling parameter |
| params.max_new_tokens | int | Yes | Maximum tokens to generate |
Outputs
| Name | Type | Description |
|---|---|---|
| Streaming response | bytes (JSON + \0 delimiter) | JSON-encoded chunks with 'text' and 'error_code' fields |
| Worker status | dict | Model names, speed, and queue length |
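On the client side, the null-delimited stream can be decoded with a small buffer loop. This is a minimal sketch: `iter_chunks` is a hypothetical helper, the sample bytes are made up for illustration, and in practice `byte_iter` would be whatever yields the raw response body from /worker_generate_stream.

```python
import json


def iter_chunks(byte_iter, delimiter=b"\0"):
    """Yield decoded JSON objects from a stream of delimiter-terminated chunks.

    byte_iter may split messages at arbitrary byte boundaries, so data is
    buffered until a full delimiter-terminated message is available.
    """
    buffer = b""
    for piece in byte_iter:
        buffer += piece
        while delimiter in buffer:
            raw, buffer = buffer.split(delimiter, 1)
            if raw:
                yield json.loads(raw.decode("utf-8"))


# Made-up sample data: two messages split across three network reads.
pieces = [
    b'{"text": "Hel',
    b'lo", "error_code": 0}\0{"text": "Hello w',
    b'orld", "error_code": 0}\0',
]
for chunk in iter_chunks(pieces):
    if chunk["error_code"] != 0:
        break
    latest_text = chunk["text"]
```

Checking `error_code` on every chunk lets a caller stop reading as soon as the worker reports a failure.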
Usage Examples
Basic Usage
# Launch a model worker
python streamlit_demo/model_worker.py \
    --model-path OpenGVLab/InternVL2-8B \
    --controller-address http://localhost:21001 \
    --worker-address http://localhost:21002 \
    --port 21002 \
    --device auto

# Multi-GPU deployment for large models
python streamlit_demo/model_worker.py \
    --model-path OpenGVLab/InternVL2_5-78B \
    --device auto --load-8bit