Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:OpenGVLab InternVL Streamlit Model Worker

From Leeroopedia


Knowledge Sources
Domains Model Serving, Inference, Dynamic Resolution, FastAPI
Last Updated 2026-02-07 14:00 GMT

Overview

This module implements a FastAPI-based model worker that loads InternVL models for streaming inference, with dynamic image preprocessing, multi-GPU device mapping, and controller registration for distributed deployment.

Description

The model_worker.py file is the core inference server component for the Streamlit demo, implementing the ModelWorker class and associated utility functions:

Image Preprocessing:

  • build_transform: Creates an ImageNet-normalized transform pipeline (RGB conversion, bicubic resize to input_size, ToTensor, Normalize)
  • find_closest_aspect_ratio: Finds the optimal tile grid layout matching the input image's aspect ratio from all valid (i,j) combinations where i*j is within [min_num, max_num]
  • dynamic_preprocess: Splits images into aspect-ratio-aware tiles of image_size x image_size pixels, with optional thumbnail addition for multi-tile inputs

Model Loading:

  • split_model: Computes per-GPU device maps for multi-GPU deployment across the InternVL model family (from 1B to 78B parameters), reserving GPU 0 for the vision model with a configurable vit_alpha factor
  • ModelWorker.__init__: Loads models via AutoModel with optional 8-bit quantization and multi-GPU device mapping, initializes tokenizer with special token handling

Inference:

  • generate_stream: Processes multi-turn conversations with base64-decoded images, manages tile allocation (history images get 1 tile, current images share the budget), uses TextIteratorStreamer for streaming token generation in a separate thread
  • generate_stream_gate: Error-handling wrapper with model reload capability for CUDA and value errors

Infrastructure:

  • Controller registration and periodic heartbeat sending in a background thread
  • Concurrency control via asyncio Semaphore (configurable limit)
  • FastAPI endpoints: /worker_generate_stream and /worker_get_status

Usage

Use this module to deploy InternVL models as inference workers that integrate with the controller/worker serving architecture. It is the backend that processes requests from the Streamlit chat app.

Code Reference

Source Location

Signature

class ModelWorker:
    def __init__(self, controller_addr, worker_addr, worker_id,
                 model_path, model_name, load_8bit, device,
                 context_len=8192): ...
    def generate_stream(self, params) -> Generator[bytes]: ...
    def generate_stream_gate(self, params) -> Generator[bytes]: ...
    def register_to_controller(self) -> None: ...
    def send_heart_beat(self) -> None: ...
    def reload_model(self) -> None: ...

def build_transform(input_size) -> T.Compose
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size) -> tuple
def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False) -> list[Image]
def split_model(model_name, vit_alpha=0.5) -> dict
def heart_beat_worker(controller) -> None

Import

from model_worker import ModelWorker, build_transform, dynamic_preprocess, split_model

I/O Contract

Inputs

Name Type Required Description
--model-path str Yes Path to the InternVL model checkpoint
--model-name str No Override model name (derived from path if not specified)
--controller-address str No Controller URL (default: "http://localhost:21001")
--worker-address str No This worker's address (default: "http://localhost:21002")
--device str No Device: "cuda" or "auto" for multi-GPU (default: "cuda")
--load-8bit bool No Enable 8-bit quantized loading
--limit-model-concurrency int No Max concurrent requests (default: 5)
params.prompt list[dict] Yes Conversation messages with role, content, and optional base64 images
params.max_input_tiles int Yes Maximum number of image tiles for dynamic resolution
params.temperature float Yes Sampling temperature
params.top_p float Yes Top-p sampling parameter
params.max_new_tokens int Yes Maximum tokens to generate

Outputs

Name Type Description
Streaming response bytes (JSON + \0 delimiter) JSON-encoded chunks with 'text' and 'error_code' fields
Worker status dict Model names, speed, and queue length

Usage Examples

Basic Usage

# Launch a model worker
# python streamlit_demo/model_worker.py \
#     --model-path OpenGVLab/InternVL2-8B \
#     --controller-address http://localhost:21001 \
#     --worker-address http://localhost:21002 \
#     --port 21002 \
#     --device auto

# Multi-GPU deployment for large models
# python streamlit_demo/model_worker.py \
#     --model-path OpenGVLab/InternVL2-78B \
#     --device auto --load-8bit

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment