Implementation:Mit han lab Llm awq Model Worker VILA

Knowledge Sources	Mit_han_lab_Llm_awq
Domains	Serving, Inference
Last Updated	2026-02-15 00:00 GMT

Overview

The ModelWorker (VILA variant) is an alternative model worker that loads VILA-architecture models using VilaLlamaForCausalLM with separate LLM and vision tower components, serving streaming inference via FastAPI.

Description

This module provides a VILA-specific model worker for the TinyChat distributed serving stack. Unlike the standard model worker, which uses LlavaLlamaForCausalLM, this variant imports and instantiates VilaLlamaForCausalLM from tinychat.models.vila_llama.

The key architectural difference is that the VILA model separates the LLM backbone from the vision tower. The tokenizer is loaded from the model_path/llm subdirectory (via AutoTokenizer.from_pretrained(os.path.join(args.model_path, "llm"))), and checkpoint loading and quantization are applied to model.llm rather than to the entire model. The vision tower is accessed through model.get_vision_tower() (rather than model.get_model().vision_tower as in the LLaVA worker).

Loading depends on the precision. For W16A16, load_checkpoint_and_dispatch is applied to model.llm from the model_path/llm path, after which the full model is moved to the device and model.eval() is called. For W4A16 AWQ quantization, load_awq_model, make_quant_attn, and make_quant_norm are applied to model.llm only.

The rest of the serving infrastructure is identical to the standard worker: controller registration, heartbeat management, asyncio semaphore-based concurrency control, streaming generation via LlavaStreamGenerator, an error-wrapped generation gate, and the same FastAPI endpoints (/worker_generate_stream and /worker_get_status). The multimodal detection logic checks for both "llava" and "vila" in the model name.
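The semaphore-based concurrency control mentioned above can be illustrated with a minimal sketch. It is not the worker's actual code: the ConcurrencyGate class, its limit parameter, and the explicit pending counter are hypothetical names chosen for illustration, and the real worker may derive its queue length differently.

```python
import asyncio


class ConcurrencyGate:
    """Hypothetical helper: bounds concurrent jobs and reports queue length."""

    def __init__(self, limit: int = 5):
        # `limit` is an assumed setting, not a documented worker option.
        self._semaphore = asyncio.Semaphore(limit)
        self._pending = 0  # jobs currently running or waiting for a slot

    def queue_length(self) -> int:
        return self._pending

    async def run(self, job):
        """Run an async callable once a semaphore slot is free."""
        self._pending += 1
        try:
            async with self._semaphore:
                return await job()
        finally:
            self._pending -= 1


async def demo():
    """Submit five jobs through a gate of size two and record peak concurrency."""
    gate = ConcurrencyGate(limit=2)
    running = 0
    peak = 0

    async def job():
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)  # stand-in for a generation request
        running -= 1

    await asyncio.gather(*(gate.run(job) for _ in range(5)))
    return peak, gate.queue_length()
```

Tracking the pending count explicitly avoids reading private attributes of asyncio.Semaphore, at the cost of one extra counter.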

Usage

Run this module as a standalone FastAPI server for VILA-architecture models. Use this instead of the standard model_worker when your model directory follows the VILA layout with a separate llm subdirectory.
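Before launching, it can help to confirm that the model directory really follows the VILA layout. The following is a minimal sketch under the assumption that the only requirement to check is the presence of the llm subdirectory; check_vila_layout is a hypothetical helper, not part of TinyChat.

```python
import os
import tempfile


def check_vila_layout(model_path: str) -> str:
    """Return the path of the "llm" subdirectory, or raise if it is missing.

    Hypothetical helper: the worker itself simply joins model_path and "llm".
    """
    llm_dir = os.path.join(model_path, "llm")
    if not os.path.isdir(llm_dir):
        raise FileNotFoundError(
            f"{model_path!r} has no 'llm' subdirectory; "
            "it does not look like a VILA model directory"
        )
    return llm_dir
```

If the check fails, the standard model_worker is likely the better fit for that model directory.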

Code Reference

Source Location

Repository: Mit_han_lab_Llm_awq
File: tinychat/serve/model_worker_new.py
Lines: 1-446

Signature

class ModelWorker:
    def __init__(
        self,
        controller_addr: str,
        worker_addr: str,
        worker_id: str,
        no_register: bool,
        model_type: str,
        model_path: str,
        model_name: str,
        quant_path: str,
        precision: str,
        device: str,
    ): ...

    def register_to_controller(self) -> None: ...
    def send_heart_beat(self) -> None: ...
    def get_queue_length(self) -> int: ...
    def get_status(self) -> dict: ...

    @torch.inference_mode()
    def generate_stream(self, params: dict) -> Generator[bytes, None, None]: ...
    def generate_stream_gate(self, params: dict) -> Generator[bytes, None, None]: ...
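The relationship between generate_stream and generate_stream_gate can be sketched as a wrapper that forwards chunks and converts failures into a final error chunk. This is an illustration under assumptions, not the worker's code: stream_gate, SERVER_ERROR_MSG, the error_code value of 1, and the choice to catch every Exception are all hypothetical.

```python
import json
from typing import Callable, Dict, Iterator

# Hypothetical message; the real worker defines its own error text.
SERVER_ERROR_MSG = "The worker encountered an error while generating."


def stream_gate(
    generate: Callable[[Dict], Iterator[bytes]], params: Dict
) -> Iterator[bytes]:
    """Forward chunks from `generate`; on failure, emit one error chunk."""
    try:
        yield from generate(params)
    except Exception:  # assumption: the real gate may catch narrower types
        payload = {"text": SERVER_ERROR_MSG, "error_code": 1}
        # Same framing as normal chunks: JSON followed by a null byte.
        yield json.dumps(payload).encode() + b"\0"
```

Because the error is delivered in-band, a client that is already reading the stream receives it in the same format as ordinary output.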

Import

# Run as a standalone VILA model worker:
# python -m tinychat.serve.model_worker_new \
#     --model-path /path/to/vila-model \
#     --quant-path /path/to/vila-awq-weights.pt \
#     --precision W4A16 \
#     --controller-address http://localhost:21001 \
#     --port 21002

I/O Contract

Inputs

Name	Type	Required	Description
controller_addr	str	Yes	URL of the controller (CLI default: http://localhost:21001)
worker_addr	str	Yes	This worker's URL, used by the controller to reach it (CLI default: http://localhost:21002)
model_type	str	Yes	Base language model type, e.g. "LLaMa"
model_path	str	Yes	Path to the VILA model directory (must contain an "llm" subdirectory with the tokenizer)
quant_path	str	No	Path to AWQ quantized weights for the LLM backbone (required for W4A16)
precision	str	Yes	Quantization precision: "W16A16" (full) or "W4A16" (AWQ 4-bit)
device	str	Yes	Target device, e.g. "cuda"
params.prompt	str	Yes	The text prompt including image token placeholders
params.images	List[str]	No	Base64-encoded images corresponding to <image> tokens in the prompt
params.temperature	float	No	Sampling temperature (default: 1.0)
params.top_p	float	No	Top-p nucleus sampling parameter (default: 1.0)
params.max_new_tokens	int	No	Maximum tokens to generate (default: 256, capped at 1024)
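The defaults and the cap on max_new_tokens listed in the table can be summarized in a short sketch. normalize_params is a hypothetical helper written for illustration; the worker may read these fields inline rather than through a function like this.

```python
from typing import Any, Dict


def normalize_params(params: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the documented defaults and the 1024-token cap to a request."""
    return {
        "prompt": params["prompt"],  # required
        "images": params.get("images") or [],
        "temperature": float(params.get("temperature", 1.0)),
        "top_p": float(params.get("top_p", 1.0)),
        # Default is 256; larger requests are capped at 1024.
        "max_new_tokens": min(int(params.get("max_new_tokens", 256)), 1024),
    }
```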

Outputs

Name	Type	Description
streaming_response	StreamingResponse	Stream of JSON chunks with "text" and "error_code" fields, each chunk terminated by a null byte
status	dict	Worker status with keys "model_names" (List[str]), "speed" (int), "queue_length" (int)
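On the client side, the streamed response can be decoded by buffering bytes and splitting on the delimiter. The sketch below assumes each JSON chunk is terminated by a null byte; iter_stream_chunks is a hypothetical helper, and it accepts any iterable of byte pieces so it does not depend on a particular HTTP client.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_stream_chunks(byte_pieces: Iterable[bytes]) -> Iterator[Dict[str, Any]]:
    """Yield decoded JSON chunks from a null-byte-delimited byte stream."""
    buffer = b""
    for piece in byte_pieces:
        buffer += piece
        # A single network read may contain zero, one, or several chunks.
        while b"\0" in buffer:
            raw, buffer = buffer.split(b"\0", 1)
            if raw:
                yield json.loads(raw.decode("utf-8"))
```

Buffering matters because chunk boundaries need not line up with network reads; a chunk may arrive split across two pieces.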

Usage Examples

Launching a VILA AWQ Worker

# python -m tinychat.serve.model_worker_new \
#     --model-path /models/VILA-7b \
#     --quant-path /models/VILA-7b-awq.pt \
#     --precision W4A16 \
#     --controller-address http://localhost:21001 \
#     --worker-address http://localhost:21002 \
#     --port 21002

Launching a Full-Precision VILA Worker

# python -m tinychat.serve.model_worker_new \
#     --model-path /models/VILA-7b \
#     --precision W16A16 \
#     --controller-address http://localhost:21001 \
#     --port 21003
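When the launch commands above are generated from a script, the rule that W4A16 needs a quant path can be checked before the process starts. This is a sketch only: build_worker_command is a hypothetical helper, it uses just the flags shown in the examples above, and it returns the argument list without executing it.

```python
from typing import List, Optional


def build_worker_command(
    model_path: str,
    precision: str,
    port: int,
    quant_path: Optional[str] = None,
    controller_address: str = "http://localhost:21001",
    worker_address: Optional[str] = None,
) -> List[str]:
    """Build the argv list for launching the VILA worker (not executed here)."""
    if precision not in ("W16A16", "W4A16"):
        raise ValueError(f"unsupported precision: {precision!r}")
    if precision == "W4A16" and not quant_path:
        raise ValueError("W4A16 requires quant_path (AWQ weights for the LLM)")
    cmd = [
        "python", "-m", "tinychat.serve.model_worker_new",
        "--model-path", model_path,
        "--precision", precision,
        "--controller-address", controller_address,
        "--port", str(port),
    ]
    if quant_path:
        cmd += ["--quant-path", quant_path]
    if worker_address:
        cmd += ["--worker-address", worker_address]
    return cmd
```

The returned list can be passed to subprocess.run or printed for inspection.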

Key Differences from Standard Model Worker

# Standard worker (model_worker.py):
#   - Uses LlavaLlamaForCausalLM
#   - Tokenizer loaded from model_path directly
#   - Vision tower accessed via model.get_model().vision_tower
#   - Quantization applied to entire model

# VILA worker (model_worker_new.py):
#   - Uses VilaLlamaForCausalLM
#   - Tokenizer loaded from model_path/llm subdirectory
#   - Vision tower accessed via model.get_vision_tower()
#   - Quantization applied to model.llm only

Related Pages

Principle:Mit_han_lab_Llm_awq_Distributed_Model_Serving
