Implementation:Mit han lab Llm awq Model Worker VILA
| Knowledge Sources | |
|---|---|
| Domains | Serving, Inference |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
The ModelWorker (VILA variant) is an alternative model worker that loads VILA-architecture models using VilaLlamaForCausalLM with separate LLM and vision tower components, serving streaming inference via FastAPI.
Description
This module provides a VILA-specific model worker for the TinyChat distributed serving stack. Unlike the standard model worker, which uses LlavaLlamaForCausalLM, this variant imports and instantiates VilaLlamaForCausalLM from tinychat.models.vila_llama.

The key architectural difference is that the VILA model separates the LLM backbone from the vision tower:
- The tokenizer is loaded from the model_path/llm subdirectory via AutoTokenizer.from_pretrained(os.path.join(args.model_path, "llm")).
- Checkpoint loading and quantization are applied to model.llm rather than to the entire model.
- The vision tower is accessed through model.get_vision_tower() rather than model.get_model().vision_tower as in the LLaVA worker.

How model.llm is loaded depends on the precision:
- W16A16: load_checkpoint_and_dispatch is applied to model.llm from the model_path/llm path, after which the full model is moved to the device and model.eval() is called.
- W4A16 (AWQ): load_awq_model, make_quant_attn, and make_quant_norm are applied to model.llm.

The rest of the serving infrastructure is identical to the standard worker: controller registration, heartbeat management, asyncio semaphore-based concurrency control, streaming generation via LlavaStreamGenerator, an error-wrapped generation gate, and the same FastAPI endpoints (/worker_generate_stream and /worker_get_status). The multimodal detection logic checks for both "llava" and "vila" in the model name.
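The multimodal check and the per-precision handling of model.llm can be sketched as below. This is an illustration under stated assumptions rather than the worker's own code: is_multimodal and llm_loading_steps are hypothetical helper names, the case-insensitive match is an assumption, and the second helper only names the calls listed in the description without loading anything.

```python
from typing import List


def is_multimodal(model_name: str) -> bool:
    """Return True when the model name marks a vision-language model."""
    # Assumption: the match is case-insensitive; the description only says
    # that both "llava" and "vila" are checked for in the model name.
    name = model_name.lower()
    return "llava" in name or "vila" in name


def llm_loading_steps(precision: str) -> List[str]:
    """Name the calls applied to model.llm for a precision (loads nothing)."""
    if precision == "W16A16":
        # Afterwards the full model is moved to the device and model.eval() is called.
        return ["load_checkpoint_and_dispatch"]
    if precision == "W4A16":
        return ["load_awq_model", "make_quant_attn", "make_quant_norm"]
    raise ValueError(f"unsupported precision: {precision!r}")
```

Rejecting unknown precision strings up front is a choice made for this sketch; the description does not say how the worker treats other values.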
Usage
Run this module as a standalone FastAPI server for VILA-architecture models. Use this instead of the standard model_worker when your model directory follows the VILA layout with a separate llm subdirectory.
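One way to decide between the two workers is to test for the llm subdirectory before launching. The helpers below are hypothetical, and the module path tinychat.serve.model_worker for the standard worker is an assumption based on its file name.

```python
import os
import tempfile


def has_vila_layout(model_path: str) -> bool:
    """True when model_path contains the separate 'llm' subdirectory."""
    return os.path.isdir(os.path.join(model_path, "llm"))


def pick_worker_module(model_path: str) -> str:
    """Choose which worker module to run with `python -m`."""
    if has_vila_layout(model_path):
        return "tinychat.serve.model_worker_new"
    # Assumption: the standard worker lives next to the VILA worker.
    return "tinychat.serve.model_worker"


if __name__ == "__main__":
    # Demonstrate on a throwaway directory that mimics the VILA layout.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "llm"))
        print(pick_worker_module(root))
```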
Code Reference
Source Location
- Repository: Mit_han_lab_Llm_awq
- File: tinychat/serve/model_worker_new.py
- Lines: 1-446
Signature
    class ModelWorker:
        def __init__(
            self,
            controller_addr: str,
            worker_addr: str,
            worker_id: str,
            no_register: bool,
            model_type: str,
            model_path: str,
            model_name: str,
            quant_path: str,
            precision: str,
            device: str,
        ): ...
        def register_to_controller(self) -> None: ...
        def send_heart_beat(self) -> None: ...
        def get_queue_length(self) -> int: ...
        def get_status(self) -> dict: ...
        @torch.inference_mode()
        def generate_stream(self, params: dict) -> Generator[bytes, None, None]: ...
        def generate_stream_gate(self, params: dict) -> Generator[bytes, None, None]: ...
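The relationship between generate_stream and generate_stream_gate can be illustrated with a minimal sketch of an error-wrapped gate. It assumes that failures are reported as one final JSON chunk terminated by a null byte; the function name stream_gate, the message text, and the error code value 1 are all hypothetical and not taken from the source.

```python
import json
from typing import Callable, Dict, Iterator

# Hypothetical values, chosen only for this illustration.
ERROR_TEXT = "server error during generation"
ERROR_CODE = 1


def stream_gate(
    generate: Callable[[Dict], Iterator[bytes]], params: Dict
) -> Iterator[bytes]:
    """Forward chunks from `generate`; turn any exception into an error chunk."""
    try:
        for chunk in generate(params):
            yield chunk
    except Exception as exc:  # broad on purpose: the stream should end cleanly
        ret = {"text": f"{ERROR_TEXT}: {exc}", "error_code": ERROR_CODE}
        yield json.dumps(ret).encode() + b"\0"
```

Wrapping the generator this way lets the client always receive a well-formed final chunk instead of a dropped connection.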
Import
# Run as a standalone VILA model worker:
# python -m tinychat.serve.model_worker_new \
# --model-path /path/to/vila-model \
# --quant-path /path/to/vila-awq-weights.pt \
# --precision W4A16 \
# --controller-address http://localhost:21001 \
# --port 21002
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| controller_addr | str | Yes | URL of the controller (CLI default: http://localhost:21001) |
| worker_addr | str | Yes | This worker's URL, used by the controller to reach back (CLI default: http://localhost:21002) |
| model_type | str | Yes | Base language model type, e.g. "LLaMa" |
| model_path | str | Yes | Path to the VILA model directory (must contain an "llm" subdirectory with the tokenizer) |
| quant_path | str | No | Path to AWQ quantized weights for the LLM backbone (required for W4A16) |
| precision | str | Yes | Quantization precision: "W16A16" (full) or "W4A16" (AWQ 4-bit) |
| device | str | Yes | Target device, e.g. "cuda" |
| params.prompt | str | Yes | The text prompt including image token placeholders |
| params.images | List[str] | No | Base64-encoded images corresponding to <image> tokens in the prompt |
| params.temperature | float | No | Sampling temperature (default: 1.0) |
| params.top_p | float | No | Top-p nucleus sampling parameter (default: 1.0) |
| params.max_new_tokens | int | No | Maximum tokens to generate (default: 256, capped at 1024) |
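A client-side request body matching the params.* rows above could be assembled as in the following sketch. build_params is a hypothetical helper; clamping max_new_tokens on the client and checking the number of <image> placeholders are assumptions added for illustration, since the table only states the defaults and the 1024 cap.

```python
import base64
from typing import Dict, List, Optional

MAX_NEW_TOKENS_CAP = 1024  # cap stated in the inputs table


def build_params(
    prompt: str,
    image_bytes: Optional[List[bytes]] = None,
    temperature: float = 1.0,
    top_p: float = 1.0,
    max_new_tokens: int = 256,
) -> Dict:
    """Build the JSON-serializable body for a generation request."""
    images = [base64.b64encode(b).decode("ascii") for b in (image_bytes or [])]
    # Assumed sanity check: one <image> placeholder per supplied image.
    if prompt.count("<image>") != len(images):
        raise ValueError("number of <image> tokens must match number of images")
    params: Dict = {
        "prompt": prompt,
        "temperature": temperature,
        "top_p": top_p,
        "max_new_tokens": min(max_new_tokens, MAX_NEW_TOKENS_CAP),
    }
    if images:
        params["images"] = images
    return params
```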
Outputs
| Name | Type | Description |
|---|---|---|
| streaming_response | StreamingResponse | Stream of JSON chunks with "text" and "error_code" fields, each chunk terminated by a null byte |
| status | dict | Worker status with keys "model_names" (List[str]), "speed" (int), "queue_length" (int) |
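On the consuming side, the streamed bytes have to be split back into JSON chunks. The sketch below assumes each chunk is a UTF-8 JSON object followed by a null byte, as the outputs table indicates; iter_chunks is a hypothetical helper that works on any iterable of byte pieces, such as the pieces yielded by an HTTP client in streaming mode.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_chunks(byte_stream: Iterable[bytes]) -> Iterator[Dict]:
    """Yield decoded JSON objects from a null-byte-delimited byte stream."""
    buffer = b""
    for piece in byte_stream:
        buffer += piece
        # A network read may hold a partial chunk or several chunks.
        while b"\0" in buffer:
            raw, buffer = buffer.split(b"\0", 1)
            if raw:  # skip empty segments between consecutive delimiters
                yield json.loads(raw.decode("utf-8"))
```

Buffering across pieces matters because chunk boundaries and network read boundaries are not guaranteed to line up.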
Usage Examples
Launching a VILA AWQ Worker
# python -m tinychat.serve.model_worker_new \
# --model-path /models/VILA-7b \
# --quant-path /models/VILA-7b-awq.pt \
# --precision W4A16 \
# --controller-address http://localhost:21001 \
# --worker-address http://localhost:21002 \
# --port 21002
Launching a Full-Precision VILA Worker
# python -m tinychat.serve.model_worker_new \
# --model-path /models/VILA-7b \
# --precision W16A16 \
# --controller-address http://localhost:21001 \
# --port 21003
Key Differences from Standard Model Worker
# Standard worker (model_worker.py):
# - Uses LlavaLlamaForCausalLM
# - Tokenizer loaded from model_path directly
# - Vision tower accessed via model.get_model().vision_tower
# - Quantization applied to entire model
# VILA worker (model_worker_new.py):
# - Uses VilaLlamaForCausalLM
# - Tokenizer loaded from model_path/llm subdirectory
# - Vision tower accessed via model.get_vision_tower()
# - Quantization applied to model.llm only