Implementation:Mit han lab Llm awq InternVLStreamGenerator

Knowledge Sources	Mit_han_lab_Llm_awq
Domains	NLP, Inference
Last Updated	2026-02-15 00:00 GMT

Overview

Token-by-token streaming text generation for InternVL3 multimodal models, handling both image and video media through special token placeholders (IMG_START_TOKEN, IMG_END_TOKEN, IMG_CONTEXT_TOKEN).

Description

InternVLStreamGenerator is a generator function decorated with @torch.inference_mode() that provides streaming inference for InternVL3 vision-language models. It follows the streaming pattern shared by the other generators in the codebase and adds InternVL-specific media token expansion for both images and videos.

Before tokenization, the function expands <image> placeholders in the input text into sequences of InternVL-specific tokens. For each image or video in the media dictionary, it computes num_patches from the tensor size, then constructs a token string of the form IMG_START_TOKEN + IMG_CONTEXT_TOKEN * (NUM_IMAGE_TOKEN * num_patches) + IMG_END_TOKEN, where NUM_IMAGE_TOKEN is fixed at 256. Each token string replaces the corresponding <image> placeholder, in order of appearance. The function also sets img_context_token_id on the model to the tokenizer's ID for IMG_CONTEXT_TOKEN, so the model can identify visual token positions.
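The expansion described above can be sketched in plain Python. This is a minimal illustration, not the repository's code: the helper name expand_image_placeholders and the patch_counts argument are hypothetical, and the patch counts are passed in directly rather than derived from tensors.

```python
from typing import List

IMG_START_TOKEN = "<img>"
IMG_END_TOKEN = "</img>"
IMG_CONTEXT_TOKEN = "<IMG_CONTEXT>"
NUM_IMAGE_TOKEN = 256  # visual tokens per patch, as stated on this page


def expand_image_placeholders(prompt: str, patch_counts: List[int]) -> str:
    """Replace "<image>" placeholders, left to right, with InternVL token runs.

    Hypothetical helper: patch_counts[i] stands in for the num_patches value
    that the real function derives from the i-th media tensor.
    """
    for num_patches in patch_counts:
        image_tokens = (
            IMG_START_TOKEN
            + IMG_CONTEXT_TOKEN * (NUM_IMAGE_TOKEN * num_patches)
            + IMG_END_TOKEN
        )
        # count=1 replaces only the first placeholder that is still present
        prompt = prompt.replace("<image>", image_tokens, 1)
    return prompt


expanded = expand_image_placeholders("<image>\nWhat is shown?", [2])
```

With two patches, the single placeholder becomes one <img> ... </img> span containing 512 copies of IMG_CONTEXT_TOKEN.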

The function handles both "image" and "video" keys in the media dictionary independently, processing each with the same patch-based token expansion logic. This unified approach allows InternVL3 to handle mixed media types within a single prompt.

Like NVILAStreamGenerator, it delegates forward passes to model.stream_gen(), which returns logits and consumed sequence length. Logits processing uses prepare_logits_processor from llava_stream_gen for temperature scaling, top-p, and top-k filtering. The function includes numerical stability checks for Inf values in logits and Inf/NaN in probabilities.
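To make the sampling step concrete, the following is a plain-Python stand-in for temperature scaling, top-k, and top-p filtering with the stability checks mentioned above. It does not use prepare_logits_processor; the function name sample_next_token, the default values, the filter order, and the choice to raise ValueError on bad values are all assumptions of this sketch, not the behavior of the source file.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=40, top_p=0.9, rng=random):
    """Pick one token index from a list of float logits (illustrative only)."""
    # Stability check on logits: this sketch rejects Inf values outright.
    if any(math.isinf(x) for x in logits):
        raise ValueError("logits contain Inf")

    if temperature <= 0:
        # Treat non-positive temperature as greedy decoding (assumption).
        return max(range(len(logits)), key=lambda i: logits[i])

    scaled = [x / temperature for x in logits]

    # Top-k: keep the k highest-scoring indices (k <= 0 disables the filter).
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Softmax over the kept indices, shifted by the max for stability.
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Stability check on probabilities: reject Inf or NaN.
    if any(math.isinf(p) or math.isnan(p) for p in probs):
        raise ValueError("probabilities contain Inf or NaN")

    # Top-p: keep the smallest prefix whose cumulative probability >= top_p.
    kept_ids, kept_probs, cumulative = [], [], 0.0
    for idx, p in zip(order, probs):
        kept_ids.append(idx)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break

    # random.choices renormalizes the weights of the surviving candidates.
    return rng.choices(kept_ids, weights=kept_probs, k=1)[0]
```

Passing a seeded random.Random instance as rng makes the draw reproducible, which can help when comparing outputs across runs.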

For multi-turn conversations with chunk_prefilling, the input is prepended with <|im_start|> when start_pos is non-zero, consistent with the Qwen2 chat template used by InternVL3's language backbone.
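The prefix rule can be written as a small helper. This is only a sketch of the condition as described in the paragraph above; the name prepare_turn_input is hypothetical and the real function applies the rule inline.

```python
IM_START = "<|im_start|>"  # turn-start marker of the Qwen2 chat template


def prepare_turn_input(text: str, start_pos: int, chunk_prefilling: bool) -> str:
    """Prepend the turn-start marker only on follow-up turns with chunk prefilling."""
    if chunk_prefilling and start_pos != 0:
        return IM_START + text
    return text
```

On the first turn (start_pos equal to 0) the text passes through unchanged.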

Usage

Use this generator for streaming inference with InternVL3 models in the internvl_demo.py interactive demo. The demo sets it as the stream generator and calls it in the main chat loop with the preprocessed media and media configuration.

Code Reference

Source Location

Repository: Mit_han_lab_Llm_awq
File: tinychat/stream_generators/internvl_stream_gen.py
Lines: 1-204

Signature

@torch.inference_mode()
def InternVLStreamGenerator(
    model,
    gen_params,
    input: str,
    media=None,
    media_cfg=None,
    start_pos: int = 0,
    device: str = "cuda:0",
    stream_interval: int = 2,
    echo: bool = False,
    stop_token_ids=[],
    image_tensor: Optional[torch.FloatTensor] = None,
    chunk_prefilling: bool = False,
    quant_llm: bool = False,
):

Import

from tinychat.stream_generators.internvl_stream_gen import InternVLStreamGenerator

I/O Contract

Inputs

Parameter	Type	Description
model	nn.Module	InternVL3 model exposing .stream_gen() and .tokenizer; the function sets .img_context_token_id on it
gen_params	object	Generation params with attrs: temp, repeat_penalty, top_p, top_k, n_vocab, n_predict
input	str	Text prompt with `<image>` placeholders
media	dict or None	Dict with "image" and/or "video" keys mapping to lists of tensors
media_cfg	dict or None	Media configuration metadata
start_pos	int	Starting KV cache position (default: 0)
device	str	CUDA device string (default: "cuda:0")
stream_interval	int	Yield decoded text every N tokens (default: 2)
echo	bool	If True, include input tokens in output
stop_token_ids	list[int]	Additional stop token IDs beyond EOS
image_tensor	Optional[torch.FloatTensor]	Legacy parameter for compatibility (unused)
chunk_prefilling	bool	Enable chunk prefilling for multi-turn speedup
quant_llm	bool	Whether the LLM backbone is quantized

Outputs

Yields	Type	Description
result	dict	Keys: "text" (str), "usage" (dict with prompt_tokens, completion_tokens, total_tokens), "finish_reason" (None/"stop"/"length"), "timing" (dict or None with context_tokens, context_time, total_tokens, generation_time_list)
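Because "timing" may be None, callers may want to guard before reading it. The helper below is a hedged sketch of how the timing dict could be summarized; the name summarize_timing is hypothetical, and it assumes that context_time is in seconds and that generation_time_list holds one duration in seconds per decoding step, neither of which this page confirms.

```python
from typing import Optional


def summarize_timing(timing: Optional[dict]) -> Optional[dict]:
    """Derive prefill throughput and mean decode latency from a timing dict."""
    if timing is None:
        return None

    step_times = timing["generation_time_list"]  # assumed: seconds per step
    context_time = timing["context_time"]        # assumed: seconds

    return {
        # Prompt tokens processed per second during prefill.
        "prefill_tokens_per_s": (
            timing["context_tokens"] / context_time if context_time > 0 else None
        ),
        # Average wall-clock time of one decoding step.
        "mean_step_latency_s": (
            sum(step_times) / len(step_times) if step_times else None
        ),
    }
```

A caller could apply this to the last yielded result, where finish_reason is "stop" or "length".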

InternVL Token Expansion Constants

Constant	Value	Description
IMG_START_TOKEN	`<img>`	Marks the beginning of an image token sequence
IMG_END_TOKEN	`</img>`	Marks the end of an image token sequence
IMG_CONTEXT_TOKEN	`<IMG_CONTEXT>`	Placeholder repeated for each visual token
NUM_IMAGE_TOKEN	256	Number of visual tokens per image patch

Usage Examples

from tinychat.stream_generators.internvl_stream_gen import InternVLStreamGenerator

# Streaming generation with InternVL3 model (image input)
media = {"image": [image_tensor]}  # preprocessed image patches
for output in InternVLStreamGenerator(
    model=internvl_model,
    gen_params=gen_params,
    input="<image>\nWhat do you see in this image?",
    media=media,
    media_cfg=media_cfg,
    start_pos=0,
    device="cuda:0",
    stop_token_ids=[],
    chunk_prefilling=True,
    quant_llm=True,
):
    if output["finish_reason"] is None:
        print(output["text"], end="", flush=True)
    else:
        print(f"\nFinished: {output['finish_reason']}")
        # "timing" may be None, so check before reading context_time
        if output["timing"] is not None:
            print(f"Context time: {output['timing']['context_time']:.3f}s")

# Video input example
media = {"video": [video_frames_tensor]}
for output in InternVLStreamGenerator(
    model=internvl_model,
    gen_params=gen_params,
    input="<image>\nDescribe what happens in this video.",
    media=media,
    media_cfg=media_cfg,
    start_pos=0,
    device="cuda:0",
    stop_token_ids=[],
):
    print(output["text"], end="", flush=True)

Related Pages

Principle:Mit_han_lab_Llm_awq_Streaming_Text_Generation
