Implementation:Mit han lab Llm awq InternVLStreamGenerator

From Leeroopedia
Knowledge Sources
Domains NLP, Inference
Last Updated 2026-02-15 00:00 GMT

Overview

Token-by-token streaming text generation for InternVL3 multimodal models, handling both image and video media through special token placeholders (IMG_START_TOKEN, IMG_END_TOKEN, IMG_CONTEXT_TOKEN).

Description

InternVLStreamGenerator is a generator function decorated with @torch.inference_mode() that provides streaming inference for InternVL3 vision-language models. It follows the streaming pattern shared by the other generators in the codebase and adds InternVL-specific media token expansion for both images and videos.

Before tokenization, the function expands <image> placeholders in the input text into sequences of InternVL-specific tokens. For each image or video in the media dictionary, it computes num_patches from the tensor size, then constructs a token string of the form <img> + IMG_CONTEXT_TOKEN * (NUM_IMAGE_TOKEN * num_patches) + </img>, where NUM_IMAGE_TOKEN is fixed at 256. This token string replaces the corresponding <image> placeholder in order. The img_context_token_id is set on the model by converting IMG_CONTEXT_TOKEN via the tokenizer, enabling the model to identify visual token positions.
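
The expansion described above can be sketched in plain Python. This is a minimal illustration rather than the code in internvl_stream_gen.py: the helper name expand_image_placeholders is hypothetical, and patch counts are passed in as integers because the source only says that num_patches is computed from the tensor size.

```python
IMG_START_TOKEN = "<img>"
IMG_END_TOKEN = "</img>"
IMG_CONTEXT_TOKEN = "<IMG_CONTEXT>"
NUM_IMAGE_TOKEN = 256  # visual tokens per patch, as documented on this page


def expand_image_placeholders(text, patch_counts):
    """Replace each "<image>" in order with its expanded token string.

    Hypothetical helper for illustration; patch_counts[i] is the number
    of patches for the i-th media item.
    """
    for num_patches in patch_counts:
        image_tokens = (
            IMG_START_TOKEN
            + IMG_CONTEXT_TOKEN * (NUM_IMAGE_TOKEN * num_patches)
            + IMG_END_TOKEN
        )
        # A count of 1 replaces only the first remaining placeholder.
        text = text.replace("<image>", image_tokens, 1)
    return text


expanded = expand_image_placeholders("<image>\nWhat is shown?", [2])
print(expanded.count(IMG_CONTEXT_TOKEN))  # 512
```

With two patches, the single placeholder becomes 2 * 256 = 512 context tokens wrapped in <img> and </img>.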

The function handles both "image" and "video" keys in the media dictionary independently, processing each with the same patch-based token expansion logic. This unified approach allows InternVL3 to handle mixed media types within a single prompt.

Like NVILAStreamGenerator, it delegates forward passes to model.stream_gen(), which returns logits and consumed sequence length. Logits processing uses prepare_logits_processor from llava_stream_gen for temperature scaling, top-p, and top-k filtering. The function includes numerical stability checks for Inf values in logits and Inf/NaN in probabilities.
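
To make the sampling path concrete, here is a pure-Python sketch of temperature scaling followed by top-k and top-p filtering, with a guard against non-finite logits. It does not reproduce prepare_logits_processor or the generator's actual checks. The name filter_probs and the choice to raise ValueError on Inf/NaN are assumptions made for illustration, and temperature is assumed to be greater than zero.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k and top-p.

    Illustrative only: filter_probs is a hypothetical name, and raising
    ValueError on non-finite values is an assumed policy.
    """
    if any(math.isinf(x) or math.isnan(x) for x in logits):
        raise ValueError("non-finite logit")

    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating to avoid overflow.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank token ids from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


dist = filter_probs([2.0, 1.0, 0.1, -1.0], temperature=0.7, top_k=3, top_p=0.9)
print(sorted(dist))
```

A sampler would then draw the next token id from the returned distribution.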

For multi-turn conversations with chunk_prefilling enabled, <|im_start|> is prepended to the input when start_pos is non-zero, consistent with the Qwen2 chat template used by InternVL3's language backbone.
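
The prefixing rule can be written as a small helper. This is a sketch under one reading of the paragraph above, namely that the marker is added only when chunk_prefilling is enabled and start_pos is non-zero. The name prepare_turn_input is hypothetical and does not appear in the source.

```python
IM_START = "<|im_start|>"


def prepare_turn_input(text, start_pos, chunk_prefilling):
    """Hypothetical helper: prefix later turns with the chat start marker.

    Assumes the marker is needed only when chunk prefilling is on and the
    KV cache already holds earlier turns (start_pos != 0).
    """
    if chunk_prefilling and start_pos != 0:
        return IM_START + text
    return text


print(prepare_turn_input("user\nAnd the second image?", 128, True))
```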

Usage

Use this generator for streaming inference with InternVL3 models in the internvl_demo.py interactive demo. It is set as the stream generator and called in the main chat loop with preprocessed media and media configuration.

Code Reference

Source Location

Signature

@torch.inference_mode()
def InternVLStreamGenerator(
    model,
    gen_params,
    input: str,
    media=None,
    media_cfg=None,
    start_pos: int = 0,
    device: str = "cuda:0",
    stream_interval: int = 2,
    echo: bool = False,
    stop_token_ids=[],
    image_tensor: Optional[torch.FloatTensor] = None,
    chunk_prefilling: bool = False,
    quant_llm: bool = False,
):

Import

from tinychat.stream_generators.internvl_stream_gen import InternVLStreamGenerator

I/O Contract

Inputs

Parameter | Type | Description
model | nn.Module | InternVL3 model with .stream_gen(), .tokenizer, and .img_context_token_id attributes
gen_params | object | Generation params with attrs: temp, repeat_penalty, top_p, top_k, n_vocab, n_predict
input | str | Text prompt with <image> placeholders
media | dict or None | Dict with "image" and/or "video" keys mapping to lists of tensors
media_cfg | dict or None | Media configuration metadata
start_pos | int | Starting KV cache position (default: 0)
device | str | CUDA device string (default: "cuda:0")
stream_interval | int | Yield decoded text every N tokens (default: 2)
echo | bool | If True, include input tokens in output
stop_token_ids | list[int] | Additional stop token IDs beyond EOS
image_tensor | Optional[torch.FloatTensor] | Legacy parameter for compatibility (unused)
chunk_prefilling | bool | Enable chunk prefilling for multi-turn speedup
quant_llm | bool | Whether the LLM backbone is quantized

Outputs

Yields | Type | Description
result | dict | Keys: "text" (str), "usage" (dict with prompt_tokens, completion_tokens, total_tokens), "finish_reason" (None/"stop"/"length"), "timing" (dict or None with context_tokens, context_time, total_tokens, generation_time_list)
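
A consumer only needs the keys listed in the Outputs table above. The sketch below drains a stream and keeps the final result; collect_final and fake_stream are hypothetical names, and fake_stream is a stand-in with made-up values so the example runs without a model.

```python
def collect_final(stream):
    """Drain a stream of result dicts and return the last one.

    Hypothetical helper; it relies only on the documented output keys.
    """
    last = None
    for result in stream:
        last = result
        if result["finish_reason"] is not None:
            break
    return last


def fake_stream():
    # Stand-in for InternVLStreamGenerator; all values here are made up.
    usage = {"prompt_tokens": 3, "completion_tokens": 2, "total_tokens": 5}
    yield {"text": "A cat", "usage": usage, "finish_reason": None, "timing": None}
    yield {"text": "sits.", "usage": usage, "finish_reason": "stop", "timing": None}


final = collect_final(fake_stream())
timing = final["timing"] or {}  # "timing" may be None
print(final["finish_reason"], timing.get("context_time"))
```

Falling back to an empty dict keeps the timing lookup safe when no timing information is attached.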

InternVL Token Expansion Constants

Constant | Value | Description
IMG_START_TOKEN | <img> | Marks the beginning of an image token sequence
IMG_END_TOKEN | </img> | Marks the end of an image token sequence
IMG_CONTEXT_TOKEN | <IMG_CONTEXT> | Placeholder repeated for each visual token
NUM_IMAGE_TOKEN | 256 | Number of visual tokens per image patch

Usage Examples

from tinychat.stream_generators.internvl_stream_gen import InternVLStreamGenerator

# Streaming generation with InternVL3 model (image input)
media = {"image": [image_tensor]}  # preprocessed image patches
for output in InternVLStreamGenerator(
    model=internvl_model,
    gen_params=gen_params,
    input="<image>\nWhat do you see in this image?",
    media=media,
    media_cfg=media_cfg,
    start_pos=0,
    device="cuda:0",
    stop_token_ids=[],
    chunk_prefilling=True,
    quant_llm=True,
):
    if output["finish_reason"] is None:
        print(output["text"], end="", flush=True)
    else:
        print(f"\nFinished: {output['finish_reason']}")
        # "timing" may be None, so guard before indexing into it
        if output["timing"] is not None:
            print(f"Context time: {output['timing']['context_time']:.3f}s")

# Video input example
media = {"video": [video_frames_tensor]}
for output in InternVLStreamGenerator(
    model=internvl_model,
    gen_params=gen_params,
    input="<image>\nDescribe what happens in this video.",
    media=media,
    media_cfg=media_cfg,
    start_pos=0,
    device="cuda:0",
    stop_token_ids=[],
):
    print(output["text"], end="", flush=True)
