Implementation:Mit han lab Llm awq InternVLStreamGenerator
| Knowledge Sources | |
|---|---|
| Domains | NLP, Inference |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Token-by-token streaming text generation for InternVL3 multimodal models, handling both image and video media through special token placeholders (IMG_START_TOKEN, IMG_END_TOKEN, IMG_CONTEXT_TOKEN).
Description
InternVLStreamGenerator is a generator function decorated with @torch.inference_mode() that provides streaming inference for InternVL3 vision-language models. It extends the common streaming pattern used by other generators in the codebase but adds InternVL-specific media token expansion for both images and videos.
Before tokenization, the function expands <image> placeholders in the input text into sequences of InternVL-specific tokens. For each image or video in the media dictionary, it computes num_patches from the tensor size, then constructs a token string of the form <img> + IMG_CONTEXT_TOKEN * (NUM_IMAGE_TOKEN * num_patches) + </img>, where NUM_IMAGE_TOKEN is fixed at 256. This token string replaces the corresponding <image> placeholder in order. The img_context_token_id is set on the model by converting IMG_CONTEXT_TOKEN via the tokenizer, enabling the model to identify visual token positions.
The function handles both "image" and "video" keys in the media dictionary independently, processing each with the same patch-based token expansion logic. This unified approach allows InternVL3 to handle mixed media types within a single prompt.
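The placeholder expansion described above can be sketched in plain Python. This is a minimal illustration, not the repository code: `expand_media_placeholders` and its `patch_counts` argument are hypothetical names, and the per-item patch counts are passed in directly instead of being computed from tensor sizes.

```python
IMG_START_TOKEN = "<img>"
IMG_END_TOKEN = "</img>"
IMG_CONTEXT_TOKEN = "<IMG_CONTEXT>"
NUM_IMAGE_TOKEN = 256  # visual tokens per patch, as documented on this page


def expand_media_placeholders(text, patch_counts, num_image_token=NUM_IMAGE_TOKEN):
    """Replace each "<image>" placeholder, in order, with its expanded token string.

    patch_counts is a hypothetical stand-in for the num_patches values that the
    real function derives from each image or video tensor.
    """
    for num_patches in patch_counts:
        image_tokens = (
            IMG_START_TOKEN
            + IMG_CONTEXT_TOKEN * (num_image_token * num_patches)
            + IMG_END_TOKEN
        )
        # count=1 so each media item consumes exactly one placeholder
        text = text.replace("<image>", image_tokens, 1)
    return text


expanded = expand_media_placeholders("<image>\nWhat do you see?", [2])
print(expanded.count(IMG_CONTEXT_TOKEN))  # 256 * 2 context tokens
```

Because the same helper would be applied to the "image" list and then to the "video" list, a prompt mixing both media types needs one `<image>` placeholder per media item.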
Like NVILAStreamGenerator, it delegates forward passes to model.stream_gen(), which returns logits and consumed sequence length. Logits processing uses prepare_logits_processor from llava_stream_gen for temperature scaling, top-p, and top-k filtering. The function includes numerical stability checks for Inf values in logits and Inf/NaN in probabilities.
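To make the sampling step concrete, the following pure-Python sketch applies temperature scaling, top-k, and top-p filtering with finiteness checks. It is not `prepare_logits_processor` and does not use torch; the function name `sample_next_token` is hypothetical, and raising `ValueError` on non-finite values is an assumption, since this page does not say how the real function reacts to them.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=40, top_p=0.9, rng=random):
    """Sample one token id from a list of raw logits (illustration only)."""
    # Stability check on the raw logits (assumed behavior: fail loudly)
    if any(math.isinf(x) or math.isnan(x) for x in logits):
        raise ValueError("non-finite value in logits")

    if temperature <= 0:
        # Treat a non-positive temperature as greedy decoding
        return max(range(len(logits)), key=logits.__getitem__)

    scaled = [x / temperature for x in logits]

    # top-k: keep the k highest-scoring token ids (top_k <= 0 disables it)
    order = sorted(range(len(scaled)), key=scaled.__getitem__, reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Softmax over the survivors, shifted by the maximum for stability
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # top-p: keep the smallest prefix whose cumulative mass reaches top_p
    kept, mass = 0, 0.0
    for p in probs:
        kept += 1
        mass += p
        if mass >= top_p:
            break
    order, probs = order[:kept], probs[:kept]

    # Stability check on the probabilities before sampling
    if any(math.isinf(p) or math.isnan(p) for p in probs):
        raise ValueError("non-finite value in probabilities")

    return rng.choices(order, weights=probs, k=1)[0]
```

Subtracting the maximum before exponentiating keeps the softmax from overflowing, which is one common reason Inf or NaN values appear in probabilities.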
For multi-turn conversations with chunk_prefilling, the input is prepended with <|im_start|> when start_pos is non-zero, consistent with the Qwen2 chat template used by InternVL3's language backbone.
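The prefix rule can be written as a small helper. This is a sketch of the condition as stated above; `apply_chunk_prefill_prefix` is a hypothetical name, not a function from the repository.

```python
IM_START = "<|im_start|>"  # Qwen2 chat-template turn marker


def apply_chunk_prefill_prefix(text, start_pos, chunk_prefilling):
    """Prepend the turn marker only for follow-up turns under chunk prefilling."""
    if chunk_prefilling and start_pos != 0:
        return IM_START + text
    return text


print(apply_chunk_prefill_prefix("user\nAnd the second image?", 128, True))
```

On the first turn (`start_pos == 0`), or with `chunk_prefilling` disabled, the text is returned unchanged.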
Usage
Use this generator for streaming inference with InternVL3 models in the internvl_demo.py interactive demo. It is set as the stream generator and called in the main chat loop with preprocessed media and media configuration.
Code Reference
Source Location
- Repository: Mit_han_lab_Llm_awq
- File: tinychat/stream_generators/internvl_stream_gen.py
- Lines: 1-204
Signature
@torch.inference_mode()
def InternVLStreamGenerator(
    model,
    gen_params,
    input: str,
    media=None,
    media_cfg=None,
    start_pos: int = 0,
    device: str = "cuda:0",
    stream_interval: int = 2,
    echo: bool = False,
    stop_token_ids=[],
    image_tensor: Optional[torch.FloatTensor] = None,
    chunk_prefilling: bool = False,
    quant_llm: bool = False,
):
Import
from tinychat.stream_generators.internvl_stream_gen import InternVLStreamGenerator
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| model | nn.Module | InternVL3 model with .stream_gen(), .tokenizer, and .img_context_token_id attributes |
| gen_params | object | Generation params with attrs: temp, repeat_penalty, top_p, top_k, n_vocab, n_predict |
| input | str | Text prompt with <image> placeholders |
| media | dict or None | Dict with "image" and/or "video" keys mapping to lists of tensors |
| media_cfg | dict or None | Media configuration metadata |
| start_pos | int | Starting KV cache position (default: 0) |
| device | str | CUDA device string (default: "cuda:0") |
| stream_interval | int | Yield decoded text every N tokens (default: 2) |
| echo | bool | If True, include input tokens in output |
| stop_token_ids | list[int] | Additional stop token IDs beyond EOS |
| image_tensor | Optional[torch.FloatTensor] | Legacy parameter for compatibility (unused) |
| chunk_prefilling | bool | Enable chunk prefilling for multi-turn speedup |
| quant_llm | bool | Whether the LLM backbone is quantized |
Outputs
| Yields | Type | Description |
|---|---|---|
| result | dict | Keys: "text" (str), "usage" (dict with prompt_tokens, completion_tokens, total_tokens), "finish_reason" (None/"stop"/"length"), "timing" (dict or None with context_tokens, context_time, total_tokens, generation_time_list) |
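A consumer of this contract has to tolerate `timing` being `None`. The sketch below shows one way to do that; `fake_stream` is a hypothetical stand-in that only mimics the documented result shape with made-up values, and `summarize` is an illustrative helper, not part of TinyChat.

```python
def fake_stream():
    """Stand-in generator that mimics the documented result shape."""
    yield {
        "text": "A cat",
        "usage": {"prompt_tokens": 260, "completion_tokens": 2, "total_tokens": 262},
        "finish_reason": None,
        "timing": None,  # intermediate results may carry no timing
    }
    yield {
        "text": "A cat on a sofa.",
        "usage": {"prompt_tokens": 260, "completion_tokens": 6, "total_tokens": 266},
        "finish_reason": "stop",
        "timing": {
            "context_tokens": 260,
            "context_time": 0.12,
            "total_tokens": 6,
            "generation_time_list": [0.01] * 6,
        },
    }


def summarize(stream):
    """Drain the stream and report the final state, tolerating timing=None."""
    last = None
    for result in stream:
        last = result
    if last is None:
        return None
    timing = last.get("timing") or {}
    return {
        "finish_reason": last["finish_reason"],
        "total_tokens": last["usage"]["total_tokens"],
        "context_time": timing.get("context_time"),
    }


print(summarize(fake_stream()))
```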
InternVL Token Expansion Constants
| Constant | Value | Description |
|---|---|---|
| IMG_START_TOKEN | <img> | Marks the beginning of an image token sequence |
| IMG_END_TOKEN | </img> | Marks the end of an image token sequence |
| IMG_CONTEXT_TOKEN | <IMG_CONTEXT> | Placeholder repeated for each visual token |
| NUM_IMAGE_TOKEN | 256 | Number of visual tokens per image patch |
Usage Examples
from tinychat.stream_generators.internvl_stream_gen import InternVLStreamGenerator

# Streaming generation with InternVL3 model (image input)
media = {"image": [image_tensor]}  # preprocessed image patches
for output in InternVLStreamGenerator(
    model=internvl_model,
    gen_params=gen_params,
    input="<image>\nWhat do you see in this image?",
    media=media,
    media_cfg=media_cfg,
    start_pos=0,
    device="cuda:0",
    stop_token_ids=[],
    chunk_prefilling=True,
    quant_llm=True,
):
    if output["finish_reason"] is None:
        print(output["text"], end="", flush=True)
    else:
        print(f"\nFinished: {output['finish_reason']}")
        # "timing" may be None (see Outputs), so guard before indexing
        if output["timing"] is not None:
            print(f"Context time: {output['timing']['context_time']:.3f}s")

# Video input example
media = {"video": [video_frames_tensor]}
for output in InternVLStreamGenerator(
    model=internvl_model,
    gen_params=gen_params,
    input="<image>\nDescribe what happens in this video.",
    media=media,
    media_cfg=media_cfg,
    start_pos=0,
    device="cuda:0",
    stop_token_ids=[],
):
    print(output["text"], end="", flush=True)