Implementation:Mit han lab Llm awq InternVisionModel
| Knowledge Sources | |
|---|---|
| Domains | Vision, Model_Architecture |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete tool, provided by the tinychat framework, for encoding images into visual feature embeddings using the InternViT (Intern Vision Transformer) architecture.
Description
InternVisionModel implements a standard Vision Transformer (ViT) design in two stages. First, InternVisionEmbeddings converts images into patch embeddings and adds positional encodings; bicubic interpolation of the positional encodings supports dynamic input resolution. Then InternVisionEncoder passes the embeddings through a stack of InternVisionEncoderLayer blocks. Each encoder layer applies multi-head attention (with optional Flash Attention 2 and QK normalization), an MLP, layer scaling, and stochastic depth (DropPath). Normalization can be either InternRMSNorm or standard LayerNorm, with optional acceleration through Apex FusedRMSNorm. When Flash Attention is enabled, FlashAttention supplies the optimized scaled dot-product attention computation.
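The layer structure described above can be sketched in plain PyTorch. This is a minimal illustration, not the code in internvit.py: the class names `RMSNorm`, `DropPath`, and `SketchEncoderLayer` are hypothetical, `nn.MultiheadAttention` stands in for the attention module (no Flash Attention, no QK normalization), and the pre-norm residual ordering and the sizes used are assumptions.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square normalization with a learned per-channel scale."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)


class DropPath(nn.Module):
    """Stochastic depth: drop an entire residual branch per sample in training."""

    def __init__(self, drop_prob: float = 0.0):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.drop_prob == 0.0 or not self.training:
            return x
        keep_prob = 1.0 - self.drop_prob
        mask_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        mask = x.new_empty(mask_shape).bernoulli_(keep_prob)
        return x * mask / keep_prob


class SketchEncoderLayer(nn.Module):
    """Hypothetical stand-in for one encoder block (assumed pre-norm layout)."""

    def __init__(self, dim: int = 64, heads: int = 4, mlp_ratio: int = 4,
                 drop_path: float = 0.0, init_scale: float = 0.1):
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = RMSNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )
        # Layer scaling: learned per-channel multipliers on each residual branch.
        self.ls1 = nn.Parameter(init_scale * torch.ones(dim))
        self.ls2 = nn.Parameter(init_scale * torch.ones(dim))
        self.drop_path1 = DropPath(drop_path)
        self.drop_path2 = DropPath(drop_path)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.drop_path1(attn_out * self.ls1)
        x = x + self.drop_path2(self.mlp(self.norm2(x)) * self.ls2)
        return x


layer = SketchEncoderLayer().eval()
tokens = torch.randn(2, 17, 64)   # (batch, seq_len, hidden_size)
out = layer(tokens)               # same shape as the input
```

The residual form keeps the token shape unchanged, which is what allows many such layers to be stacked inside an encoder.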
Usage
Import InternVisionModel when building the vision encoder component for InternVL3 multimodal models. This class is typically instantiated by the InternVL3 model class and is not used directly by end users.
Code Reference
Source Location
- Repository: Mit_han_lab_Llm_awq
- File: tinychat/models/internvl/internvit.py
- Lines: 1-425
Signature
class InternVisionModel(PreTrainedModel):
    main_input_name = 'pixel_values'
    _supports_flash_attn_2 = True
    supports_gradient_checkpointing = True
    config_class = InternVisionConfig
    _no_split_modules = ['InternVisionEncoderLayer']

    def __init__(self, config: InternVisionConfig):
        """
        Args:
            config: InternVisionConfig with vision transformer parameters.
        """

    def resize_pos_embeddings(self, old_size, new_size, patch_size) -> None:
        """Resize positional embeddings for different image resolutions."""

    def get_input_embeddings(self) -> InternVisionEmbeddings:
        """Return the patch embedding layer."""

    def forward(
        self,
        pixel_values: Optional[torch.FloatTensor] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        pixel_embeds: Optional[torch.FloatTensor] = None,
    ) -> Union[Tuple, BaseModelOutputWithPooling]:
        """
        Args:
            pixel_values: Input images (batch, channels, height, width).
            output_hidden_states: Whether to return all hidden states.
            return_dict: Whether to return a ModelOutput.
            pixel_embeds: Pre-computed embeddings (skip embedding layer).
        """
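To make the idea behind resize_pos_embeddings concrete, here is a minimal sketch of bicubic resizing of a ViT positional embedding table. It is an illustration under assumptions, not the method's actual body: the function name `resize_pos_embed_sketch` is hypothetical, it returns a new tensor rather than updating the model in place, and it assumes the table has shape (1, 1 + grid * grid, embed_dim) with the CLS position first.

```python
import torch
import torch.nn.functional as F


def resize_pos_embed_sketch(pos_emb: torch.Tensor, old_size: int,
                            new_size: int, patch_size: int) -> torch.Tensor:
    """Interpolate patch position embeddings to a new image resolution."""
    old_grid = old_size // patch_size
    new_grid = new_size // patch_size
    embed_dim = pos_emb.shape[-1]

    # Assumption: index 0 is the CLS position and is carried over unchanged.
    cls_emb, patch_emb = pos_emb[:, :1, :], pos_emb[:, 1:, :]

    # (1, N, C) -> (1, C, grid, grid) so the table can be resized like an image.
    patch_emb = patch_emb.reshape(1, old_grid, old_grid, embed_dim).permute(0, 3, 1, 2)
    patch_emb = F.interpolate(
        patch_emb.float(), size=(new_grid, new_grid),
        mode="bicubic", align_corners=False,
    )

    # Back to (1, new_grid * new_grid, C) in the original dtype.
    patch_emb = patch_emb.to(cls_emb.dtype).reshape(1, embed_dim, -1).permute(0, 2, 1)
    return torch.cat([cls_emb, patch_emb], dim=1)


# Illustrative sizes: 224 -> 448 pixels with an assumed patch size of 14.
old = torch.randn(1, 1 + 16 * 16, 32)
new = resize_pos_embed_sketch(old, old_size=224, new_size=448, patch_size=14)
```

Treating the patch positions as a 2D grid is what lets a table trained at one resolution be reused at another without retraining.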
Import
from tinychat.models.internvl.internvit import InternVisionModel
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pixel_values | torch.FloatTensor | Yes, unless pixel_embeds is given | Input images of shape (batch, channels, height, width) |
| output_hidden_states | bool | No | Whether to return hidden states from all layers |
| return_dict | bool | No | Whether to return a BaseModelOutputWithPooling instead of a plain tuple |
| pixel_embeds | torch.FloatTensor | No (alternative to pixel_values) | Pre-computed embeddings; when given, the embedding layer is skipped |
Outputs
| Name | Type | Description |
|---|---|---|
| last_hidden_state | torch.FloatTensor | Final encoder output of shape (batch, seq_len, hidden_size), where seq_len is num_patches + 1 for image input |
| pooler_output | torch.FloatTensor | CLS token output of shape (batch, hidden_size) |
| hidden_states | Tuple[torch.FloatTensor] | All layer hidden states (if output_hidden_states=True) |
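The seq_len in the output shapes follows from the patch grid. As a small worked example, the helper below computes it for a square image; the name `vision_sequence_length` is hypothetical, and the patch size of 14 is an assumption for illustration rather than a value stated in this page.

```python
def vision_sequence_length(image_size: int, patch_size: int) -> int:
    """Number of tokens for a square image: one per patch plus one CLS token."""
    grid = image_size // patch_size   # patches along each side
    return grid * grid + 1


# Assumed patch size of 14: a 448x448 image gives a 32x32 patch grid.
seq_len = vision_sequence_length(image_size=448, patch_size=14)
print(seq_len)  # 1025
```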
Usage Examples
Forward Pass
import torch
from tinychat.models.internvl.internvit import InternVisionModel
from tinychat.models.internvl.configuration_internvl import InternVisionConfig

# Initialize model. Flash Attention needs a CUDA device and half-precision
# inputs, hence .cuda().half() below.
config = InternVisionConfig(image_size=448, use_flash_attn=True)
vision_model = InternVisionModel(config).cuda().half().eval()

# Process images (random data stands in for a preprocessed image batch)
pixel_values = torch.randn(1, 3, 448, 448, device='cuda', dtype=torch.float16)
with torch.no_grad():  # inference only, no gradients needed
    output = vision_model(pixel_values=pixel_values)

# Access features
features = output.last_hidden_state  # (1, num_patches+1, 3200)
cls_token = output.pooler_output     # (1, 3200)
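Downstream code often needs the patch tokens separately from the CLS token. The sketch below shows that split on a stand-in tensor so it runs without a GPU or model weights; it assumes the CLS token sits at index 0 of the sequence and uses a small illustrative hidden size of 64 and an assumed patch size of 14.

```python
import torch

# Stand-in for output.last_hidden_state (hypothetical sizes for illustration).
batch, grid, hidden = 1, 32, 64   # 32 = 448 // 14, assuming patch size 14
last_hidden_state = torch.randn(batch, grid * grid + 1, hidden)

# Assumption: token 0 is the CLS token, the rest are patch tokens in row-major order.
cls_token = last_hidden_state[:, 0, :]       # (batch, hidden)
patch_tokens = last_hidden_state[:, 1:, :]   # (batch, grid * grid, hidden)

# Restore the 2D spatial layout of the patches.
feature_map = patch_tokens.reshape(batch, grid, grid, hidden)
```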