Implementation:Mit han lab Llm awq InternVisionModel

Knowledge Sources	Mit_han_lab_Llm_awq InternVL
Domains	Vision, Model_Architecture
Last Updated	2026-02-15 00:00 GMT

Overview

Concrete tool for encoding images into visual feature embeddings with the InternViT (Intern Vision Transformer) architecture, as implemented in the tinychat framework.

Description

InternVisionModel implements a Vision Transformer (ViT) with the standard two-stage design. First, InternVisionEmbeddings converts images to patch embeddings and adds positional encodings; dynamic resolution is supported by resampling the positional encodings with bicubic interpolation. Then InternVisionEncoder passes the embeddings through a stack of InternVisionEncoderLayer blocks. Each encoder layer applies multi-headed attention (with optional Flash Attention 2 and QK normalization) and an MLP, each combined with layer scaling and stochastic depth (DropPath). Normalization can be either InternRMSNorm or standard LayerNorm, with optional Apex FusedRMSNorm acceleration. A FlashAttention module provides the optimized scaled dot-product attention path.
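To make the per-layer structure concrete, the following is a minimal sketch of one encoder block with layer scaling and stochastic depth. It is not the code from internvit.py: the class names SketchDropPath and SketchEncoderLayer, the pre-norm residual ordering, the small dimensions, and the use of nn.MultiheadAttention as a stand-in for the attention module are all assumptions made for illustration.

```python
import torch
from torch import nn


class SketchDropPath(nn.Module):
    """Stochastic depth: randomly drops a whole residual branch per sample."""

    def __init__(self, drop_prob: float = 0.0):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Identity at inference time or when nothing is dropped.
        if self.drop_prob == 0.0 or not self.training:
            return x
        keep = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over the remaining dims.
        mask_shape = (x.shape[0],) + (1,) * (x.dim() - 1)
        mask = x.new_empty(mask_shape).bernoulli_(keep)
        return x * mask / keep


class SketchEncoderLayer(nn.Module):
    """Hypothetical block: attention and MLP branches, each with layer scale."""

    def __init__(self, dim: int = 64, heads: int = 4, drop_path: float = 0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Layer scale: one learnable per-channel multiplier per branch.
        self.ls1 = nn.Parameter(0.1 * torch.ones(dim))
        self.ls2 = nn.Parameter(0.1 * torch.ones(dim))
        self.drop_path = SketchDropPath(drop_path)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.drop_path(attn_out * self.ls1)
        x = x + self.drop_path(self.mlp(self.norm2(x)) * self.ls2)
        return x


layer = SketchEncoderLayer().eval()
tokens = torch.randn(2, 17, 64)  # (batch, seq_len, dim)
out = layer(tokens)
```

The block preserves the (batch, seq_len, dim) shape, which is what allows the encoder to stack an arbitrary number of such layers.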

Usage

Import InternVisionModel when building the vision encoder component for InternVL3 multimodal models. This class is typically instantiated by the InternVL3 model class and is not used directly by end users.

Code Reference

Source Location

Repository: Mit_han_lab_Llm_awq
File: tinychat/models/internvl/internvit.py
Lines: 1-425

Signature

class InternVisionModel(PreTrainedModel):
    main_input_name = 'pixel_values'
    _supports_flash_attn_2 = True
    supports_gradient_checkpointing = True
    config_class = InternVisionConfig
    _no_split_modules = ['InternVisionEncoderLayer']

    def __init__(self, config: InternVisionConfig):
        """
        Args:
            config: InternVisionConfig with vision transformer parameters.
        """

    def resize_pos_embeddings(self, old_size, new_size, patch_size) -> None:
        """Resize positional embeddings for different image resolutions."""

    def get_input_embeddings(self) -> InternVisionEmbeddings:
        """Return the patch embedding layer."""

    def forward(
        self,
        pixel_values: Optional[torch.FloatTensor] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        pixel_embeds: Optional[torch.FloatTensor] = None,
    ) -> Union[Tuple, BaseModelOutputWithPooling]:
        """
        Args:
            pixel_values: Input images (batch, channels, height, width).
            output_hidden_states: Whether to return all hidden states.
            return_dict: Whether to return a ModelOutput.
            pixel_embeds: Pre-computed embeddings (skip embedding layer).
        """
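The idea behind resize_pos_embeddings can be illustrated with a short sketch of bicubic resampling of a positional-embedding table. This is an assumption-laden illustration, not the method's actual body: the helper name interpolate_pos_embed, the layout (CLS slot first, followed by a square grid of patch positions), and the toy sizes are hypothetical.

```python
import torch
import torch.nn.functional as F


def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resample a (1, 1 + old_grid**2, dim) table to a new square grid size."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(round(patch_pos.shape[1] ** 0.5))
    dim = patch_pos.shape[-1]
    # (1, N, dim) -> (1, dim, old_grid, old_grid) so it can be treated as an image.
    grid = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(
        grid.float(), size=(new_grid, new_grid), mode="bicubic", align_corners=False
    )
    # Back to (1, new_grid**2, dim) in the original dtype.
    patch_pos = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    patch_pos = patch_pos.to(pos_embed.dtype)
    # The CLS position is not part of the spatial grid and is kept unchanged.
    return torch.cat([cls_pos, patch_pos], dim=1)


pos = torch.randn(1, 1 + 16 * 16, 64)     # table for a 16x16 patch grid
resized = interpolate_pos_embed(pos, 32)  # table for a 32x32 patch grid
```

Only the spatial part of the table is interpolated; the CLS entry is copied through, so the resized table still lines up with a sequence of one CLS token plus the new number of patches.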

Import

from tinychat.models.internvl.internvit import InternVisionModel

I/O Contract

Inputs

Name	Type	Required	Description
pixel_values	torch.FloatTensor	Yes, unless pixel_embeds is given	Input images of shape (batch, channels, height, width)
output_hidden_states	bool	No	Whether to return hidden states from all layers
return_dict	bool	No	Whether to return a BaseModelOutputWithPooling instead of a plain tuple
pixel_embeds	torch.FloatTensor	No (required if pixel_values is omitted)	Pre-computed embeddings; when given, the embedding layer is skipped

Outputs

Name	Type	Description
last_hidden_state	torch.FloatTensor	Final encoder output of shape (batch, seq_len, hidden_size)
pooler_output	torch.FloatTensor	CLS token output of shape (batch, hidden_size)
hidden_states	Tuple[torch.FloatTensor]	All layer hidden states (if output_hidden_states=True)
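The seq_len in the output shapes follows from the patch grid plus one CLS token. The arithmetic can be sketched as below; the helper name expected_output_shapes is hypothetical, and the patch size of 14 is an assumption (the hidden size of 3200 and image size of 448 are the values used in the usage example on this page).

```python
def expected_output_shapes(batch: int, image_size: int, patch_size: int, hidden_size: int) -> dict:
    """Compute the expected output shapes for a square image and square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid = image_size // patch_size  # patches per side
    seq_len = grid * grid + 1        # all patches plus the CLS token
    return {
        "last_hidden_state": (batch, seq_len, hidden_size),
        "pooler_output": (batch, hidden_size),
    }


# Assumed values: patch_size=14 is not stated on this page.
shapes = expected_output_shapes(batch=1, image_size=448, patch_size=14, hidden_size=3200)
```

Under these assumptions a 448x448 image gives a 32x32 patch grid, so last_hidden_state has 1025 positions per image.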

Usage Examples

Forward Pass

import torch
from tinychat.models.internvl.internvit import InternVisionModel
from tinychat.models.internvl.configuration_internvl import InternVisionConfig

# Initialize model (randomly initialized weights); eval() disables DropPath
config = InternVisionConfig(image_size=448, use_flash_attn=True)
vision_model = InternVisionModel(config).cuda().half().eval()

# Process images without tracking gradients
pixel_values = torch.randn(1, 3, 448, 448, device='cuda', dtype=torch.float16)
with torch.no_grad():
    output = vision_model(pixel_values=pixel_values)

# Access features
features = output.last_hidden_state  # (1, num_patches+1, 3200)
cls_token = output.pooler_output      # (1, 3200)
