Implementation:Mit han lab Llm awq InternVisionModel

Knowledge Sources
Domains: Vision, Model_Architecture
Last Updated: 2026-02-15 00:00 GMT

Overview

Concrete tool, provided by the tinychat framework, for encoding images into visual feature embeddings with the InternViT (Intern Vision Transformer) architecture.

Description

InternVisionModel implements a standard Vision Transformer (ViT) design. InternVisionEmbeddings converts images into patch embeddings and adds positional encodings, which are resized by bicubic interpolation to support dynamic resolution. InternVisionEncoder then passes the embeddings through a stack of InternVisionEncoderLayer blocks. Each encoder layer applies multi-head attention (optionally with Flash Attention 2 and QK normalization), an MLP, layer scaling, and stochastic depth (DropPath). Normalization uses either InternRMSNorm or standard LayerNorm, with optional Apex FusedRMSNorm acceleration. The FlashAttention module provides optimized scaled dot-product attention.
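
The way normalization, layer scaling, and stochastic depth combine around a sublayer can be illustrated with a minimal pure-Python sketch. It assumes a pre-norm residual layout, x + DropPath(scale * sublayer(norm(x))). The function names (rms_norm, drop_path, residual_step) are hypothetical and operate on plain lists for a single token, so this is an illustration of the idea rather than the tinychat implementation.

```python
import math
import random


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-channel weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for w, v in zip(weight, x)]


def drop_path(branch, drop_prob, training, rng=random):
    """Stochastic depth for one sample: drop the whole branch with probability
    drop_prob, otherwise rescale it by 1 / keep_prob to preserve the expectation."""
    if not training or drop_prob == 0.0:
        return list(branch)
    keep_prob = 1.0 - drop_prob
    if rng.random() >= keep_prob:
        return [0.0] * len(branch)
    return [v / keep_prob for v in branch]


def residual_step(x, sublayer, norm_weight, layer_scale, drop_prob=0.0, training=False):
    """Assumed layout: x + drop_path(layer_scale * sublayer(rms_norm(x)))."""
    branch = sublayer(rms_norm(x, norm_weight))
    scaled = [g * b for g, b in zip(layer_scale, branch)]
    dropped = drop_path(scaled, drop_prob, training)
    return [xi + di for xi, di in zip(x, dropped)]


# Hypothetical usage: an identity "sublayer" stands in for attention or the MLP.
token = [3.0, 4.0]
out = residual_step(token, lambda h: h, norm_weight=[1.0, 1.0], layer_scale=[0.1, 0.1])
print(out)
```

A small layer-scale value keeps each block close to the identity early in training, and DropPath is only active when training is True, so inference is deterministic.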

Usage

Import InternVisionModel when building the vision encoder component for InternVL3 multimodal models. This class is typically instantiated by the InternVL3 model class and is not used directly by end users.

Code Reference

Source Location

Signature

class InternVisionModel(PreTrainedModel):
    main_input_name = 'pixel_values'
    _supports_flash_attn_2 = True
    supports_gradient_checkpointing = True
    config_class = InternVisionConfig
    _no_split_modules = ['InternVisionEncoderLayer']

    def __init__(self, config: InternVisionConfig):
        """
        Args:
            config: InternVisionConfig with vision transformer parameters.
        """

    def resize_pos_embeddings(self, old_size, new_size, patch_size) -> None:
        """Resize positional embeddings for different image resolutions."""

    def get_input_embeddings(self) -> InternVisionEmbeddings:
        """Return the patch embedding layer."""

    def forward(
        self,
        pixel_values: Optional[torch.FloatTensor] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        pixel_embeds: Optional[torch.FloatTensor] = None,
    ) -> Union[Tuple, BaseModelOutputWithPooling]:
        """
        Args:
            pixel_values: Input images (batch, channels, height, width).
            output_hidden_states: Whether to return all hidden states.
            return_dict: Whether to return a ModelOutput.
            pixel_embeds: Pre-computed embeddings (skip embedding layer).
        """

Import

from tinychat.models.internvl.internvit import InternVisionModel

I/O Contract

Inputs

Name | Type | Required | Description
pixel_values | torch.FloatTensor | Yes (unless pixel_embeds is given) | Input images of shape (batch, channels, height, width)
output_hidden_states | bool | No | Whether to return hidden states from all layers
return_dict | bool | No | Whether to return a BaseModelOutputWithPooling instead of a plain tuple
pixel_embeds | torch.FloatTensor | No (alternative to pixel_values) | Pre-computed embeddings; when given, the embedding layer is skipped
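
The relationship between pixel_values and pixel_embeds can be sketched as a small input-selection check. This is a hedged illustration working on shape tuples only: select_vision_input is a hypothetical helper, and both the precedence of pixel_embeds and the exact error conditions are assumptions rather than documented behavior.

```python
def select_vision_input(pixel_values_shape=None, pixel_embeds_shape=None):
    """Return the name of the input the encoder would consume.

    Assumptions: at least one input must be given, pixel_embeds takes
    precedence when both are present, and pixel_values must be 4D
    (batch, channels, height, width).
    """
    if pixel_values_shape is None and pixel_embeds_shape is None:
        raise ValueError("specify either pixel_values or pixel_embeds")
    if pixel_embeds_shape is not None:
        # Pre-computed embeddings bypass the patch embedding layer.
        return "pixel_embeds"
    if len(pixel_values_shape) != 4:
        raise ValueError(f"expected a 4D pixel_values tensor, got shape {pixel_values_shape}")
    return "pixel_values"


print(select_vision_input(pixel_values_shape=(1, 3, 448, 448)))
print(select_vision_input(pixel_embeds_shape=(1, 1025, 3200)))
```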

Outputs

Name | Type | Description
last_hidden_state | torch.FloatTensor | Final encoder output of shape (batch, seq_len, hidden_size)
pooler_output | torch.FloatTensor | CLS token output of shape (batch, hidden_size)
hidden_states | Tuple[torch.FloatTensor] | All layer hidden states (if output_hidden_states=True)
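
The output shapes follow from the patch grid, which a short calculation makes concrete. The sketch below assumes seq_len is the number of patches plus one CLS position and that partial patches at the image border are discarded; vision_output_shapes is a hypothetical helper, and patch_size=14 is an assumed value, not one stated on this page.

```python
def vision_output_shapes(batch, height, width, patch_size, hidden_size):
    """Expected output shapes for a ViT-style encoder with one CLS token."""
    # Floor division: pixels that do not fill a whole patch are ignored (assumption).
    num_patches = (height // patch_size) * (width // patch_size)
    seq_len = num_patches + 1  # one extra position for the CLS token
    return {
        "last_hidden_state": (batch, seq_len, hidden_size),
        "pooler_output": (batch, hidden_size),
    }


# 448 // 14 = 32 patches per side, so 32 * 32 + 1 = 1025 positions.
shapes = vision_output_shapes(batch=1, height=448, width=448, patch_size=14, hidden_size=3200)
print(shapes["last_hidden_state"])  # (1, 1025, 3200)
print(shapes["pooler_output"])      # (1, 3200)
```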

Usage Examples

Forward Pass

import torch
from tinychat.models.internvl.internvit import InternVisionModel
from tinychat.models.internvl.configuration_internvl import InternVisionConfig

# Initialize model
config = InternVisionConfig(image_size=448, use_flash_attn=True)
vision_model = InternVisionModel(config).cuda().half()

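# Process images (inference only: disable dropout/DropPath and gradient tracking)
vision_model.eval()
pixel_values = torch.randn(1, 3, 448, 448, device='cuda', dtype=torch.float16)
with torch.no_grad():
    output = vision_model(pixel_values=pixel_values)

# Access features (hidden size is 3200 for this configuration)
features = output.last_hidden_state  # (1, num_patches + 1, 3200)
cls_token = output.pooler_output     # (1, 3200)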
