Principle:Mit han lab Llm awq Vision Transformer Encoding

Knowledge Sources	ViT InternVL
Domains	Vision, Deep_Learning
Last Updated	2026-02-15 00:00 GMT

Overview

Principle of encoding images into dense feature embeddings using Vision Transformer (ViT) architectures with patch-based tokenization.

Description

Vision Transformer Encoding divides an input image into fixed-size patches, projects each patch into an embedding vector, adds positional encodings, and processes the resulting sequence through transformer encoder layers. The output is a sequence of patch features plus a classification (CLS) token. Key refinements include dynamic resolution support via positional embedding interpolation, Flash Attention for efficient attention computation, and QK normalization (normalizing queries and keys before the attention product) for training stability.
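The positional embedding interpolation mentioned above can be sketched as follows. This is a minimal NumPy illustration, not the encoder's actual code: the function names `resize_grid_bilinear` and `interpolate_pos_embed` are hypothetical, the grid sizes are example values, and bilinear interpolation is used here for brevity (real implementations may use a different mode, such as bicubic). The idea is to reshape the patch positional embeddings into their 2D grid, resize the grid to the new patch layout, and leave the CLS embedding untouched.

```python
import numpy as np

def resize_grid_bilinear(grid, new_h, new_w):
    """Bilinearly resize a (h, w, D) grid to (new_h, new_w, D), corner-aligned."""
    h, w, _ = grid.shape
    ys = np.linspace(0, h - 1, new_h)   # fractional source rows
    xs = np.linspace(0, w - 1, new_w)   # fractional source columns
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]       # vertical blend weights
    wx = (xs - x0)[None, :, None]       # horizontal blend weights
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bot = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def interpolate_pos_embed(E_pos, old_grid, new_grid):
    """Resize patch positional embeddings; the CLS embedding is kept unchanged."""
    cls_pos, patch_pos = E_pos[:1], E_pos[1:]
    D = E_pos.shape[1]
    grid = patch_pos.reshape(old_grid[0], old_grid[1], D)
    resized = resize_grid_bilinear(grid, new_grid[0], new_grid[1])
    return np.concatenate([cls_pos, resized.reshape(-1, D)], axis=0)

# Example values (assumed): a 14 x 14 patch grid resized to 32 x 32.
rng = np.random.default_rng(0)
E_pos = rng.standard_normal((1 + 14 * 14, 64))
E_pos_new = interpolate_pos_embed(E_pos, (14, 14), (32, 32))
print(E_pos_new.shape)  # (1025, 64)
```

Because the embeddings are resized rather than relearned, the same encoder weights can be applied to images whose patch grid differs from the one used in training.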

Usage

Apply this principle when a model needs to extract visual features from images for downstream tasks such as classification, visual question answering, or multimodal generation.

Theoretical Basis

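An image of size H x W with C channels is divided into non-overlapping patches of size P x P, yielding N = (H/P) * (W/P) patches (assuming H and W are divisible by P). Each patch is flattened into a vector x_i of length P * P * C and linearly projected: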

$z_{0} = [x_{\mathrm{cls}};\ x_{1} E;\ x_{2} E;\ \ldots;\ x_{N} E] + E_{\mathrm{pos}}$

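where E is the patch projection matrix, x_cls is the learnable classification token, and E_pos are the positional embeddings (one per token, N + 1 in total). The sequence z_0 then passes through L transformer encoder layers, each consisting of a self-attention block and an MLP block.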
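The construction of z_0 can be illustrated with a short NumPy sketch. The helper names `patchify` and `embed`, the image size, the embedding width D = 64, and the random initial values are all assumptions chosen for illustration. In a trained model, E, x_cls, and E_pos are learned parameters, and the projection is often implemented as a strided convolution rather than an explicit matrix product.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into N flattened patches of length P*P*C."""
    H, W, C = image.shape
    P = patch_size
    if H % P or W % P:
        raise ValueError("H and W must be divisible by the patch size")
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)

def embed(image, patch_size, E, x_cls, E_pos):
    """Build z_0 = [x_cls; x_1 E; ...; x_N E] + E_pos."""
    x = patchify(image, patch_size)                         # (N, P*P*C)
    tokens = x @ E                                          # (N, D)
    z0 = np.concatenate([x_cls[None, :], tokens], axis=0)   # (N+1, D)
    return z0 + E_pos

# Example values (assumed): a 224 x 224 RGB image with 16 x 16 patches.
rng = np.random.default_rng(0)
H, W, C, P, D = 224, 224, 3, 16, 64
N = (H // P) * (W // P)                                     # 14 * 14 = 196
image = rng.standard_normal((H, W, C))
E = rng.standard_normal((P * P * C, D)) * 0.02
x_cls = np.zeros(D)
E_pos = rng.standard_normal((N + 1, D)) * 0.02

z0 = embed(image, P, E, x_cls, E_pos)
print(z0.shape)  # (197, 64)
```

The first row of z0 corresponds to the CLS token and the remaining N rows to the patches in row-major order.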
