Principle:Mit han lab Llm awq Vision Transformer Encoding
| Knowledge Sources | |
|---|---|
| Domains | Vision, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Encoding images into dense feature embeddings with a Vision Transformer (ViT), which tokenizes the image into fixed-size patches and processes them as a sequence.
Description
Vision Transformer Encoding divides an input image into fixed-size patches, projects each patch into an embedding vector, adds positional encodings, and processes the resulting sequence through a stack of transformer encoder layers. The output is a sequence of patch features plus a classification (CLS) token. Notable refinements include dynamic resolution support (the learned positional embeddings are interpolated to match the patch grid of the input), Flash Attention for faster, more memory-efficient attention, and QK normalization (normalizing queries and keys before the attention dot product) for training stability.
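To make the dynamic resolution idea concrete, the following is a minimal sketch of positional embedding interpolation in NumPy. The function name `interpolate_pos_embed` and its arguments are hypothetical and are not taken from any particular codebase. Implementations commonly use bicubic resizing for this step; the sketch swaps in bilinear interpolation to stay dependency-light, and it assumes a square patch grid with any CLS embedding handled separately by the caller.

```python
import numpy as np

def interpolate_pos_embed(pos_embed, old_grid, new_grid):
    """Bilinearly resize a (old_grid*old_grid, D) table of patch positional
    embeddings to (new_grid*new_grid, D).

    Hypothetical helper for illustration; assumes a square grid and no CLS row.
    """
    d = pos_embed.shape[1]
    grid = pos_embed.reshape(old_grid, old_grid, d)
    # Sample coordinates in the old grid, aligned at the corners.
    coords = np.linspace(0.0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo
    # Interpolate along the row axis, then along the column axis.
    rows = grid[lo] * (1 - frac)[:, None, None] + grid[hi] * frac[:, None, None]
    out = rows[:, lo] * (1 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]
    return out.reshape(new_grid * new_grid, d)

# A 4x4 grid of 2-dimensional embeddings resized to an 8x8 grid.
pos = np.arange(4 * 4 * 2, dtype=float).reshape(16, 2)
resized = interpolate_pos_embed(pos, old_grid=4, new_grid=8)
print(resized.shape)  # (64, 2)
```

With corner-aligned sampling, the four corner embeddings are preserved exactly, and resizing to the original grid size returns the table unchanged.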
Usage
Apply this principle when a model needs to extract visual features from images for downstream tasks such as classification, visual question answering, or multimodal generation.
Theoretical Basis
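Given an image of size H x W (with H and W divisible by P), it is divided into patches of size P x P, yielding N = (H/P) * (W/P) patches. Each patch is flattened and linearly projected, a CLS token is prepended, and positional embeddings are added:

$$z_0 = [\,x_{\text{cls}};\ x_p^1 E;\ x_p^2 E;\ \dots;\ x_p^N E\,] + E_{\text{pos}}$$

where x_p^i is the flattened i-th patch, x_cls is the learned CLS token, E is the patch projection matrix, and E_pos are positional embeddings. The sequence z_0 then passes through L transformer layers, each consisting of a self-attention block and an MLP block.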
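The patch projection step can be sketched in NumPy as follows. This is an illustrative sketch, not a reference implementation: the function name `patch_embed`, the channels-last image layout, and the sizes used (224 x 224 x 3 input, P = 16, embedding width 64) are assumptions chosen for the example, and the CLS token and positional embeddings are zero placeholders standing in for learned parameters.

```python
import numpy as np

def patch_embed(image, patch, proj, cls_token, pos_embed):
    """Build the embedded input sequence for a ViT encoder.

    image: (H, W, C); proj: (patch*patch*C, D); cls_token: (D,);
    pos_embed: (N + 1, D). Returns an array of shape (N + 1, D).
    Hypothetical helper for illustration.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by P"
    gh, gw = h // patch, w // patch
    # Split into a (gh, gw) grid of P x P x C patches, then flatten each patch.
    patches = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(gh * gw, patch * patch * c)
    tokens = patches @ proj  # linear projection, shape (N, D)
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0)  # prepend CLS
    return tokens + pos_embed  # add positional embeddings

rng = np.random.default_rng(0)
H, W, C, P, D = 224, 224, 3, 16, 64  # assumed sizes for the example
N = (H // P) * (W // P)
z0 = patch_embed(
    rng.standard_normal((H, W, C)),
    P,
    rng.standard_normal((P * P * C, D)) * 0.02,
    np.zeros(D),           # placeholder for a learned CLS token
    np.zeros((N + 1, D)),  # placeholder for learned positional embeddings
)
print(N, z0.shape)  # 196 (197, 64)
```

For a 224 x 224 image with 16 x 16 patches, N = 14 * 14 = 196, so the sequence entering the transformer layers has 197 tokens once the CLS token is included.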