Principle:Lm sys FastChat Vision Image Processing
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Vision Image Processing |
| Repository | lm-sys/FastChat |
| Workflow | Vision_Serving |
| Domains | Vision, Image_Processing |
| Knowledge Sources | fastchat/utils.py, fastchat/serve/vision/image.py |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This principle describes the standardized approach to image encoding, decoding, format conversion, and resizing required for serving vision-language models. As multimodal models that accept both text and images become increasingly common, a consistent image preprocessing interface ensures reliable behavior across different model backends, transport protocols, and client implementations.
Description
ImageFormat Enumeration
A well-defined set of supported image formats is represented as an enumeration (e.g., PNG, JPEG, WEBP, GIF). Each format has distinct characteristics relevant to vision-language model serving:
- PNG: Lossless compression, supports transparency. Suitable for images where pixel-exact fidelity is required, but produces larger payloads.
- JPEG: Lossy compression with configurable quality. Produces smaller payloads and is the most common format for photographic content.
- WEBP: Modern format supporting both lossy and lossless compression, typically producing smaller files than JPEG or PNG at comparable quality. Increasingly supported by browsers and image-processing libraries.
- GIF: Supports animation and transparency. Relevant for multi-frame inputs, though most vision models process only the first frame.
Using an enumeration rather than raw strings helps prevent format mismatches and typos, and lets unsupported formats be rejected early with a clear error.
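As a minimal sketch of the idea, the enumeration below models the four formats listed above. The class layout, the `mime_type` property, and the `parse` helper are hypothetical names chosen for illustration; they are not taken from the FastChat source and may differ from what `fastchat/serve/vision/image.py` defines.

```python
from enum import Enum


class ImageFormat(Enum):
    """Hypothetical enumeration of supported image formats (illustrative only)."""

    PNG = "png"
    JPEG = "jpeg"
    WEBP = "webp"
    GIF = "gif"

    @property
    def mime_type(self) -> str:
        # Each member maps to an "image/<subtype>" MIME type.
        return f"image/{self.value}"

    @classmethod
    def parse(cls, name: str) -> "ImageFormat":
        # Normalize case and whitespace, and accept the common "jpg" alias.
        normalized = name.strip().lower()
        if normalized == "jpg":
            normalized = "jpeg"
        try:
            return cls(normalized)
        except ValueError:
            # Reject anything outside the enumeration with a clear message.
            raise ValueError(f"Unsupported image format: {name!r}") from None
```

With this shape, a misspelled or unsupported format fails at the point where the string is parsed rather than deep inside the preprocessing pipeline.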
Base64 Encoding and Decoding
Images transmitted over HTTP APIs and WebSocket connections are typically base64-encoded to embed binary image data within JSON payloads. The encoding/decoding pipeline follows a standard flow:
Image bytes -> base64 encode -> JSON string (transmission) -> base64 decode -> Image bytes
Data URIs (e.g., data:image/png;base64,iVBOR...) provide a self-describing format that embeds both the MIME type and the encoded data in a single string. The processing pipeline must parse these URIs to extract both the format metadata and the raw image data.
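The round trip described above can be sketched with the Python standard library. The function names `encode_image_to_data_uri` and `decode_data_uri` are assumptions made for this example, not functions confirmed to exist in FastChat, and the parser only handles the `data:<mime>;base64,<payload>` shape.

```python
import base64
from typing import Tuple


def encode_image_to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Wrap raw image bytes in a base64 data URI (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def decode_data_uri(data_uri: str) -> Tuple[str, bytes]:
    """Split a base64 data URI into (mime_type, raw_bytes) (hypothetical helper)."""
    header, sep, payload = data_uri.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a URI of the form data:<mime>;base64,<payload>")
    mime_type = header[len("data:"):-len(";base64")]
    # validate=True rejects characters outside the base64 alphabet.
    return mime_type, base64.b64decode(payload, validate=True)
```

Because the data URI is an ordinary ASCII string, it can be placed directly in a JSON field, for example `json.dumps({"image": uri})`, and recovered unchanged on the receiving side before decoding.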
Format Detection from Data URIs
When an image arrives as a data URI, the format is determined by parsing the MIME type prefix (e.g., image/png, image/jpeg). This detection is essential because different vision model backends may require images in specific formats, necessitating format conversion before inference. Robust format detection also handles edge cases such as missing MIME types (defaulting to JPEG) or non-standard URI schemes.
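A minimal sketch of this detection step follows. The name `detect_format` and the regular expression are illustrative assumptions; the JPEG fallback for a missing MIME type mirrors the policy described above and is a pipeline choice rather than behavior mandated by the data URI scheme itself.

```python
import re

# Matches "data:", an optional "<type>/<subtype>", optional ";param" segments,
# and the comma that separates the header from the payload.
_DATA_URI_HEADER = re.compile(
    r"^data:(?P<mime>[\w.+-]+/[\w.+-]+)?(?:;[^,]*)?,", re.IGNORECASE
)


def detect_format(data_uri: str, default: str = "jpeg") -> str:
    """Return the lowercase image subtype declared in a data URI (hypothetical helper)."""
    match = _DATA_URI_HEADER.match(data_uri)
    if match is None:
        raise ValueError("not a data URI")
    mime = match.group("mime")
    if not mime:
        # Missing MIME type: fall back to the pipeline default.
        return default
    main_type, _, subtype = mime.lower().partition("/")
    if main_type != "image":
        raise ValueError(f"not an image MIME type: {mime}")
    # Treat the non-standard "image/jpg" as JPEG.
    return "jpeg" if subtype == "jpg" else subtype
```

The returned subtype can then be compared against the format a given backend expects to decide whether conversion is needed before inference.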
Image Resizing with Aspect Ratio Preservation
Vision-language models typically expect images at specific resolutions (e.g., 336x336 for models built on the CLIP ViT-L/14 336px encoder, 448x448 for some other vision encoders). Images must be resized to meet these requirements while preserving the original aspect ratio to avoid distortion:
- The image is scaled so that its longest (or shortest, depending on the model's requirements) side matches the target dimension.
- Padding or center-cropping may be applied to achieve an exact square resolution if required.
- Downsampling uses high-quality interpolation (e.g., Lanczos or bicubic) to minimize aliasing artifacts.
This preprocessing step is critical because vision transformers are sensitive to input resolution, and aspect-ratio distortion can degrade recognition performance.
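The geometry of the resizing steps above can be sketched without an imaging library. The helpers `fit_to_target` and `square_padding` are hypothetical names for this example; they compute only the output dimensions and padding, and the actual pixel resampling is left to an imaging library.

```python
from typing import Tuple


def fit_to_target(width: int, height: int, target: int, match: str = "longest") -> Tuple[int, int]:
    """Scale (width, height) so the chosen side equals target, keeping aspect ratio."""
    if width <= 0 or height <= 0 or target <= 0:
        raise ValueError("width, height, and target must be positive")
    if match not in ("longest", "shortest"):
        raise ValueError("match must be 'longest' or 'shortest'")
    side = max(width, height) if match == "longest" else min(width, height)
    # Multiply before dividing to limit floating-point error, and never round to 0.
    new_width = max(1, round(width * target / side))
    new_height = max(1, round(height * target / side))
    return new_width, new_height


def square_padding(width: int, height: int, target: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) padding that centers the image in a target square."""
    if width > target or height > target:
        raise ValueError("image must already fit within the target square")
    pad_w, pad_h = target - width, target - height
    left, top = pad_w // 2, pad_h // 2
    return left, top, pad_w - left, pad_h - top
```

For a 1920x1080 input and a 336 target, matching the longest side gives 336x189, which then needs 73 and 74 pixels of padding above and below to reach 336x336. With Pillow, one option for the resampling step itself is `Image.resize(size, resample=Image.Resampling.LANCZOS)`.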
Theoretical Basis
Vision-language models require images in specific formats and resolutions; standardizing image preprocessing into a common interface ensures consistent behavior across different vision model backends and transport protocols (HTTP, WebSocket). The theoretical motivation draws from two domains.
First, from computer vision: image preprocessing (normalization, resizing, format conversion) has long been recognized as a critical pipeline stage that can significantly impact model performance. Inconsistent preprocessing is a common source of accuracy degradation in production systems.
Second, from software architecture: the adapter pattern provides a uniform interface over heterogeneous inputs. By abstracting image format handling, encoding, and resizing into a standardized module, the serving system can support diverse client implementations (web browsers, mobile apps, API clients) and model backends (CLIP, LLaVA, CogVLM) without duplicating preprocessing logic. This separation of concerns improves maintainability and reduces the surface area for format-related bugs.