Heuristic: OpenAI CLIP FP16/FP32 Dtype Handling
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Debugging |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
CLIP models use fp16 on GPU for memory efficiency but must be cast to fp32 on CPU to avoid numerical errors; LayerNorm internally computes in fp32 regardless of input dtype.
Description
CLIP's released checkpoints store weights in fp16 (half precision), and `build_model()` calls `convert_weights()` to cast the rebuilt model's parameters back to fp16. This halves GPU memory usage. However, CPU execution requires fp32, because many CPU kernels do not support fp16 natively. The `clip.load()` function handles this automatically: when `device="cpu"`, it calls `model.float()` to upcast all parameters. Additionally, the custom `LayerNorm` subclass in `model.py` always casts its input to fp32 before computing normalization, then casts back to the original dtype, preventing numerical instability in half-precision normalization.
Usage
Be aware of this heuristic when transferring features between GPU and CPU, mixing CLIP with other models, or fine-tuning. If you move a GPU-loaded CLIP model to CPU manually (without using `clip.load()`), you must call `model.float()` yourself. When extracting features for downstream use (e.g., linear probes), the features will be in fp16 on GPU and should be cast to fp32 before passing to numpy or scikit-learn.
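As a minimal sketch of the downstream-use advice, the snippet below uses numpy stand-ins for GPU-extracted features (the variable names are illustrative; with torch tensors the idiom is `features.float().cpu().numpy()`). It also shows why the cast matters: fp16 has only 11 significand bits, so precision degrades quickly.

```python
import numpy as np

# Stand-in for fp16 features extracted from a GPU-loaded CLIP model.
feats_fp16 = np.random.default_rng(0).standard_normal((4, 512)).astype(np.float16)

# Cast to fp32 before handing the array to numpy-based tooling or scikit-learn.
feats_fp32 = feats_fp16.astype(np.float32)
assert feats_fp32.dtype == np.float32

# fp16 represents integers exactly only up to 2048; above that, values round.
assert float(np.float16(2049)) == 2048.0  # rounds to the nearest representable
```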
The Insight (Rule of Thumb)
- Action: Let `clip.load()` handle dtype automatically. On GPU, features are fp16; on CPU, fp32. Always cast to fp32 before numpy conversion.
- Value: fp16 on GPU (~50% memory savings), fp32 on CPU (numerical stability).
- Trade-off: fp16 saves memory and can be faster on modern GPUs, but has lower numerical precision (not an issue for CLIP inference).
- Gotcha: The `encode_image()` method casts input to the model's dtype: `image.type(self.dtype)`. If you pass fp32 images to a GPU model, they are auto-cast to fp16.
Reasoning
Half-precision reduces memory footprint, allowing larger batch sizes during feature extraction. The `convert_weights()` function explicitly converts all Conv, Linear, MultiheadAttention, and projection parameters to fp16. The custom `LayerNorm` prevents the known issue where fp16 layer normalization produces NaN values by temporarily computing in fp32. On CPU, PyTorch's fp16 support is limited (many operations fall back to fp32 anyway), so the upfront cast avoids silent performance degradation.
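The memory/range trade-off described above can be demonstrated with plain numpy (a sketch, not CLIP code): fp16 halves the byte footprint of a batch, but its largest finite value is 65504, so squaring moderately large intermediates, as a naive fp16 variance computation would, overflows to inf.

```python
import numpy as np

# fp16 halves the memory footprint of a batch of 224x224 images.
batch_fp32 = np.zeros((8, 3, 224, 224), dtype=np.float32)
batch_fp16 = batch_fp32.astype(np.float16)
assert batch_fp16.nbytes == batch_fp32.nbytes // 2

# But fp16's largest finite value is 65504, so even modest squares overflow:
with np.errstate(over="ignore"):
    assert np.isinf(np.float16(300) * np.float16(300))  # 90000 > 65504 -> inf
```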
Code Evidence
Automatic fp32 cast on CPU from `clip/clip.py:138-141`:
```python
if not jit:
    model = build_model(state_dict or model.state_dict()).to(device)
    if str(device) == "cpu":
        model.float()
```
`convert_weights()` fp16 conversion from `clip/model.py:375-396`:
```python
def convert_weights(model: nn.Module):
    """Convert applicable model parameters to fp16"""

    def _convert_weights_to_fp16(l):
        if isinstance(l, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            l.weight.data = l.weight.data.half()
            if l.bias is not None:
                l.bias.data = l.bias.data.half()

        if isinstance(l, nn.MultiheadAttention):
            for attr in [*[f"{s}_proj_weight" for s in ["in", "q", "k", "v"]], "in_proj_bias", "bias_k", "bias_v"]:
                tensor = getattr(l, attr)
                if tensor is not None:
                    tensor.data = tensor.data.half()

        for name in ["text_projection", "proj"]:
            if hasattr(l, name):
                attr = getattr(l, name)
                if attr is not None:
                    attr.data = attr.data.half()

    model.apply(_convert_weights_to_fp16)
```
Custom LayerNorm for fp16 safety from `clip/model.py:157-163`:
```python
class LayerNorm(nn.LayerNorm):
    """Subclass torch's LayerNorm to handle fp16."""

    def forward(self, x: torch.Tensor):
        orig_type = x.dtype
        ret = super().forward(x.type(torch.float32))
        return ret.type(orig_type)
```
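A minimal numpy sketch (not the torch implementation; function names are illustrative) of why this fp32 round-trip matters: normalizing large fp16 activations entirely in fp16 overflows the variance to inf, collapsing the output, while the compute-in-fp32-then-cast-back pattern stays correct.

```python
import numpy as np

def layernorm_naive_fp16(x, eps=np.float16(1e-5)):
    # Normalizes entirely in fp16; the squared deviations can overflow to inf.
    mu = x.mean()
    var = ((x - mu) ** 2).mean()
    return (x - mu) / np.sqrt(var + eps)

def layernorm_fp32_safe(x, eps=1e-5):
    # Mirrors CLIP's LayerNorm: compute in fp32, cast back to the input dtype.
    x32 = x.astype(np.float32)
    mu = x32.mean()
    var = ((x32 - mu) ** 2).mean()
    return ((x32 - mu) / np.sqrt(var + eps)).astype(x.dtype)

x = np.array([30000.0, -30000.0], dtype=np.float16)
with np.errstate(over="ignore"):
    bad = layernorm_naive_fp16(x)   # variance overflows -> output collapses to 0
good = layernorm_fp32_safe(x)       # correct +/-1, still fp16 on the way out

assert float(bad[0]) == 0.0
assert abs(float(good[0]) - 1.0) < 0.01
assert good.dtype == np.float16
```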
Image auto-cast in encode_image from `clip/model.py:340-341`:
```python
def encode_image(self, image):
    return self.visual(image.type(self.dtype))
```
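The gotcha from the rule of thumb can be sketched in numpy (an illustration of the cast, not CLIP code; `encode_image_sketch` and `model_dtype` are hypothetical names): whatever dtype you pass in, the input is silently cast to the model's dtype.

```python
import numpy as np

model_dtype = np.float16  # stands in for self.dtype on a GPU-loaded model

def encode_image_sketch(image):
    # Mirrors `image.type(self.dtype)`: the input is cast to the model dtype
    # before entering the visual tower (the tower itself is elided here).
    return image.astype(model_dtype)

img_fp32 = np.full((1, 3, 224, 224), 1 / 3, dtype=np.float32)
out = encode_image_sketch(img_fp32)
assert out.dtype == np.float16  # fp32 input silently downcast on GPU models
```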