Heuristic: OpenAI CLIP FP16/FP32 Dtype Handling
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Debugging |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
CLIP models use fp16 on GPU for memory efficiency but must be cast to fp32 on CPU to avoid numerical errors; LayerNorm internally computes in fp32 regardless of input dtype.
Description
CLIP's released checkpoints store weights in fp16 (half precision), and `build_model()` calls `convert_weights()` to cast the rebuilt model's parameters back to fp16. This halves GPU memory usage. However, CPU execution requires fp32, because many CPU kernels do not support fp16 natively. The `clip.load()` function handles this automatically: when `device="cpu"`, it calls `model.float()` to upcast all parameters. Additionally, the custom `LayerNorm` subclass in `model.py` always casts its input to fp32 before computing normalization, then casts back to the original dtype, preventing numerical instability in half-precision normalization.
Usage
Be aware of this heuristic when transferring features between GPU and CPU, mixing CLIP with other models, or fine-tuning. If you move a GPU-loaded CLIP model to CPU manually (without using `clip.load()`), you must call `model.float()` yourself. When extracting features for downstream use (e.g., linear probes), the features will be in fp16 on GPU and should be cast to fp32 before passing to numpy or scikit-learn.
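As a minimal sketch of the downstream-use advice, the snippet below uses numpy stand-ins for GPU-extracted features (the variable names are illustrative; with torch tensors the idiom is `features.float().cpu().numpy()`). It also shows why the cast matters: fp16 has only 11 significand bits, so precision degrades quickly.

```python
import numpy as np

# Stand-in for fp16 features extracted from a GPU-loaded CLIP model.
feats_fp16 = np.random.default_rng(0).standard_normal((4, 512)).astype(np.float16)

# Cast to fp32 before handing the array to numpy-based tooling or scikit-learn.
feats_fp32 = feats_fp16.astype(np.float32)
assert feats_fp32.dtype == np.float32

# fp16 represents integers exactly only up to 2048; above that, values round.
assert float(np.float16(2049)) == 2048.0  # rounds to the nearest representable
```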
The Insight (Rule of Thumb)
- Action: Let `clip.load()` handle dtype automatically. On GPU, features are fp16; on CPU, fp32. Always cast to fp32 before numpy conversion.
- Value: fp16 on GPU (~50% memory savings), fp32 on CPU (numerical stability).
- Trade-off: fp16 saves memory and can be faster on modern GPUs, but has lower numerical precision (not an issue for CLIP inference).
- Gotcha: The `encode_image()` method casts input to the model's dtype: `image.type(self.dtype)`. If you pass fp32 images to a GPU model, they are auto-cast to fp16.
Reasoning
Half-precision reduces memory footprint, allowing larger batch sizes during feature extraction. The `convert_weights()` function explicitly converts all Conv, Linear, MultiheadAttention, and projection parameters to fp16. The custom `LayerNorm` prevents the known issue where fp16 layer normalization produces NaN values by temporarily computing in fp32. On CPU, PyTorch's fp16 support is limited (many operations fall back to fp32 anyway), so the upfront cast avoids silent performance degradation.
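The memory/range trade-off described above can be demonstrated with plain numpy (a sketch, not CLIP code): fp16 halves the byte footprint of a batch, but its largest finite value is 65504, so squaring moderately large intermediates, as a naive fp16 variance computation would, overflows to inf.

```python
import numpy as np

# fp16 halves the memory footprint of a batch of 224x224 images.
batch_fp32 = np.zeros((8, 3, 224, 224), dtype=np.float32)
batch_fp16 = batch_fp32.astype(np.float16)
assert batch_fp16.nbytes == batch_fp32.nbytes // 2

# But fp16's largest finite value is 65504, so even modest squares overflow:
with np.errstate(over="ignore"):
    assert np.isinf(np.float16(300) * np.float16(300))  # 90000 > 65504 -> inf
```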
Code Evidence
Automatic fp32 cast on CPU from `clip/clip.py:138-141`:
```python
if not jit:
    model = build_model(state_dict or model.state_dict()).to(device)
    if str(device) == "cpu":
        model.float()
```
`convert_weights()` fp16 conversion from `clip/model.py:375-396`:
```python
def convert_weights(model: nn.Module):
    """Convert applicable model parameters to fp16"""

    def _convert_weights_to_fp16(l):
        if isinstance(l, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            l.weight.data = l.weight.data.half()
            if l.bias is not None:
                l.bias.data = l.bias.data.half()

        if isinstance(l, nn.MultiheadAttention):
            for attr in [*[f"{s}_proj_weight" for s in ["in", "q", "k", "v"]], "in_proj_bias", "bias_k", "bias_v"]:
                tensor = getattr(l, attr)
                if tensor is not None:
                    tensor.data = tensor.data.half()

        for name in ["text_projection", "proj"]:
            if hasattr(l, name):
                attr = getattr(l, name)
                if attr is not None:
                    attr.data = attr.data.half()

    model.apply(_convert_weights_to_fp16)
```
Custom LayerNorm for fp16 safety from `clip/model.py:157-163`:
```python
class LayerNorm(nn.LayerNorm):
    """Subclass torch's LayerNorm to handle fp16."""

    def forward(self, x: torch.Tensor):
        orig_type = x.dtype
        ret = super().forward(x.type(torch.float32))
        return ret.type(orig_type)
```
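A minimal numpy sketch (not the torch implementation; function names are illustrative) of why this fp32 round-trip matters: normalizing large fp16 activations entirely in fp16 overflows the variance to inf, collapsing the output, while the compute-in-fp32-then-cast-back pattern stays correct.

```python
import numpy as np

def layernorm_naive_fp16(x, eps=np.float16(1e-5)):
    # Normalizes entirely in fp16; the squared deviations can overflow to inf.
    mu = x.mean()
    var = ((x - mu) ** 2).mean()
    return (x - mu) / np.sqrt(var + eps)

def layernorm_fp32_safe(x, eps=1e-5):
    # Mirrors CLIP's LayerNorm: compute in fp32, cast back to the input dtype.
    x32 = x.astype(np.float32)
    mu = x32.mean()
    var = ((x32 - mu) ** 2).mean()
    return ((x32 - mu) / np.sqrt(var + eps)).astype(x.dtype)

x = np.array([30000.0, -30000.0], dtype=np.float16)
with np.errstate(over="ignore"):
    bad = layernorm_naive_fp16(x)   # variance overflows -> output collapses to 0
good = layernorm_fp32_safe(x)       # correct +/-1, still fp16 on the way out

assert float(bad[0]) == 0.0
assert abs(float(good[0]) - 1.0) < 0.01
assert good.dtype == np.float16
```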
Image auto-cast in encode_image from `clip/model.py:340-341`:
```python
def encode_image(self, image):
    return self.visual(image.type(self.dtype))
```
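The gotcha from the rule of thumb can be sketched in numpy (an illustration of the cast, not CLIP code; `encode_image_sketch` and `model_dtype` are hypothetical names): whatever dtype you pass in, the input is silently cast to the model's dtype.

```python
import numpy as np

model_dtype = np.float16  # stands in for self.dtype on a GPU-loaded model

def encode_image_sketch(image):
    # Mirrors `image.type(self.dtype)`: the input is cast to the model dtype
    # before entering the visual tower (the tower itself is elided here).
    return image.astype(model_dtype)

img_fp32 = np.full((1, 3, 224, 224), 1 / 3, dtype=np.float32)
out = encode_image_sketch(img_fp32)
assert out.dtype == np.float16  # fp32 input silently downcast on GPU models
```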