Principle:Lm sys FastChat Model Weight Delta Distribution
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Model Weight Delta Distribution |
| Repository | lm-sys/FastChat |
| Workflow | Model_Release |
| Domains | Model_Distribution, Licensing |
| Knowledge Sources | fastchat/model/apply_delta.py, fastchat/model/make_delta.py |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This principle addresses the technique of distributing fine-tuned model weights as deltas (differences) relative to a base model, rather than distributing the full fine-tuned weights directly. This approach is motivated by licensing constraints: when base models have restrictive licenses that prohibit redistribution, the delta method allows sharing fine-tuning artifacts without violating those terms. Users who already have authorized access to the base model can reconstruct the full fine-tuned weights by applying the delta.
Description
Creating Deltas by Subtracting Base Model Weights
The delta creation process loads both the fine-tuned model and the original base model, then computes the element-wise difference for every parameter tensor:
delta[name] = finetuned_params[name] - base_params[name]
This produces a set of tensors that represent only the changes introduced by fine-tuning. Each delta tensor has the same shape as the corresponding fine-tuned tensor, so the uncompressed delta checkpoint is the same size as the full model. It may compress better than the full weights when many parameter differences are near zero, as can happen for parameters that fine-tuning barely changed.
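The subtraction step can be sketched in a few lines. This is a minimal illustration, not the code in fastchat/model/make_delta.py: NumPy arrays stand in for PyTorch tensors, plain dicts stand in for state dicts, and the function name `make_delta` and the parameter names are hypothetical.

```python
import numpy as np

def make_delta(finetuned_params, base_params):
    """Return {name: finetuned - base} for every parameter tensor."""
    delta = {}
    for name, tuned in finetuned_params.items():
        # Every fine-tuned parameter needs a same-shaped base counterpart.
        if name not in base_params:
            raise KeyError(f"base model has no parameter named {name!r}")
        base = base_params[name]
        if base.shape != tuned.shape:
            raise ValueError(f"shape mismatch for {name!r}: {base.shape} vs {tuned.shape}")
        delta[name] = tuned - base
    return delta

# Toy two-parameter "model"; names and values are made up for illustration.
base_params = {
    "layer.weight": np.array([[0.5, -1.0], [2.0, 0.25]], dtype=np.float32),
    "layer.bias": np.array([0.0, 1.0], dtype=np.float32),
}
finetuned_params = {
    "layer.weight": np.array([[0.5, -0.75], [2.0, 0.5]], dtype=np.float32),
    "layer.bias": np.array([0.0, 1.0], dtype=np.float32),
}
delta = make_delta(finetuned_params, base_params)
```

In this toy case the bias was untouched by "fine-tuning", so its delta is all zeros, while the delta keeps the same shape as the original tensor.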
Applying Deltas by Addition
To reconstruct the fine-tuned model, a user loads the base model and the delta checkpoint, then adds them element-wise:
reconstructed[name] = base_params[name] + delta[name]
The result matches the original fine-tuned model up to floating-point rounding. This operation is straightforward and requires no specialized tooling beyond standard tensor operations.
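A round trip makes the reconstruction concrete. The sketch below is an assumption-laden illustration rather than the code in fastchat/model/apply_delta.py: NumPy arrays replace PyTorch tensors, and `apply_delta` and the single parameter `"w"` are hypothetical names.

```python
import numpy as np

def apply_delta(base_params, delta):
    """Return {name: base + delta} for every tensor in the delta."""
    reconstructed = {}
    for name, diff in delta.items():
        if name not in base_params:
            raise KeyError(f"base model has no parameter named {name!r}")
        reconstructed[name] = base_params[name] + diff
    return reconstructed

# Synthetic weights: the "fine-tuned" tensor is the base plus a small perturbation.
rng = np.random.default_rng(0)
base_params = {"w": rng.standard_normal((4, 3)).astype(np.float32)}
finetuned_params = {
    "w": base_params["w"] + 0.01 * rng.standard_normal((4, 3)).astype(np.float32)
}

# Create the delta by subtraction, then undo it by addition.
delta = {"w": finetuned_params["w"] - base_params["w"]}
reconstructed = apply_delta(base_params, delta)

# Largest element-wise difference between the reconstruction and the target.
max_error = float(np.max(np.abs(reconstructed["w"] - finetuned_params["w"])))
```

Comparing with a tolerance (for example `np.allclose`) rather than strict equality reflects the rounding caveat: each subtraction and addition may round its result.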
Handling Tokenizer Vocabulary Differences
Fine-tuning sometimes introduces new tokens to the vocabulary (e.g., special tokens for conversation formatting). When this occurs, the embedding and language model head matrices in the fine-tuned model have different dimensions than the base model. The delta process must account for this by:
- Padding the base model's embedding matrices with zeros to match the fine-tuned model's vocabulary size before subtraction.
- Storing the expanded vocabulary in the delta checkpoint so that applying the delta correctly reconstructs the larger embedding layer.
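The zero-padding step described above can be sketched as follows. This assumes embedding matrices laid out as (vocab_size, hidden_size) and that new tokens are appended at the end of the vocabulary; `pad_vocab_rows` is a hypothetical helper, and NumPy arrays stand in for PyTorch tensors.

```python
import numpy as np

def pad_vocab_rows(base_matrix, target_vocab_size):
    """Zero-pad a (vocab, hidden) matrix with extra rows up to target_vocab_size."""
    extra = target_vocab_size - base_matrix.shape[0]
    if extra < 0:
        raise ValueError("fine-tuned vocabulary is smaller than the base vocabulary")
    # np.pad defaults to constant mode, which fills the new rows with zeros.
    return np.pad(base_matrix, ((0, extra), (0, 0)))

# Base model: 3 tokens, hidden size 2. Fine-tuned model: one extra token row.
base_embed = np.arange(6, dtype=np.float32).reshape(3, 2)
tuned_embed = np.vstack([base_embed + 0.5, [[9.0, 9.0]]]).astype(np.float32)

padded_base = pad_vocab_rows(base_embed, tuned_embed.shape[0])
delta_embed = tuned_embed - padded_base      # shape (4, 2)
reconstructed = padded_base + delta_embed    # same padding is needed when applying
```

Because the padded rows are zero, the delta rows for the new tokens carry the full fine-tuned embedding values for those tokens, while rows for existing tokens carry only the difference.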
Low-CPU-Memory Mode for Large Models
For models with billions of parameters (7B, 13B, 33B, 65B), loading two full copies simultaneously may exceed available system RAM. The low-CPU-memory mode addresses this by loading the base model with each parameter tensor mapped to the meta device, then materializing and processing parameters one at a time. This reduces peak memory usage from roughly two full model copies to approximately one, making delta application feasible on machines with limited RAM.
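The memory pattern of processing one tensor at a time can be illustrated with a simplified sketch. It does not use the meta device or reproduce FastChat's implementation: it assumes, purely for illustration, that each parameter is stored in its own `.npy` file, and the names `save_params` and `apply_delta_streaming` are hypothetical.

```python
import os
import tempfile
import numpy as np

def save_params(directory, params):
    """Write each tensor to its own file (an assumed on-disk layout)."""
    for name, value in params.items():
        np.save(os.path.join(directory, name + ".npy"), value)

def apply_delta_streaming(base_dir, delta_dir, out_dir):
    """Apply a delta one tensor at a time instead of loading both checkpoints."""
    for filename in sorted(os.listdir(delta_dir)):
        diff = np.load(os.path.join(delta_dir, filename))
        base = np.load(os.path.join(base_dir, filename))
        np.save(os.path.join(out_dir, filename), base + diff)
        # diff and base are rebound on the next iteration, so only the
        # tensors for the current parameter need to be held in memory.

with tempfile.TemporaryDirectory() as root:
    base_dir, delta_dir, out_dir = (
        os.path.join(root, d) for d in ("base", "delta", "out")
    )
    for d in (base_dir, delta_dir, out_dir):
        os.makedirs(d)

    # Toy checkpoints with made-up parameter names "a" and "b".
    base = {"a": np.ones((2, 2), dtype=np.float32), "b": np.zeros(3, dtype=np.float32)}
    delta = {"a": np.full((2, 2), 0.5, dtype=np.float32), "b": np.ones(3, dtype=np.float32)}
    save_params(base_dir, base)
    save_params(delta_dir, delta)

    apply_delta_streaming(base_dir, delta_dir, out_dir)
    result_a = np.load(os.path.join(out_dir, "a.npy"))
    result_b = np.load(os.path.join(out_dir, "b.npy"))
```

In this sketch the peak footprint is set by the largest single tensor rather than by whole checkpoints; real checkpoints are usually sharded into multi-tensor files, so the achievable saving depends on the storage format.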
Theoretical Basis
Delta distribution exploits the linearity of weight space. Given a base model with parameters W_base and a fine-tuned model with parameters W_finetuned, the delta is defined as:
Delta = W_finetuned - W_base
Since neural network parameters live in a Euclidean vector space, the reconstruction W_base + Delta = W_finetuned is exact (within floating-point precision). This property is independent of the model architecture, training procedure, or hyperparameters; it follows directly from the arithmetic of real-valued tensors.
The approach is conceptually related to techniques in version control (storing diffs rather than full snapshots) and compression (delta encoding), where representing changes relative to a reference can reduce storage. In the context of model distribution, it is intended as a way to share fine-tuning results without redistributing proprietary base model weights; whether it satisfies a particular license depends on that license's terms.
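The "within floating-point precision" qualifier on the reconstruction identity can be checked numerically. The sketch below uses synthetic float16 weights, which is an assumption chosen to make rounding visible; it measures the round-trip error rather than asserting that it is zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic half-precision weights; the fine-tuned vector is a small perturbation.
w_base = rng.standard_normal(10_000).astype(np.float16)
w_finetuned = (
    w_base.astype(np.float32) + 0.01 * rng.standard_normal(10_000)
).astype(np.float16)

delta = w_finetuned - w_base      # result is rounded to float16
reconstructed = w_base + delta    # rounded to float16 again

# Compare in float32 so the comparison itself adds no float16 rounding.
max_abs_error = float(
    np.max(np.abs(reconstructed.astype(np.float32) - w_finetuned.astype(np.float32)))
)
print(f"max abs round-trip error in float16: {max_abs_error:.3e}")
```

Over the real numbers the error would be exactly zero; in finite precision any residual is bounded by a few units in the last place of the values involved, which is why reconstructed weights are best compared with a tolerance.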