
Heuristic:Hugging Face PEFT DoRA Inference Caching

From Leeroopedia




Knowledge Sources
Domains Optimization, Inference, LLMs
Last Updated 2026-02-07 06:44 GMT

Overview

Enable `ENABLE_DORA_CACHING = True` to speed up DoRA inference by caching weight norms and LoRA weights, trading memory for latency.

Description

DoRA (Weight-Decomposed Low-Rank Adaptation) decomposes weight updates into magnitude and direction components. During inference, the weight norm (the column-wise L2 norm of the adapted weight) and the LoRA weight (`lora_B @ lora_A`) must be computed on each forward pass. Setting the `ENABLE_DORA_CACHING` flag to `True` caches these intermediate values between forward passes in eval mode. The cache is keyed by adapter name and is automatically cleared when entering training mode.
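The two cached quantities can be sketched numerically. A minimal pure-Python illustration with toy 2x2 matrices (illustrative values only, not the library's implementation) of the LoRA weight and the column-wise L2 norm that DoRA would otherwise recompute on every forward pass:

```python
import math

# Toy base weight W (2x2) and rank-1 LoRA factors; all values illustrative.
W = [[1.0, 0.0],
     [0.0, 1.0]]
lora_A = [[0.5, 0.5]]      # shape (r=1, in=2)
lora_B = [[1.0], [2.0]]    # shape (out=2, r=1)
scaling = 2.0

# lora_weight = lora_B @ lora_A  (out x in) -- one of the cached values
lora_weight = [[sum(lora_B[i][k] * lora_A[k][j] for k in range(1))
                for j in range(2)] for i in range(2)]

# Adapted weight W' = W + scaling * lora_weight
W_adapted = [[W[i][j] + scaling * lora_weight[i][j]
              for j in range(2)] for i in range(2)]

# weight_norm: column-wise L2 norm of W' -- the other cached value
weight_norm = [math.sqrt(sum(W_adapted[i][j] ** 2 for i in range(2)))
               for j in range(2)]
```

Neither value depends on the input batch, which is why both are safe to cache as long as the weights themselves do not change.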

Usage

Use this heuristic when:

  • Running DoRA inference (not training) and latency matters
  • Serving DoRA models in production with repeated forward passes
  • The model is in `eval()` mode and adapter names are consistent

Do NOT enable during training, as the cache would hold stale values during gradient updates.

The Insight (Rule of Thumb)

  • Action: Set `peft.tuners.lora.dora.ENABLE_DORA_CACHING = True` before inference.
  • Value: Caches `weight-norm` and `lora-weight` per adapter per layer.
  • Trade-off: Increases memory by storing cached tensors (one weight norm + one LoRA weight per DoRA layer per adapter). Speeds up inference by avoiding redundant norm and matrix computations.
  • Compatibility: Only active in eval mode. Automatically disabled during training. Works with all models using `use_dora=True`.
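Applied together, the points above amount to a short setup sketch. This assumes the flag lives at `peft.tuners.lora.dora` as quoted in Code Evidence, and `model`/`inputs` stand in for an already-loaded DoRA model and its tokenized batch (both hypothetical here):

```python
import torch
import peft.tuners.lora.dora as dora

# Enable caching before inference; per the trade-off above, this spends
# memory (cached norms and LoRA weights per layer/adapter) to save latency.
dora.ENABLE_DORA_CACHING = True

model.eval()  # caching is only active in eval mode (hypothetical model)
with torch.no_grad():
    outputs = model(**inputs)  # hypothetical tokenized batch
```

Switching the model back to `train()` clears the cache automatically, so no manual invalidation is needed between inference and fine-tuning phases.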

Reasoning

DoRA computes `weight_norm = ||W + scaling * (lora_B @ lora_A)||_2` per column at each forward pass. For large models, this L2 norm computation is non-trivial. During inference, since weights do not change, caching these values eliminates redundant computation. The implementation uses a decorator pattern that checks `self.training` and the global `ENABLE_DORA_CACHING` flag before consulting the cache.
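The decorator pattern described above can be sketched in a stripped-down, self-contained form. This is a simplification of the excerpt in Code Evidence, not PEFT's actual code; `DemoLayer` and its dict-based `_cache` are illustrative stand-ins:

```python
from functools import wraps

ENABLE_DORA_CACHING = True  # module-level switch, as in the heuristic

def cache_decorator(cache_key):
    def cache_value(func):
        @wraps(func)
        def wrapper(self, *args, **kwargs):
            adapter_name = kwargs.get("adapter_name")
            # Bypass AND clear the cache in training mode, when the flag is
            # off, or when no adapter name is given, so no stale values
            # survive a weight update.
            if not ENABLE_DORA_CACHING or self.training or adapter_name is None:
                self._cache.clear()
                return func(self, *args, **kwargs)
            key = f"{cache_key}-{adapter_name}"  # cache keyed per adapter
            if key not in self._cache:
                self._cache[key] = func(self, *args, **kwargs)
            return self._cache[key]
        return wrapper
    return cache_value

class DemoLayer:
    def __init__(self):
        self.training = False  # eval mode by default
        self._cache = {}
        self.calls = 0

    @cache_decorator("weight-norm")
    def weight_norm(self, *, adapter_name):
        self.calls += 1   # count real (non-cached) computations
        return 42.0       # stand-in for the column-wise L2 norm

layer = DemoLayer()
layer.weight_norm(adapter_name="default")
layer.weight_norm(adapter_name="default")  # second call served from cache
```

In eval mode the expensive function body runs once per adapter; flipping `training` to `True` routes every call past the cache and empties it.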

Additionally, the LoRA weight computation avoids direct matrix multiplication `lora_B.weight @ lora_A.weight` because this causes errors with FSDP (Fully Sharded Data Parallel). Instead, it calculates the equivalent result using forward passes, which is another reason caching is beneficial.
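The equivalence relied on here can be checked with toy matrices: pushing an identity matrix through the two linear maps in sequence reproduces `lora_B @ lora_A` without forming the direct product. This is a pure-Python sketch of the idea only; PEFT's FSDP-safe version uses the layers' actual forward calls:

```python
def matmul(X, Y):
    """Plain nested-loop matrix product for small illustrative matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

lora_A = [[0.5, -1.0]]       # (r=1, in=2)
lora_B = [[2.0], [3.0]]      # (out=2, r=1)

# Direct product (the multiplication that errors under FSDP sharding)
direct = matmul(lora_B, lora_A)

# Forward-pass route: feed the identity through A, then B; composing the
# two maps on the identity yields the same (out x in) matrix.
identity = [[1.0, 0.0], [0.0, 1.0]]
via_forward = matmul(lora_B, matmul(lora_A, identity))
```

Because the forward route only ever calls each layer on an input, it works even when the underlying weight tensors are sharded across devices.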

Code Evidence

Caching flag definition from `src/peft/tuners/lora/dora.py:27-28`:

ENABLE_DORA_CACHING = False
"""Whether to enable DoRA caching, which makes it faster at
inference but requires more memory"""

Cache decorator from `src/peft/tuners/lora/dora.py:31-50`:

from functools import wraps

def cache_decorator(cache_key: str):
    """Caching decorator for DoRA

    Caching is only enabled if ENABLE_DORA_CACHING is set to True
    (default: False), when in eval mode, and when the adapter_name
    is passed (e.g. not during layer initialization).
    """
    def cache_value(func):
        @wraps(func)
        def wrapper(self, *args, **kwargs):
            adapter_name = kwargs.get("adapter_name")
            if (not ENABLE_DORA_CACHING) or self.training or \
               (adapter_name is None):
                self._cache_clear()
                return func(self, *args, **kwargs)
            cache_key_adapter = f"{cache_key}-{adapter_name}"
            output = self._cache_get(cache_key_adapter, None)
            if output is not None:
                return output
            ...

DoRA config option from `src/peft/tuners/lora/config.py:420-426`:

use_dora: bool
    # Enable 'Weight-Decomposed Low-Rank Adaptation' (DoRA).
    # This technique decomposes the updates of the weights into
    # two parts, magnitude and direction.
    # DoRA introduces a bigger overhead than pure LoRA, so it is
    # recommended to merge weights for inference.
