Principle:Vllm project Vllm Draft Model Acquisition

From Leeroopedia


Knowledge Sources
Domains Model Management, Speculative Decoding, Artifact Resolution
Last Updated 2026-02-08 13:00 GMT

Overview

Acquiring draft model weights is the process of resolving and downloading the auxiliary model artifacts required by a speculative decoding method before inference can begin.

Description

Different speculative decoding methods require different kinds of auxiliary model artifacts:

  • EAGLE / EAGLE3: Requires downloading an EAGLE checkpoint, which is a small head network (typically a single transformer layer plus an embedding layer) trained on top of the target model's hidden states. These checkpoints are hosted on Hugging Face Hub as separate repositories (e.g., yuhuili/EAGLE-LLaMA3.1-Instruct-8B or yuhuili/EAGLE3-LLaMA3.1-Instruct-8B). The checkpoint must match the specific target model it was trained against.
  • Draft Model: Requires downloading a complete smaller language model from the same model family. For example, when the target model is meta-llama/Llama-3.1-8B-Instruct, a suitable draft model might be meta-llama/Llama-3.2-1B-Instruct. The draft model must share the same vocabulary and tokenizer as the target model.
  • N-gram (Prompt Lookup): Requires no additional model download. The n-gram proposer operates purely on the existing token sequence, making it the simplest method to deploy.
  • MTP (Multi-Token Prediction): Requires no additional model download. The MTP heads are part of the target model's own architecture and weights. However, the target model must natively support MTP (e.g., DeepSeek-V3).
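
The mapping from method to required artifact can be written as a small lookup table. This is a minimal sketch: the method keys and the `ArtifactKind` labels are illustrative names chosen here, not identifiers taken from vLLM.

```python
from enum import Enum


class ArtifactKind(Enum):
    """What must be fetched in addition to the target model (illustrative labels)."""

    EAGLE_CHECKPOINT = "eagle_checkpoint"  # small head trained against one target model
    FULL_DRAFT_MODEL = "full_draft_model"  # complete smaller LM sharing the tokenizer
    NONE = "none"                          # nothing extra to download


# Hypothetical method keys that mirror the list above, not a real config schema.
REQUIRED_ARTIFACT = {
    "eagle": ArtifactKind.EAGLE_CHECKPOINT,
    "eagle3": ArtifactKind.EAGLE_CHECKPOINT,
    "draft_model": ArtifactKind.FULL_DRAFT_MODEL,
    "ngram": ArtifactKind.NONE,
    "mtp": ArtifactKind.NONE,  # heads ship inside the target model's own weights
}


def needs_download(method: str) -> bool:
    """Return True if the method requires an extra repository to be fetched."""
    try:
        return REQUIRED_ARTIFACT[method.lower()] is not ArtifactKind.NONE
    except KeyError:
        raise ValueError(f"unknown speculative decoding method: {method!r}") from None
```

A deployment script could call `needs_download(method)` up front to decide whether any storage or bandwidth needs to be provisioned beyond the target model itself.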

The acquisition step is critical for deployment planning because it determines storage requirements, download time, and GPU memory overhead.

Usage

Use this principle when preparing the infrastructure for speculative decoding deployment. Understanding which artifacts are needed helps plan storage provisioning, network bandwidth requirements, and cache warming strategies. For offline or air-gapped deployments, all required model artifacts must be pre-downloaded.

Theoretical Basis

Model Artifact Resolution

The draft model acquisition process follows a resolution chain:

  1. Method selection determines the artifact type (EAGLE checkpoint, full model, or none)
  2. Repository identification resolves the Hugging Face Hub repository ID for the artifact
  3. Snapshot download retrieves all files in the repository to a local cache directory
  4. Path resolution returns the local filesystem path for use in configuration
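
The four steps above can be sketched as a single function. This is an illustration under stated assumptions, not vLLM's implementation: `resolve_draft_artifact`, `EXAMPLE_REPOS`, and the method keys are hypothetical names, the repository IDs are the examples quoted earlier on this page, and pairing the EAGLE checkpoints with that target model is inferred from their repository names. The downloader is injectable so the chain can be exercised without network access.

```python
from typing import Callable, Optional

TARGET = "meta-llama/Llama-3.1-8B-Instruct"

# Illustrative lookup keyed by (method, target model); a real deployment
# would maintain its own table.
EXAMPLE_REPOS = {
    ("eagle", TARGET): "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
    ("eagle3", TARGET): "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
    ("draft_model", TARGET): "meta-llama/Llama-3.2-1B-Instruct",
}
NO_DOWNLOAD_METHODS = frozenset({"ngram", "mtp"})


def _hub_download(repo_id: str) -> str:
    # Imported lazily so the sketch can be imported without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)


def resolve_draft_artifact(
    method: str,
    target_model: str,
    download: Callable[[str], str] = _hub_download,
) -> Optional[str]:
    """Return the local path of the draft artifact, or None if none is needed."""
    # Step 1: method selection decides whether any artifact is required.
    if method in NO_DOWNLOAD_METHODS:
        return None
    # Step 2: repository identification.
    try:
        repo_id = EXAMPLE_REPOS[(method, target_model)]
    except KeyError:
        raise LookupError(f"no known artifact for {method!r} on {target_model!r}") from None
    # Steps 3 and 4: download the snapshot and return its local path.
    return download(repo_id)
```

Raising on an unknown (method, target) pair, rather than falling back to some default, reflects the requirement that an EAGLE checkpoint or draft model must match its target model.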

The huggingface_hub.snapshot_download function handles the actual download, providing:

  • Caching: Files are cached locally and reused on subsequent calls with the same repo_id and revision
  • Consistency checking: Each downloaded file's size is checked against the size reported by the Hub, and cached files are keyed by the ETag the Hub reports for them
  • Resumable downloads: Partial downloads can be resumed if interrupted
  • Revision pinning: Specific model versions can be requested via commit hash or tag
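
As a hedged illustration of these features, the sketch below builds the keyword arguments for `snapshot_download` and then makes the call. The wrapper names `snapshot_kwargs` and `fetch_snapshot` are hypothetical; `repo_id`, `revision`, `cache_dir`, and `local_files_only` are existing parameters of `huggingface_hub.snapshot_download`.

```python
from typing import Any, Dict, Optional


def snapshot_kwargs(
    repo_id: str,
    revision: Optional[str] = None,
    cache_dir: Optional[str] = None,
    offline: bool = False,
) -> Dict[str, Any]:
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    return {
        "repo_id": repo_id,
        "revision": revision,         # branch, tag, or commit hash; None means the default branch
        "cache_dir": cache_dir,       # None uses the default Hugging Face cache location
        "local_files_only": offline,  # True: use only the local cache, never the network
    }


def fetch_snapshot(repo_id: str, **options: Any) -> str:
    """Download (or reuse from cache) a repository snapshot and return its local path."""
    # Imported lazily so that importing this module does not require the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(**snapshot_kwargs(repo_id, **options))
```

For an air-gapped deployment, one plausible workflow is to call `fetch_snapshot` once on a connected machine with a pinned `revision`, copy the cache directory to the target host, and then call it there with `offline=True`. Setting the `HF_HUB_OFFLINE=1` environment variable has a similar effect for the whole process.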

Memory Budget Considerations

When planning GPU memory allocation for speculative decoding:

Total GPU Memory = Target Model Weights + KV Cache + Draft Model Weights + Overhead

Where Draft Model Weights varies by method. The figures below are rough and scale with the size and precision of the models involved:

  • EAGLE head: ~100-500 MB (single transformer layer)
  • Draft model: ~1-4 GB (full small model)
  • N-gram: 0 MB
  • MTP: 0 MB (included in target model weights)

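The tradeoff is that draft weights are carved out of the same memory that would otherwise hold KV cache. A full draft model takes the largest share and so leaves the least room for KV cache, an EAGLE head takes a much smaller share, and N-gram and MTP add no extra weights. Whether a heavier option is worth it depends on the end-to-end speedup it delivers, which should be measured on the target workload rather than assumed.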
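
The budget equation can be rearranged to show how much memory remains for KV cache once the weights are placed. The sketch below is a back-of-the-envelope calculator with hypothetical function names; the 80 GB device, 16 GB target, and 4 GB overhead are illustrative inputs, not measurements.

```python
def weights_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight size in GB (2 bytes per parameter for fp16/bf16)."""
    return num_params * bytes_per_param / 1e9


def kv_cache_budget_gb(
    total_gpu_gb: float,
    target_weights_gb: float,
    draft_weights_gb: float = 0.0,
    overhead_gb: float = 0.0,
) -> float:
    """Rearrange the budget equation to solve for the memory left for KV cache."""
    remaining = total_gpu_gb - target_weights_gb - draft_weights_gb - overhead_gb
    if remaining < 0:
        raise ValueError("weights and overhead exceed total GPU memory")
    return remaining


# Illustrative inputs only: one 80 GB device, 16 GB of target weights, 4 GB overhead.
for label, draft_gb in [("draft model", 2.0), ("EAGLE head", 0.5), ("n-gram / MTP", 0.0)]:
    left = kv_cache_budget_gb(80.0, 16.0, draft_gb, overhead_gb=4.0)
    print(f"{label:>12}: {left:.1f} GB left for KV cache")
```

With these inputs the difference between methods is a couple of gigabytes of KV cache, which matters most when the target model already fills most of the device.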
