Implementation:Romsto Speculative Decoding AutoModelForCausalLM From Pretrained

Knowledge Sources	HuggingFace AutoModelForCausalLM Speculative Decoding
Domains	NLP, Model_Management
Last Updated	2026-02-14 04:30 GMT

Overview

Wrapper documentation for HuggingFace's AutoModelForCausalLM.from_pretrained and AutoTokenizer.from_pretrained as used in this repository for loading target and drafter models.

Description

This repository uses HuggingFace AutoModelForCausalLM.from_pretrained to load decoder-only causal language models for speculative decoding inference. The default configuration loads Llama 3.2 models with optional int8 quantization via QuantoConfig. Both the target model (e.g., 3B parameters) and drafter model (e.g., 1B parameters) are loaded using the same API with the same device placement.

AutoTokenizer.from_pretrained loads the corresponding tokenizer, which is shared between both models (they must use the same vocabulary).
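
The shared-vocabulary requirement can be checked right after loading. The following is a minimal sketch, not part of the repository: the helper name assert_shared_vocab is hypothetical, and it assumes only that each model exposes config.vocab_size. Matching vocabulary sizes are a necessary condition, not a guarantee that token IDs map to the same strings.

```python
from types import SimpleNamespace


def assert_shared_vocab(target, drafter):
    """Raise ValueError if the two models report different vocabulary sizes.

    Hypothetical helper: relies only on the `config.vocab_size` attribute.
    """
    target_size = target.config.vocab_size
    drafter_size = drafter.config.vocab_size
    if target_size != drafter_size:
        raise ValueError(
            f"Vocabulary mismatch: target={target_size}, drafter={drafter_size}"
        )
    return target_size


# Stand-in objects so the sketch runs without downloading any weights;
# the sizes are placeholders, not the real Llama 3.2 vocabulary size.
fake_target = SimpleNamespace(config=SimpleNamespace(vocab_size=1000))
fake_drafter = SimpleNamespace(config=SimpleNamespace(vocab_size=1000))
shared_size = assert_shared_vocab(fake_target, fake_drafter)
```

With real models, the same call would take the loaded target and drafter objects in place of the stand-ins.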

External Reference

Usage

Load the models once at the start of any inference workflow. Load both the target and drafter models for standard speculative decoding, or only the target model for NASD, where n-gram storage replaces the drafter. Set device_map="cuda" for single-GPU inference. Pass QuantoConfig(weights="int8") as quantization_config to reduce memory use on constrained hardware.
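
As a rough illustration of the memory saving from int8 quantization, weight storage scales with the number of bytes per parameter. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes nominal parameter counts of 3B and 1B, 4 bytes per parameter for full-precision (float32) weights and 1 byte for int8, and it ignores activations, the KV cache, and any quantization metadata. The function name weight_memory_gb is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


TARGET_PARAMS = 3_000_000_000   # nominal "3B" target (assumption)
DRAFTER_PARAMS = 1_000_000_000  # nominal "1B" drafter (assumption)

for name, params in [("target", TARGET_PARAMS), ("drafter", DRAFTER_PARAMS)]:
    fp32 = weight_memory_gb(params, 4)  # float32: 4 bytes per parameter
    int8 = weight_memory_gb(params, 1)  # int8: 1 byte per parameter
    print(f"{name}: float32 ~{fp32:.1f} GB, int8 ~{int8:.1f} GB")
```

Under these assumptions the two models together need roughly 16 GB for weights in float32 versus about 4 GB in int8.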

Code Reference

Source Location

Repository: Speculative-Decoding
File: infer.py (usage pattern)
Lines: L89-112

Signature

# HuggingFace API (external); only the arguments used in this repository are shown
AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path: str,
    quantization_config: Optional[QuantoConfig] = None,
    device_map: Optional[str] = None,
    trust_remote_code: bool = False,
    **kwargs,
) -> PreTrainedModel

AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path: str,
    trust_remote_code: bool = False,
    **kwargs,
) -> PreTrainedTokenizer

Import

from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

I/O Contract

Inputs

Name	Type	Required	Description
pretrained_model_name_or_path	str	Yes	HuggingFace model ID (e.g., "meta-llama/Llama-3.2-3B-Instruct") or local path
quantization_config	QuantoConfig	No	Quantization config. Use QuantoConfig(weights="int8") for int8 quantization, or None for full precision.
device_map	str	No	Device placement strategy. Use "cuda" for single GPU.
trust_remote_code	bool	No	Allow executing custom model code shipped with the model repository. Needed only for architectures not built into transformers; defaults to False.

Outputs

Name	Type	Description
model	PreTrainedModel	Loaded model in eval mode. Access vocab size via model.config.vocab_size.
tokenizer	PreTrainedTokenizer	Loaded tokenizer with encode/decode/apply_chat_template methods.
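
The inputs above can be assembled in one place so that the target and drafter are loaded with identical settings. This is a minimal sketch rather than the repository's code: build_load_kwargs is a hypothetical helper, and QuantoConfig is imported lazily so the full-precision path runs without the quantization dependencies installed.

```python
def build_load_kwargs(use_int8=False, device="cuda", trust_remote_code=False):
    """Return keyword arguments for AutoModelForCausalLM.from_pretrained.

    Hypothetical helper; the keys mirror the inputs table on this page.
    """
    quantization_config = None
    if use_int8:
        # Imported only when needed, so full precision has no extra dependency.
        from transformers import QuantoConfig

        quantization_config = QuantoConfig(weights="int8")
    return {
        "quantization_config": quantization_config,
        "device_map": device,
        "trust_remote_code": trust_remote_code,
    }


full_precision_kwargs = build_load_kwargs(use_int8=False)

# Intended use (downloads weights, so it is left commented out here):
# from transformers import AutoModelForCausalLM
# target = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Llama-3.2-3B-Instruct", **full_precision_kwargs
# )
```

Passing the same dictionary to both from_pretrained calls keeps device placement and quantization consistent between the two models.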

Usage Examples

Loading Target and Drafter for Speculative Decoding

from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

# Target model (larger)
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    quantization_config=QuantoConfig(weights="int8"),
    device_map="cuda",
    trust_remote_code=True,
)
target.eval()

# Drafter model (smaller, same family)
drafter = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    quantization_config=QuantoConfig(weights="int8"),
    device_map="cuda",
    trust_remote_code=True,
)
drafter.eval()

# Shared tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    trust_remote_code=True,
)

Loading Target Only for NASD

from transformers import AutoModelForCausalLM, AutoTokenizer

# Only target model needed (n-gram storage replaces drafter)
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    device_map="cuda",
)
target.eval()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
