Implementation:Romsto Speculative Decoding AutoModelForCausalLM From Pretrained

From Leeroopedia
Knowledge Sources
Domains NLP, Model_Management
Last Updated 2026-02-14 04:30 GMT

Overview

Wrapper documentation for HuggingFace's AutoModelForCausalLM.from_pretrained and AutoTokenizer.from_pretrained as used in this repository for loading target and drafter models.

Description

This repository uses HuggingFace AutoModelForCausalLM.from_pretrained to load decoder-only causal language models for speculative decoding inference. The default configuration loads Llama 3.2 models, optionally quantized to int8 via QuantoConfig. The target model (e.g., 3B parameters) and the drafter model (e.g., 1B parameters) are loaded through the same call and placed on the same device.

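AutoTokenizer.from_pretrained loads the corresponding tokenizer. A single tokenizer is shared by both models, so the target and drafter must use the same vocabulary; otherwise a token ID proposed by the drafter would not refer to the same token in the target.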
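One way to check the shared-vocabulary requirement before decoding is to compare the token-to-id maps of the two checkpoints' tokenizers. The sketch below relies only on the get_vocab() method that HuggingFace tokenizers expose. The names same_vocabulary and StubTokenizer are hypothetical and exist only so the snippet runs without downloading a model; this check is not taken from the repository.

```python
from typing import Dict, Protocol


class HasVocab(Protocol):
    """Anything exposing get_vocab(), as HuggingFace tokenizers do."""

    def get_vocab(self) -> Dict[str, int]: ...


def same_vocabulary(tok_a: HasVocab, tok_b: HasVocab) -> bool:
    """Return True when both tokenizers map every token to the same id."""
    return tok_a.get_vocab() == tok_b.get_vocab()


class StubTokenizer:
    """Hypothetical stand-in for a real tokenizer, for illustration only."""

    def __init__(self, vocab: Dict[str, int]) -> None:
        self._vocab = dict(vocab)

    def get_vocab(self) -> Dict[str, int]:
        return dict(self._vocab)


target_tok = StubTokenizer({"<s>": 0, "hello": 1, "world": 2})
drafter_tok = StubTokenizer({"<s>": 0, "hello": 1, "world": 2})
other_tok = StubTokenizer({"<s>": 0, "hello": 2, "world": 1})

print(same_vocabulary(target_tok, drafter_tok))  # True
print(same_vocabulary(target_tok, other_tok))    # False
```

With real checkpoints, the two arguments would be the tokenizers returned by AutoTokenizer.from_pretrained for the target and drafter model IDs.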

External Reference

Usage

Call these loaders at the start of any inference workflow. Load both the target and drafter models for standard speculative decoding, or only the target model for NASD, where n-gram storage replaces the drafter. Set device_map="cuda" for GPU inference. Pass a QuantoConfig with int8 weights to reduce memory use on constrained hardware.
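To see why int8 quantization helps on constrained hardware, a rough weight-memory estimate multiplies the parameter count by the bytes stored per weight. The sketch below treats the nominal 3B and 1B sizes from the description as assumed parameter counts and ignores activations, the KV cache, and quantization scales, so the figures are approximate lower bounds rather than measurements.

```python
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory taken by the weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1e9


# Nominal sizes (assumption): real checkpoints differ slightly.
models = {"target (3B)": 3e9, "drafter (1B)": 1e9}

for name, params in models.items():
    for dtype in ("float32", "float16", "int8"):
        print(f"{name:13s} {dtype:8s} ~{weight_memory_gb(params, dtype):.1f} GB")
```

Under these assumptions, holding both models in int8 takes about 4 GB for weights, compared with about 16 GB in float32.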

Code Reference

Source Location

Signature

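# HuggingFace API (external); simplified to the arguments used in this repository.
# Other options are accepted through **kwargs.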
AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path: str,
    quantization_config: Optional[QuantoConfig] = None,
    device_map: Optional[str] = None,
    trust_remote_code: bool = False,
    **kwargs,
) -> PreTrainedModel

AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path: str,
    trust_remote_code: bool = False,
    **kwargs,
) -> PreTrainedTokenizer

Import

from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

I/O Contract

Inputs

Name | Type | Required | Description
pretrained_model_name_or_path | str | Yes | HuggingFace model ID (e.g., "meta-llama/Llama-3.2-3B-Instruct") or local path.
quantization_config | QuantoConfig | No | Quantization config. Use QuantoConfig(weights="int8") for int8 quantization, or None for full precision.
device_map | str | No | Device placement strategy. Use "cuda" for a single GPU.
trust_remote_code | bool | No | Allow execution of custom model code shipped with the checkpoint. Set True only for custom architectures from a source you trust.
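The optional inputs can be collected into a keyword dictionary and unpacked into from_pretrained. This is a minimal sketch, not code from the repository: build_load_kwargs is a hypothetical helper, and QuantoConfig is imported lazily so the function can be called without transformers installed when quantization is off.

```python
from typing import Any, Dict


def build_load_kwargs(
    use_int8: bool = False,
    device_map: str = "cuda",
    trust_remote_code: bool = False,
) -> Dict[str, Any]:
    """Assemble keyword arguments for AutoModelForCausalLM.from_pretrained."""
    kwargs: Dict[str, Any] = {
        "device_map": device_map,
        "trust_remote_code": trust_remote_code,
    }
    if use_int8:
        # Imported here so full-precision loading does not need this symbol.
        from transformers import QuantoConfig

        kwargs["quantization_config"] = QuantoConfig(weights="int8")
    return kwargs


print(build_load_kwargs())
# Intended use (downloads weights, so it is left commented out):
# model = AutoModelForCausalLM.from_pretrained(model_id, **build_load_kwargs(use_int8=True))
```

Building the arguments once and reusing them for both models keeps the target and drafter on the same device with the same quantization settings.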

Outputs

Name | Type | Description
model | PreTrainedModel | Loaded model in eval mode. Access vocab size via model.config.vocab_size.
tokenizer | PreTrainedTokenizer | Loaded tokenizer with encode/decode/apply_chat_template methods.
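Because model.config.vocab_size is available on every loaded model, it can serve as a quick sanity check that the target and drafter produce logits over the same number of tokens. The sketch below is an illustration under that assumption: assert_compatible_vocab is a hypothetical helper, and the SimpleNamespace objects are stand-ins that expose only config.vocab_size, with an arbitrary size.

```python
from types import SimpleNamespace
from typing import Any


def assert_compatible_vocab(target: Any, drafter: Any) -> int:
    """Raise if the two models produce logits over different vocabulary sizes."""
    t_size = target.config.vocab_size
    d_size = drafter.config.vocab_size
    if t_size != d_size:
        raise ValueError(f"vocab_size mismatch: target={t_size}, drafter={d_size}")
    return t_size


# Hypothetical stand-ins exposing only .config.vocab_size; the number is arbitrary.
fake_target = SimpleNamespace(config=SimpleNamespace(vocab_size=1000))
fake_drafter = SimpleNamespace(config=SimpleNamespace(vocab_size=1000))
print(assert_compatible_vocab(fake_target, fake_drafter))  # 1000
```

With real models, the same function can be called on the two objects returned by AutoModelForCausalLM.from_pretrained right after loading.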

Usage Examples

Loading Target and Drafter for Speculative Decoding

from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

# Target model (larger)
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    quantization_config=QuantoConfig(weights="int8"),
    device_map="cuda",
    trust_remote_code=True,
)
target.eval()

# Drafter model (smaller, same family)
drafter = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    quantization_config=QuantoConfig(weights="int8"),
    device_map="cuda",
    trust_remote_code=True,
)
drafter.eval()

# Shared tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    trust_remote_code=True,
)

Loading Target Only for NASD

from transformers import AutoModelForCausalLM, AutoTokenizer

# Only target model needed (n-gram storage replaces drafter)
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    device_map="cuda",
)
target.eval()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
