Implementation:Romsto Speculative Decoding AutoModelForCausalLM From Pretrained
| Knowledge Sources | |
|---|---|
| Domains | NLP, Model_Management |
| Last Updated | 2026-02-14 04:30 GMT |
Overview
Wrapper documentation for HuggingFace's AutoModelForCausalLM.from_pretrained and AutoTokenizer.from_pretrained as used in this repository for loading target and drafter models.
Description
This repository uses HuggingFace AutoModelForCausalLM.from_pretrained to load decoder-only causal language models for speculative decoding inference. The default configuration loads Llama 3.2 models with optional int8 quantization via QuantoConfig. Both the target model (e.g., 3B parameters) and drafter model (e.g., 1B parameters) are loaded using the same API with the same device placement.
AutoTokenizer.from_pretrained loads the corresponding tokenizer, which is shared between both models (they must use the same vocabulary).
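The shared-vocabulary requirement can be checked before any decoding starts. The sketch below is a minimal illustration, not code from the repository: `check_same_vocab_size` is a hypothetical helper, and it relies only on the `model.config.vocab_size` attribute. Equal sizes are a necessary condition for a shared vocabulary, not a proof that the token-to-id mappings match.

```python
from types import SimpleNamespace


def check_same_vocab_size(target, drafter) -> int:
    """Return the shared vocab size, or raise if the two models disagree.

    Hypothetical helper: it reads only `model.config.vocab_size`.
    """
    target_size = target.config.vocab_size
    drafter_size = drafter.config.vocab_size
    if target_size != drafter_size:
        raise ValueError(
            f"Vocabulary size mismatch: target={target_size}, drafter={drafter_size}"
        )
    return target_size


# Stand-ins for loaded models so the sketch runs without downloading weights.
fake_target = SimpleNamespace(config=SimpleNamespace(vocab_size=32000))
fake_drafter = SimpleNamespace(config=SimpleNamespace(vocab_size=32000))
shared_size = check_same_vocab_size(fake_target, fake_drafter)
```

With real models, pass the objects returned by AutoModelForCausalLM.from_pretrained in place of the stand-ins.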
External Reference
Usage
Call these loaders at the start of any inference workflow. Load both the target and drafter models for standard speculative decoding, or only the target model for NASD, where n-gram storage replaces the drafter. Set device_map="cuda" for GPU inference. Pass QuantoConfig(weights="int8") as quantization_config to reduce weight memory on constrained hardware.
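To give a rough sense of what int8 quantization saves, the following back-of-the-envelope sketch estimates weight memory from parameter count and bytes per weight. The parameter counts are the nominal 3B and 1B figures from the description, not exact model sizes, and the estimate ignores quantization scales, activations, and the KV cache, so treat the numbers as lower bounds. `weight_memory_gib` is a hypothetical helper.

```python
# Bytes needed to store one weight in each format.
BYTES_PER_WEIGHT = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 2**30


# Nominal sizes (assumption): "3B" target and "1B" drafter.
TARGET_PARAMS = 3e9
DRAFTER_PARAMS = 1e9

for dtype in BYTES_PER_WEIGHT:
    total = weight_memory_gib(TARGET_PARAMS, dtype) + weight_memory_gib(DRAFTER_PARAMS, dtype)
    print(f"{dtype:>8}: ~{total:.1f} GiB for target + drafter weights")
```

Because both models sit on the same device, their footprints add up, which is why quantizing both can matter on a single constrained GPU.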
Code Reference
Source Location
- Repository: Speculative-Decoding
- File: infer.py (usage pattern)
- Lines: L89-112
Signature
# HuggingFace API (external); simplified to the keyword arguments this repository passes
AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path: str,
    quantization_config: Optional[QuantoConfig] = None,
    device_map: Optional[str] = None,
    trust_remote_code: bool = False,
    **kwargs,
) -> PreTrainedModel

AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path: str,
    trust_remote_code: bool = False,
    **kwargs,
) -> PreTrainedTokenizer
Import
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pretrained_model_name_or_path | str | Yes | HuggingFace model ID (e.g., "meta-llama/Llama-3.2-3B-Instruct") or local path |
| quantization_config | QuantoConfig | No | Quantization config. Use QuantoConfig(weights="int8") for int8 quantization, or None for full precision. |
| device_map | str | No | Device placement strategy. Use "cuda" for single GPU. |
| trust_remote_code | bool | No | Allow executing custom model code shipped in the model repository. Needed only for architectures not built into transformers; enable it only for repositories you trust. |
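One way to keep the target and drafter loads consistent is to build the keyword arguments in a single place. The sketch below assumes a hypothetical helper named `build_load_kwargs`; it is not part of the repository or of transformers. It uses only the parameters listed in the table above.

```python
from typing import Any, Dict


def build_load_kwargs(
    quantize: bool = False,
    device: str = "cuda",
    trust_remote_code: bool = False,
) -> Dict[str, Any]:
    """Assemble keyword arguments for AutoModelForCausalLM.from_pretrained."""
    kwargs: Dict[str, Any] = {
        "device_map": device,
        "trust_remote_code": trust_remote_code,
    }
    if quantize:
        # Imported lazily so building full-precision kwargs does not import transformers.
        from transformers import QuantoConfig

        kwargs["quantization_config"] = QuantoConfig(weights="int8")
    return kwargs


# Example (not executed here, since it downloads weights):
# target = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Llama-3.2-3B-Instruct", **build_load_kwargs(quantize=True)
# )
```

Passing the same dictionary to both loads guarantees that the target and drafter share device placement and quantization settings.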
Outputs
| Name | Type | Description |
|---|---|---|
| model | PreTrainedModel | Loaded model in eval mode. Access vocab size via model.config.vocab_size. |
| tokenizer | PreTrainedTokenizer | Loaded tokenizer with encode/decode/apply_chat_template methods. |
Usage Examples
Loading Target and Drafter for Speculative Decoding
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

# Target model (larger)
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    quantization_config=QuantoConfig(weights="int8"),
    device_map="cuda",
    trust_remote_code=True,
)
target.eval()

# Drafter model (smaller, same family)
drafter = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    quantization_config=QuantoConfig(weights="int8"),
    device_map="cuda",
    trust_remote_code=True,
)
drafter.eval()

# Shared tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    trust_remote_code=True,
)
Loading Target Only for NASD
from transformers import AutoModelForCausalLM, AutoTokenizer

# Only target model needed (n-gram storage replaces drafter)
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    device_map="cuda",
)
target.eval()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")