Environment:Romsto Speculative Decoding CUDA PyTorch

Knowledge Sources	Romsto Speculative-Decoding, PyTorch, HuggingFace Transformers
Domains	Infrastructure, LLMs, Speculative_Decoding
Last Updated	2026-02-14 04:30 GMT

Overview

Linux environment with CUDA-capable GPU, Python 3.7+, PyTorch 2.3.0, and HuggingFace Transformers 4.41.1 for speculative decoding inference.

Description

This environment provides the GPU-accelerated runtime required to run all speculative decoding generation strategies in this repository. It is built around PyTorch 2.3.0 with CUDA support and HuggingFace Transformers 4.41.1 for model loading (including optional int8 quantization via Quanto). The stack also includes bitsandbytes 0.43.1 for quantization support and accelerate 0.30.1 for device mapping. Terminal UI dependencies (rich, tqdm, termcolor) are needed for the interactive CLI.

Usage

Use this environment for any generation task in this repository: autoregressive decoding, speculative decoding, beam search, and N-gram assisted speculative decoding. It is the mandatory prerequisite for running the Speculative_Generate, Autoregressive_Generate, Ngram_Assisted_Speculative_Generate, LogitsProcessor_Hierarchy, Prune_Cache, AutoModelForCausalLM_From_Pretrained, and InferenceCLI implementations.

System Requirements

Category	Requirement	Notes
OS	Linux (Ubuntu recommended)	Windows not tested; macOS CPU-only
Hardware	NVIDIA GPU with CUDA support	Minimum ~8GB VRAM for int8-quantized 3B models; default models are Llama-3.2-3B + 1B
Python	Python >= 3.7	Stated in README; the pinned torch 2.3.0 and transformers 4.41.1 releases themselves require Python 3.8 or newer
Disk	~20GB free	Model weights for target + drafter downloaded from HuggingFace Hub
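
As a rough sanity check on the VRAM figure in the table above, the sketch below estimates weight memory only. It assumes nominal parameter counts of 3 billion and 1 billion (round numbers, not exact model sizes) and one byte per int8 weight. The helper name `weight_memory_gib` is hypothetical. The estimate leaves out the KV cache, activations, and any layers left unquantized, so the practical minimum is higher than the number it prints.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Return the approximate size of model weights in GiB."""
    return num_params * bytes_per_param / 2**30


# Nominal parameter counts (assumed round numbers, not exact model sizes).
TARGET_PARAMS = 3e9
DRAFTER_PARAMS = 1e9

# int8 stores one byte per weight; fp16/bf16 stores two.
int8_total = weight_memory_gib(TARGET_PARAMS + DRAFTER_PARAMS, 1)
fp16_total = weight_memory_gib(TARGET_PARAMS + DRAFTER_PARAMS, 2)

print(f"int8 weights: ~{int8_total:.1f} GiB")
print(f"fp16 weights: ~{fp16_total:.1f} GiB")
```

Under these assumptions, half-precision weights take twice the memory of int8 weights.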

Dependencies

System Packages

CUDA toolkit (compatible with PyTorch 2.3.0)
`git` (for cloning repository)

Python Packages

`torch` == 2.3.0 (pinned in requirements.txt)
`transformers` == 4.41.1
`tokenizers` == 0.19.1
`accelerate` == 0.30.1
`bitsandbytes` == 0.43.1
`rich` (any version)
`tqdm` (any version)
`termcolor` (any version)
`numpy` (installed transitively; a required dependency of transformers)
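
One way to confirm that an existing environment matches the pins above is to compare installed versions using the standard library. This is a minimal sketch, not part of the repository: the helper names `installed_version` and `find_mismatches` are hypothetical, and it assumes local build tags such as `+cu121` on CUDA wheels should be ignored when comparing.

```python
from importlib import metadata

# Pins copied from the package list above.
PINNED = {
    "torch": "2.3.0",
    "transformers": "4.41.1",
    "tokenizers": "0.19.1",
    "accelerate": "0.30.1",
    "bitsandbytes": "0.43.1",
}


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package that is missing or off-pin to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        # Strip a local build tag such as "+cu121" before comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

The script prints nothing when every pinned package is installed at the expected version.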

Credentials

The following credentials may be required at runtime:

`HF_TOKEN`: HuggingFace API token (Read access) — required if target/drafter models are gated (e.g., Llama models require acceptance of license terms on HuggingFace Hub).
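
A minimal sketch of checking for the token before any model download starts, using only the standard library. The helper `resolve_hf_token` is a hypothetical name, and treating a blank value as unset is an assumption of this sketch.

```python
import os


def resolve_hf_token(env=os.environ):
    """Return the HuggingFace token from the environment, or None if unset or blank."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


token = resolve_hf_token()
if token is None:
    print("HF_TOKEN is not set; downloads of gated models are expected to fail.")

# With transformers installed, the token could be passed explicitly, for example:
#   AutoModelForCausalLM.from_pretrained(model_id, token=token)
# where model_id is a placeholder for the target or drafter model name.
```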

Quick Install

# Clone the repository
git clone https://github.com/romsto/Speculative-Decoding.git
cd Speculative-Decoding

# Install all required packages
pip install -r requirements.txt

# Or install manually with pinned versions:
pip install torch==2.3.0 transformers==4.41.1 tokenizers==0.19.1 accelerate==0.30.1 bitsandbytes==0.43.1 rich tqdm termcolor

Code Evidence

Device selection from `infer.py:20` and `infer.py:395`:

class InferenceCLI:
    def __init__(self, device: str = "cuda"):
        ...
        self.device = device

# CLI argument default
parser.add_argument("--device", type=str, default="cuda", help="Device to use for inference")
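
Because the default device is `cuda`, starting on a machine without a usable GPU is likely to fail at model load. The sketch below shows one possible fallback and is not code from the repository. `pick_device` and `cuda_is_available` are hypothetical helpers, and the CUDA check is deferred so the function can also be exercised without PyTorch installed.

```python
def cuda_is_available() -> bool:
    """Report whether PyTorch can see a CUDA device; False if torch is missing."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def pick_device(requested: str = "cuda", cuda_available=None) -> str:
    """Return the requested device, falling back to "cpu" when CUDA is unavailable."""
    if cuda_available is None:
        cuda_available = cuda_is_available()
    if requested.startswith("cuda") and not cuda_available:
        return "cpu"
    return requested


if __name__ == "__main__":
    print(f"Selected device: {pick_device()}")
```

The result could then be handed to the constructor shown above, for example `InferenceCLI(device=pick_device(args.device))`, assuming `args` holds the parsed CLI arguments.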

Model loading with quantization from `infer.py:79-108`:

target_quantize = QuantoConfig(weights="int8")  # QuantoConfig(weights="int8")  None
drafter_quantize = QuantoConfig(weights="int8")  # QuantoConfig(weights="int8") None

self.target = AutoModelForCausalLM.from_pretrained(
    target_model,
    quantization_config=target_quantize,
    device_map=self.device,
    trust_remote_code=True,
)

Pinned dependency versions from `requirements.txt`:

torch==2.3.0
transformers==4.41.1
tokenizers==0.19.1
accelerate==0.30.1
bitsandbytes==0.43.1

Max sequence length detection from `sampling/speculative_decoding.py:77`:

max_seq_length = target.config.max_position_embeddings if hasattr(target.config, 'max_position_embeddings') else (target.config.max_context_length if hasattr(target.config, 'max_context_length') else 1024)
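
The nested conditional above is easier to follow when unrolled. The sketch below mirrors the same fallback order in a hypothetical helper, `get_max_seq_length`, and runs it on stand-in config objects with made-up values instead of real model configs.

```python
from types import SimpleNamespace


def get_max_seq_length(config, default: int = 1024) -> int:
    """Fallback chain: max_position_embeddings, then max_context_length, then default."""
    if hasattr(config, "max_position_embeddings"):
        return config.max_position_embeddings
    if hasattr(config, "max_context_length"):
        return config.max_context_length
    return default


# Stand-in configs with illustrative values (not taken from any real model).
with_position_embeddings = SimpleNamespace(max_position_embeddings=131072)
with_context_length = SimpleNamespace(max_context_length=2048)
with_neither = SimpleNamespace()

print(
    get_max_seq_length(with_position_embeddings),
    get_max_seq_length(with_context_length),
    get_max_seq_length(with_neither),
)
# prints: 131072 2048 1024
```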

Common Errors

Error Message	Cause	Solution
`RuntimeError: CUDA out of memory`	Target + drafter models exceed GPU VRAM	Use `QuantoConfig(weights="int8")` to quantize models, or use smaller models
`OSError: ... is a gated repo`	Llama or other gated model requires HuggingFace authentication	Accept the model license on the HuggingFace Hub, then run `huggingface-cli login` with a token (`HF_TOKEN`) from that account
`ImportError: No module named 'bitsandbytes'`	bitsandbytes not installed	`pip install bitsandbytes==0.43.1`
`AssertionError: Prompt length exceeds maximum sequence length`	Input tokens exceed model's `max_position_embeddings`	Reduce prompt length or use a model with larger context window

Compatibility Notes

CUDA by default: The default device is `cuda`. CPU inference is possible by passing `--device cpu` but will be significantly slower.
Model pairing: Target and drafter models must share the same tokenizer and output the same logits shape. The default pairing is Llama-3.2-3B-Instruct (target) + Llama-3.2-1B-Instruct (drafter).
Quantization: int8 quantization via QuantoConfig is optional. Set `target_quantize = None` and `drafter_quantize = None` in `infer.py` to disable it. Running without quantization requires more VRAM.
Encoder-decoder models: Separate `codec_speculative_decoding.py` and `codec_base_decoding.py` modules exist for encoder-decoder architectures but are secondary.
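
The model pairing requirement can be partially checked before loading any weights. The sketch below compares the `vocab_size` reported by two model configs. `check_pairing` is a hypothetical helper, not a function from the repository. Matching vocabulary sizes are a necessary condition for matching logits shapes, but they do not prove that both models use the same tokenizer.

```python
from types import SimpleNamespace


def check_pairing(target_config, drafter_config) -> None:
    """Raise ValueError if the two configs report different vocabulary sizes."""
    target_vocab = getattr(target_config, "vocab_size", None)
    drafter_vocab = getattr(drafter_config, "vocab_size", None)
    if target_vocab is None or drafter_vocab is None:
        raise ValueError("Both configs must expose vocab_size")
    if target_vocab != drafter_vocab:
        raise ValueError(
            f"Vocabulary mismatch: target={target_vocab}, drafter={drafter_vocab}"
        )


# Stand-in configs with illustrative values; real ones would come from the loaded models.
check_pairing(SimpleNamespace(vocab_size=32000), SimpleNamespace(vocab_size=32000))
```

With real models, the arguments would be `target.config` and `drafter.config`.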
