Environment:Romsto Speculative Decoding CUDA PyTorch

From Leeroopedia
Knowledge Sources
Domains Infrastructure, LLMs, Speculative_Decoding
Last Updated 2026-02-14 04:30 GMT

Overview

Linux environment with a CUDA-capable GPU, Python 3.7+, PyTorch 2.3.0, and HuggingFace Transformers 4.41.1 for speculative decoding inference.

Description

This environment provides the GPU-accelerated runtime required to run all speculative decoding generation strategies in this repository. It is built around PyTorch 2.3.0 with CUDA support and HuggingFace Transformers 4.41.1 for model loading (including optional int8 quantization via Quanto). The stack also includes bitsandbytes 0.43.1 for quantization support and accelerate 0.30.1 for device mapping. Terminal UI dependencies (rich, tqdm, termcolor) are needed for the interactive CLI.

Usage

Use this environment for any generation task in this repository: autoregressive decoding, speculative decoding, beam search, and N-gram assisted speculative decoding. It is the mandatory prerequisite for running the Speculative_Generate, Autoregressive_Generate, Ngram_Assisted_Speculative_Generate, LogitsProcessor_Hierarchy, Prune_Cache, AutoModelForCausalLM_From_Pretrained, and InferenceCLI implementations.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | Windows not tested; macOS CPU-only |
| Hardware | NVIDIA GPU with CUDA support | Minimum ~8 GB VRAM for int8-quantized 3B models; default models are Llama-3.2-3B + 1B |
| Python | Python >= 3.7 | Stated in README; the pinned torch 2.3.0 itself requires Python 3.8 or newer |
| Disk | ~20 GB free | Model weights for target + drafter downloaded from HuggingFace Hub |
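
The VRAM figure can be sanity-checked with a back-of-the-envelope estimate of weight memory. The sketch below is illustrative only: `weights_gib` is a hypothetical helper, the parameter counts are nominal (3e9 and 1e9, whereas real checkpoints differ slightly), and it ignores the KV cache, activations, and CUDA runtime overhead, all of which add to the total.

```python
def weights_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Nominal parameter counts (assumption): real checkpoints differ slightly.
TARGET_PARAMS = 3e9   # "3B" target
DRAFTER_PARAMS = 1e9  # "1B" drafter

for label, bits in (("int8", 8), ("fp16", 16)):
    total = weights_gib(TARGET_PARAMS, bits) + weights_gib(DRAFTER_PARAMS, bits)
    print(f"{label}: ~{total:.1f} GiB for the weights of both models")
```

Under these assumptions the weights alone come to roughly 3.7 GiB at int8 and 7.5 GiB at fp16, which is consistent with ~8 GB being a comfortable minimum only when quantization is enabled.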

Dependencies

System Packages

  • CUDA toolkit (compatible with PyTorch 2.3.0)
  • `git` (for cloning repository)

Python Packages

  • `torch` == 2.3.0 (pinned in requirements.txt)
  • `transformers` == 4.41.1
  • `tokenizers` == 0.19.1
  • `accelerate` == 0.30.1
  • `bitsandbytes` == 0.43.1
  • `rich` (any version)
  • `tqdm` (any version)
  • `termcolor` (any version)
  • `numpy` (transitive dependency of torch)
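
Because most of these packages are pinned, it can help to verify an existing environment before running anything. The following is a minimal sketch using only the standard library; `PINNED`, `installed_version`, and `find_mismatches` are names introduced here for illustration and are not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the list above.
PINNED = {
    "torch": "2.3.0",
    "transformers": "4.41.1",
    "tokenizers": "0.19.1",
    "accelerate": "0.30.1",
    "bitsandbytes": "0.43.1",
}


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package that is missing or off-pin to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        # Ignore local build tags such as "+cu121" carried by CUDA wheels of torch.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

The script prints nothing when every pinned package matches. The `lookup` parameter exists only so the comparison logic can be exercised without touching the real environment.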

Credentials

The following credentials may be required at runtime:

  • `HF_TOKEN`: HuggingFace API token (Read access) — required if target/drafter models are gated (e.g., Llama models require acceptance of license terms on HuggingFace Hub).
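
One way to supply the token is to read it from the environment and pass it explicitly when loading a model. This is a sketch, not how `infer.py` does it: `resolve_hf_token` and `load_gated_model` are hypothetical helpers, and the only library call assumed is the `token` argument of `from_pretrained`.

```python
import os


def resolve_hf_token(env=os.environ):
    """Return the HF_TOKEN value from the environment, or None if unset or blank."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def load_gated_model(model_id: str, device: str = "cuda"):
    """Load a (possibly gated) causal LM, passing the token explicitly."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        token=resolve_hf_token(),  # None falls back to any cached login
        device_map=device,
    )
```

The token only grants access once the account that owns it has accepted the model's license on the Hub.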

Quick Install

# Clone the repository
git clone https://github.com/romsto/Speculative-Decoding.git
cd Speculative-Decoding

# Install all required packages
pip install -r requirements.txt

# Or install manually with pinned versions:
pip install torch==2.3.0 transformers==4.41.1 tokenizers==0.19.1 accelerate==0.30.1 bitsandbytes==0.43.1 rich tqdm termcolor

Code Evidence

Device selection from `infer.py:20` and `infer.py:395`:

class InferenceCLI:
    def __init__(self, device: str = "cuda"):
        ...
        self.device = device

# CLI argument default
parser.add_argument("--device", type=str, default="cuda", help="Device to use for inference")
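
Since the default is `cuda`, a machine without a usable GPU will fail at model load unless `--device cpu` is passed. A defensive fallback could look like the sketch below; `pick_device` is a hypothetical helper and is not present in `infer.py`.

```python
def pick_device(requested: str = "cuda", cuda_available=None) -> str:
    """Return `requested`, falling back to "cpu" when CUDA is asked for but absent."""
    if cuda_available is None:
        # Query PyTorch only when the caller did not supply the answer.
        import torch

        cuda_available = torch.cuda.is_available()
    if requested.startswith("cuda") and not cuda_available:
        print("CUDA not available; falling back to CPU (expect much slower generation).")
        return "cpu"
    return requested
```

The result could then be handed to the constructor, for example `InferenceCLI(device=pick_device(args.device))`, assuming `args` holds the parsed CLI arguments.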

Model loading with quantization from `infer.py:79-108`:

target_quantize = QuantoConfig(weights="int8")  # set to None to disable quantization
drafter_quantize = QuantoConfig(weights="int8")  # set to None to disable quantization

self.target = AutoModelForCausalLM.from_pretrained(
    target_model,
    quantization_config=target_quantize,
    device_map=self.device,
    trust_remote_code=True,
)

Pinned dependency versions from `requirements.txt`:

torch==2.3.0
transformers==4.41.1
tokenizers==0.19.1
accelerate==0.30.1
bitsandbytes==0.43.1

Max sequence length detection from `sampling/speculative_decoding.py:77`:

max_seq_length = target.config.max_position_embeddings if hasattr(target.config, 'max_position_embeddings') else (target.config.max_context_length if hasattr(target.config, 'max_context_length') else 1024)
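
The nested conditional above reads more easily as an ordered fallback. The sketch below is an equivalent formulation for illustration; `resolve_max_seq_length` is a hypothetical name, and the `SimpleNamespace` objects stand in for real `model.config` instances.

```python
from types import SimpleNamespace


def resolve_max_seq_length(config, default: int = 1024) -> int:
    """Prefer max_position_embeddings, then max_context_length, then `default`."""
    for attr in ("max_position_embeddings", "max_context_length"):
        if hasattr(config, attr):
            return getattr(config, attr)
    return default


# Stand-in configs (assumption): real ones come from `model.config`.
llama_like = SimpleNamespace(max_position_embeddings=131072)
legacy = SimpleNamespace(max_context_length=2048)
bare = SimpleNamespace()

for cfg in (llama_like, legacy, bare):
    print(resolve_max_seq_length(cfg))
```

A config exposing neither attribute falls through to 1024, matching the hard-coded default in the original expression.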

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: CUDA out of memory` | Target + drafter models exceed GPU VRAM | Use `QuantoConfig(weights="int8")` to quantize models, or use smaller models |
| `OSError: ... is a gated repo` | Llama or other gated model requires HuggingFace authentication | Accept the model license on HuggingFace Hub, then run `huggingface-cli login` with a valid token from that account (or set `HF_TOKEN`) |
| `ModuleNotFoundError: No module named 'bitsandbytes'` | bitsandbytes not installed | `pip install bitsandbytes==0.43.1` |
| `AssertionError: Prompt length exceeds maximum sequence length` | Input tokens exceed model's `max_position_embeddings` | Reduce prompt length or use a model with a larger context window |

Compatibility Notes

  • CUDA by default: The default device is `cuda`. CPU inference is possible by passing `--device cpu`, but it will be significantly slower.
  • Model pairing: Target and drafter models must share the same tokenizer and output the same logits shape. The default pairing is Llama-3.2-3B-Instruct (target) + Llama-3.2-1B-Instruct (drafter).
  • Quantization: int8 quantization via QuantoConfig is optional. Set `target_quantize = None` and `drafter_quantize = None` in `infer.py` to disable it; the unquantized models need more VRAM.
  • Encoder-decoder models: Separate `codec_speculative_decoding.py` and `codec_base_decoding.py` modules exist for encoder-decoder architectures but are secondary.
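
The model-pairing requirement can be checked before loading any weights by comparing the two tokenizers' vocabularies. The sketch below is an assumption about how one might do this, not code from the repository: `vocab_mismatches` and `check_pair` are hypothetical helpers, and the only library calls used are `AutoTokenizer.from_pretrained` and `get_vocab`.

```python
def vocab_mismatches(target_vocab: dict, drafter_vocab: dict, limit: int = 5) -> list:
    """Return up to `limit` tokens whose ids differ or that exist in only one vocabulary."""
    problems = []
    for token in sorted(set(target_vocab) | set(drafter_vocab)):
        if target_vocab.get(token) != drafter_vocab.get(token):
            problems.append(token)
            if len(problems) >= limit:
                break
    return problems


def check_pair(target_id: str, drafter_id: str) -> list:
    """Compare the vocabularies of two Hub checkpoints (downloads both tokenizers)."""
    from transformers import AutoTokenizer  # lazy import; needs network access

    target_tok = AutoTokenizer.from_pretrained(target_id)
    drafter_tok = AutoTokenizer.from_pretrained(drafter_id)
    return vocab_mismatches(target_tok.get_vocab(), drafter_tok.get_vocab())
```

An empty list is a necessary condition rather than a guarantee: the logits shape also depends on each model's configured vocabulary size, which this check does not inspect.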
