Environment:Romsto Speculative Decoding CUDA PyTorch
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, LLMs, Speculative_Decoding |
| Last Updated | 2026-02-14 04:30 GMT |
Overview
Linux environment with CUDA-capable GPU, Python 3.7+, PyTorch 2.3.0, and HuggingFace Transformers 4.41.1 for speculative decoding inference.
Description
This environment provides the GPU-accelerated runtime required to run all speculative decoding generation strategies in this repository. It is built around PyTorch 2.3.0 with CUDA support and HuggingFace Transformers 4.41.1 for model loading (including optional int8 quantization via Quanto). The stack also includes bitsandbytes 0.43.1 for quantization support and accelerate 0.30.1 for device mapping. Terminal UI dependencies (rich, tqdm, termcolor) are needed for the interactive CLI.
Usage
Use this environment for any generation task in this repository: autoregressive decoding, speculative decoding, beam search, and N-gram assisted speculative decoding. It is the mandatory prerequisite for running the Speculative_Generate, Autoregressive_Generate, Ngram_Assisted_Speculative_Generate, LogitsProcessor_Hierarchy, Prune_Cache, AutoModelForCausalLM_From_Pretrained, and InferenceCLI implementations.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | Windows not tested; macOS CPU-only |
| Hardware | NVIDIA GPU with CUDA support | Minimum ~8 GB VRAM for int8-quantized 3B models; default models are Llama-3.2-3B + 1B |
| Python | Python >= 3.7 | Stated in README; the pinned torch 2.3.0 and transformers 4.41.1 releases themselves require Python 3.8 or newer |
| Disk | ~20 GB free | Model weights for target + drafter downloaded from HuggingFace Hub |
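As a rough sanity check on the VRAM figure in the table, the weight footprint of the default pair can be estimated from parameter count times bytes per parameter. This is a minimal sketch: the nominal counts of 3e9 and 1e9 parameters are assumptions read off the model names, and the estimate covers weights only, so the KV cache, activations, and CUDA context come on top of it.

```python
# Weights-only memory estimate for the default target + drafter pair.
# Assumption: nominal parameter counts taken from the model names (3B, 1B);
# the real checkpoints differ slightly.
NOMINAL_PARAMS = {"target (3B)": 3e9, "drafter (1B)": 1e9}


def weight_gib(n_params: float, bytes_per_param: float) -> float:
    """Size of the weights in GiB for a given storage width."""
    return n_params * bytes_per_param / 2**30


def total_weight_gib(bytes_per_param: float) -> float:
    """Combined weight size of both models in GiB."""
    return sum(weight_gib(n, bytes_per_param) for n in NOMINAL_PARAMS.values())


int8_total = total_weight_gib(1)  # int8: 1 byte per parameter
fp16_total = total_weight_gib(2)  # fp16/bf16: 2 bytes per parameter

print(f"int8 weights: {int8_total:.2f} GiB")
print(f"fp16 weights: {fp16_total:.2f} GiB")
```

The gap between the int8 estimate and the ~8 GB minimum is headroom for everything that is not a weight.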
Dependencies
System Packages
- CUDA toolkit (compatible with PyTorch 2.3.0)
- `git` (for cloning repository)
Python Packages
- `torch` == 2.3.0 (pinned in requirements.txt)
- `transformers` == 4.41.1
- `tokenizers` == 0.19.1
- `accelerate` == 0.30.1
- `bitsandbytes` == 0.43.1
- `rich` (any version)
- `tqdm` (any version)
- `termcolor` (any version)
- `numpy` (transitive dependency of torch)
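The pins above can be checked against the active environment before running anything heavier. The following is a minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8); `check_pins` and `PINNED` are hypothetical names for illustration and are not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the package list above.
PINNED = {
    "torch": "2.3.0",
    "transformers": "4.41.1",
    "tokenizers": "0.19.1",
    "accelerate": "0.30.1",
    "bitsandbytes": "0.43.1",
}


def check_pins(pins=PINNED, get_version=version):
    """Return {package: (expected, found)} for each mismatch or missing package."""
    problems = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None
        # Local build tags such as "2.3.0+cu121" still count as a match.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins().items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package is installed at the expected version.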
Credentials
The following credentials may be required at runtime:
- `HF_TOKEN`: HuggingFace API token (Read access). Required if the target or drafter model is gated; Llama models, for example, require accepting the license terms on HuggingFace Hub first.
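A small helper can make the "token may be required" case explicit instead of failing deep inside a download. This is only a sketch: `resolve_hf_token` is a hypothetical name, and the repository may rely entirely on `huggingface-cli login` rather than reading the variable itself.

```python
import os


def resolve_hf_token(env=os.environ):
    """Return the HF_TOKEN value with surrounding whitespace removed, or None if unset or empty."""
    token = env.get("HF_TOKEN", "").strip()
    return token or None


if __name__ == "__main__":
    if resolve_hf_token() is None:
        print("HF_TOKEN is not set; gated models will fail to download.")
```

The returned value can be passed as the `token` argument of `from_pretrained` in Transformers; `None` leaves the library's own credential lookup in charge.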
Quick Install
# Clone the repository
git clone https://github.com/romsto/Speculative-Decoding.git
cd Speculative-Decoding
# Install all required packages
pip install -r requirements.txt
# Or install manually with pinned versions:
pip install torch==2.3.0 transformers==4.41.1 tokenizers==0.19.1 accelerate==0.30.1 bitsandbytes==0.43.1 rich tqdm termcolor
Code Evidence
Device selection from `infer.py:20` and `infer.py:395`:
class InferenceCLI:
    def __init__(self, device: str = "cuda"):
        ...
        self.device = device

# CLI argument default
parser.add_argument("--device", type=str, default="cuda", help="Device to use for inference")
Model loading with quantization from `infer.py:79-108`:
target_quantize = QuantoConfig(weights="int8")  # set to None to disable quantization
drafter_quantize = QuantoConfig(weights="int8")  # set to None to disable quantization
self.target = AutoModelForCausalLM.from_pretrained(
    target_model,
    quantization_config=target_quantize,
    device_map=self.device,
    trust_remote_code=True,
)
Pinned dependency versions from `requirements.txt`:
torch==2.3.0
transformers==4.41.1
tokenizers==0.19.1
accelerate==0.30.1
bitsandbytes==0.43.1
Max sequence length detection from `sampling/speculative_decoding.py:77`:
max_seq_length = target.config.max_position_embeddings if hasattr(target.config, 'max_position_embeddings') else (target.config.max_context_length if hasattr(target.config, 'max_context_length') else 1024)
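The nested conditional above reads more easily as an ordered fallback. The sketch below is an equivalent restatement, not the repository's code: `resolve_max_seq_length` is a hypothetical helper, and `SimpleNamespace` objects stand in for model configs so the example runs without loading a model.

```python
from types import SimpleNamespace


def resolve_max_seq_length(config, default=1024):
    """Return the first context-length attribute present on config, else default."""
    # Same precedence as the one-liner: max_position_embeddings first,
    # then max_context_length, then the hard-coded fallback of 1024.
    for attr in ("max_position_embeddings", "max_context_length"):
        if hasattr(config, attr):
            return getattr(config, attr)
    return default


# Stand-in configs for illustration only.
with_positions = SimpleNamespace(max_position_embeddings=4096, max_context_length=2048)
with_context_only = SimpleNamespace(max_context_length=2048)
bare = SimpleNamespace()

print(resolve_max_seq_length(with_positions))
print(resolve_max_seq_length(with_context_only))
print(resolve_max_seq_length(bare))
```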
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: CUDA out of memory` | Target + drafter models exceed GPU VRAM | Use `QuantoConfig(weights="int8")` to quantize models, or use smaller models |
| `OSError: ... is a gated repo` | Llama or another gated model requires HuggingFace authentication | Accept the model license on its HuggingFace Hub page, then run `huggingface-cli login` (or set `HF_TOKEN`) with a token from that account |
| `ModuleNotFoundError: No module named 'bitsandbytes'` | bitsandbytes is not installed | `pip install bitsandbytes==0.43.1` |
| `AssertionError: Prompt length exceeds maximum sequence length` | Input tokens exceed model's `max_position_embeddings` | Reduce prompt length or use a model with larger context window |
Compatibility Notes
- CUDA by default: The default device is `cuda`. CPU inference is possible by passing `--device cpu`, but it will be significantly slower.
- Model pairing: Target and drafter models must share the same tokenizer and output the same logits shape. The default pairing is Llama-3.2-3B-Instruct (target) + Llama-3.2-1B-Instruct (drafter).
- Quantization: int8 quantization via `QuantoConfig` is optional. Set `target_quantize = None` and `drafter_quantize = None` in `infer.py` to disable it; the unquantized models need more VRAM.
- Encoder-decoder models: Separate `codec_speculative_decoding.py` and `codec_base_decoding.py` modules exist for encoder-decoder architectures but are secondary.
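For the device note above, a wrapper script might fall back to CPU when CUDA is unavailable instead of failing at model load. This is a sketch under stated assumptions: `pick_device` is a hypothetical helper, the repository itself simply defaults `--device` to `cuda`, and the `cuda_available` parameter exists only so the logic can be exercised without a GPU or a PyTorch install.

```python
def pick_device(requested: str = "cuda", cuda_available=None) -> str:
    """Return the requested device, downgrading CUDA requests to "cpu" when CUDA is unavailable."""
    if cuda_available is None:
        try:
            import torch  # imported lazily so the helper also works without PyTorch

            cuda_available = torch.cuda.is_available()
        except ImportError:
            cuda_available = False
    if requested.startswith("cuda") and not cuda_available:
        return "cpu"
    return requested


if __name__ == "__main__":
    print(f"Using device: {pick_device()}")
```

The result can be passed straight to the CLI's `--device` argument.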
Related Pages
- Implementation:Romsto_Speculative_Decoding_Speculative_Generate
- Implementation:Romsto_Speculative_Decoding_Prune_Cache
- Implementation:Romsto_Speculative_Decoding_LogitsProcessor_Hierarchy
- Implementation:Romsto_Speculative_Decoding_Ngram_Assisted_Speculative_Generate
- Implementation:Romsto_Speculative_Decoding_Autoregressive_Generate
- Implementation:Romsto_Speculative_Decoding_AutoModelForCausalLM_From_Pretrained
- Implementation:Romsto_Speculative_Decoding_InferenceCLI
- Implementation:Romsto_Speculative_Decoding_Autoregressive_Generate_Encoder_Decoder
- Implementation:Romsto_Speculative_Decoding_Speculative_Generate_Encoder_Decoder