
Environment: Lucidrains x-transformers / PyTorch / CUDA

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Deep Learning
Last Updated: 2026-02-08 18:00 GMT

Overview

PyTorch 2.0+ environment with CUDA support for GPU-accelerated transformer training and inference.

Description

This environment provides the core runtime for all x-transformers model training and inference. It requires PyTorch 2.0 or newer as the base deep learning framework. CUDA-capable GPUs are strongly recommended for practical training speeds, though CPU execution is supported for small-scale experiments. The library leverages PyTorch's native scaled dot-product attention (SDPA) and optionally supports the flash-attn package for packed sequence processing on Ampere+ GPUs.

Usage

Use this environment for all x-transformers workflows: autoregressive language modeling, encoder-decoder sequence-to-sequence, non-autoregressive masked generation, and DPO preference alignment. Every Implementation page in the wiki requires this environment as a prerequisite.

System Requirements

  • OS: Linux (Ubuntu 20.04+). macOS and Windows are supported for development; Linux is recommended for training.
  • Hardware: NVIDIA GPU recommended. CPU fallback is supported; SM80+ (Ampere/Hopper) is required for flash-attn packed sequences.
  • VRAM: 8GB+ recommended. Depends on model size; small demo models (dim=512, depth=6) fit in 4GB.
  • Python: >= 3.9, as specified in pyproject.toml.
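A small helper in the spirit of the requirements above can pick a device and warn on low VRAM. This is an illustrative sketch, not part of the library; the 4 GB floor mirrors the demo-model note above:

```python
import torch

def pick_device(min_vram_gb: float = 4.0) -> torch.device:
    """Prefer CUDA when available; warn if VRAM is below a demo-model floor."""
    if not torch.cuda.is_available():
        return torch.device('cpu')
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024 ** 3
    if vram_gb < min_vram_gb:
        print(f'warning: {props.name} has only {vram_gb:.1f} GB VRAM')
    return torch.device('cuda')

print(pick_device())
```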

Dependencies

Core Packages

  • `torch` >= 2.0
  • `einx` >= 0.3.0
  • `einops` >= 0.8.0
  • `loguru`
  • `packaging` >= 21.0

Optional Packages

  • `flash-attn` >= 2.0 (for packed sequence flash attention with block masking)
  • `adam-atan2-pytorch` >= 0.2.2 (for example training scripts)
  • `lion-pytorch` (alternative optimizer for examples)
  • `tqdm` (progress bars for examples)
  • `pytest` (for running tests)

Credentials

No credentials are required. Optional integrations:

  • `WANDB_API_KEY`: Weights & Biases API key for experiment tracking (used in train_enwik8.py example)

Quick Install

# Install core package
pip install x-transformers

# Install with example script dependencies
pip install "x-transformers[examples]"

# Install with flash attention packed sequence support (requires CUDA)
pip install "x-transformers[flash-pack-seq]"

# Install everything
pip install "x-transformers[examples,flash-pack-seq]"
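After installing, a quick smoke check confirms the core stack imports cleanly. The `smoke_check` helper below is an illustrative sketch, not part of the library:

```python
def smoke_check() -> bool:
    """Return True when torch and x_transformers both import cleanly."""
    try:
        import torch
        import x_transformers  # noqa: F401
    except ImportError as exc:
        print(f'missing dependency: {exc.name}')
        return False
    print(f'torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}')
    return True

print(smoke_check())
```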

Code Evidence

PyTorch 2.0 version check for flash attention from `attend.py:305-306`:

torch_version = version.parse(torch.__version__)
assert not (flash and torch_version < version.parse('2.0.0')), 'in order to use flash attention, you must be using pytorch 2.0 or above'
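The gate above can be exercised directly with `packaging.version` (a core dependency). The `flash_supported` helper is an illustrative wrapper around the same comparison, not library code:

```python
from packaging import version

def flash_supported(torch_version_str: str) -> bool:
    # mirrors the attend.py assertion: flash attention needs torch >= 2.0
    return version.parse(torch_version_str) >= version.parse('2.0.0')

assert not flash_supported('1.13.1')
assert flash_supported('2.0.0')
assert flash_supported('2.4.0+cu121')  # local version segments parse fine
```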

GPU capability check for flash-attn packed sequences from `attend.py:315-316`:

major, minor = torch.cuda.get_device_capability()
assert major >= 8, f"block masking with Flash Attention requires SM80+ (Ampere or newer) GPUs, but your GPU has SM{major}{minor}."
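The same gate can be tested without a GPU by passing the capability tuple in directly. `check_sm80` is a hypothetical stand-in that raises instead of asserting:

```python
def check_sm80(capability: tuple) -> None:
    """Mirror the attend.py gate: packed-sequence flash attention needs SM80+."""
    major, minor = capability
    if major < 8:
        raise RuntimeError(
            f'block masking with Flash Attention requires SM80+, got SM{major}{minor}'
        )

check_sm80((8, 0))   # A100 (Ampere): accepted
check_sm80((9, 0))   # H100 (Hopper): accepted
try:
    check_sm80((7, 5))  # T4 (Turing): rejected
except RuntimeError as exc:
    print(exc)
```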

flash-attn package import with helpful error from `attend.py:310-314`:

try:
    from flash_attn import flash_attn_varlen_func
    self.flash_attn_varlen_func = flash_attn_varlen_func
except ImportError:
    raise ImportError("block masking with Flash Attention requires the flash-attn package. Please install it with `pip install flash-attn`.")

PyTorch 2.3+ API adaptation for SDP backends from `attend.py:318-333`:

# torch 2.3 uses new backend and context manager
if torch_version >= version.parse('2.3'):
    from torch.nn.attention import SDPBackend
    str_to_backend = dict(
        enable_flash = SDPBackend.FLASH_ATTENTION,
        enable_mem_efficient = SDPBackend.EFFICIENT_ATTENTION,
        enable_math = SDPBackend.MATH,
        enable_cudnn = SDPBackend.CUDNN_ATTENTION
    )
    sdpa_backends = [str_to_backend[enable_str] for enable_str, enable in sdp_kwargs.items() if enable]
    self.sdp_context_manager = partial(torch.nn.attention.sdpa_kernel, sdpa_backends)
else:
    self.sdp_context_manager = partial(torch.backends.cuda.sdp_kernel, **sdp_kwargs)
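The comprehension in the 2.3+ branch simply filters the enabled flags down to a backend list. The sketch below replays that logic with a stand-in enum (`FakeBackend` is hypothetical, replacing `torch.nn.attention.SDPBackend`) so it runs without CUDA:

```python
from enum import Enum, auto

class FakeBackend(Enum):  # stand-in for torch.nn.attention.SDPBackend
    FLASH_ATTENTION = auto()
    EFFICIENT_ATTENTION = auto()
    MATH = auto()
    CUDNN_ATTENTION = auto()

str_to_backend = dict(
    enable_flash = FakeBackend.FLASH_ATTENTION,
    enable_mem_efficient = FakeBackend.EFFICIENT_ATTENTION,
    enable_math = FakeBackend.MATH,
    enable_cudnn = FakeBackend.CUDNN_ATTENTION,
)

# same comprehension as attend.py: keep only the backends the caller enabled
sdp_kwargs = dict(enable_flash = True, enable_mem_efficient = True,
                  enable_math = False, enable_cudnn = False)
backends = [str_to_backend[k] for k, enabled in sdp_kwargs.items() if enabled]
print(backends)
```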

CUDA device requirement in training script from `train_enwik8.py:62`:

model = AutoregressiveWrapper(model).cuda()

Device detection fallback from `train_copy.py:15`:

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

Common Errors

  • `in order to use flash attention, you must be using pytorch 2.0 or above`: raised when `attn_flash=True` with PyTorch < 2.0. Fix: upgrade PyTorch with `pip install "torch>=2.0"` (quote the specifier so the shell does not interpret `>`).
  • `block masking with Flash Attention requires SM80+ (Ampere or newer) GPUs`: GPU compute capability below 8.0 with flash packed sequences enabled. Fix: use an SM80+ GPU (A100, RTX 3090 or newer) or disable flash packed sequences.
  • `block masking with Flash Attention requires the flash-attn package`: flash-attn not installed when using packed sequence block masking. Fix: `pip install flash-attn`.
  • `ImportError: No module named 'einx'`: missing core dependency. Fix: `pip install x-transformers` (installs all core dependencies).

Compatibility Notes

  • PyTorch 2.3+: Uses new `torch.nn.attention.SDPBackend` API for SDP kernel selection. Older versions use `torch.backends.cuda.sdp_kernel`.
  • ONNX Export: Set `onnxable=True` on the Attend class to use an alternative causal mask implementation (avoids `.triu()` which ONNX CPU does not support).
  • Mixed Precision (AMP): Rotary and polar positional embeddings explicitly disable autocast (`@autocast('cuda', enabled=False)`) to maintain FP32 precision for frequency calculations.
  • CPU Training: Supported for small models. Training scripts like `train_copy.py` auto-detect CUDA availability.
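The AMP note above uses `torch.autocast` as a decorator. A sketch of the pattern, applied to a simplified rotary-frequency computation (the function body is illustrative, not the library's actual rotary code):

```python
import torch
from torch import autocast

@autocast('cuda', enabled = False)  # keep frequency math in fp32 under AMP
def rotary_freqs(seq_len: int, dim: int) -> torch.Tensor:
    """Simplified rotary frequency table; stays float32 even inside autocast."""
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(seq_len).float()
    # outer product: one row of phase angles per position
    return torch.einsum('i,j->ij', t, inv_freq)
```

With `enabled=False`, the decorator is a no-op on CPU and forces full precision inside a surrounding `autocast` region on GPU.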
