# Environment: Lucidrains x-transformers PyTorch CUDA
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning |
| Last Updated | 2026-02-08 18:00 GMT |
## Overview
PyTorch 2.0+ environment with CUDA support for GPU-accelerated transformer training and inference.
## Description
This environment provides the core runtime for all x-transformers model training and inference. It requires PyTorch 2.0 or newer as the base deep learning framework. CUDA-capable GPUs are strongly recommended for practical training speeds, though CPU execution is supported for small-scale experiments. The library leverages PyTorch's native scaled dot-product attention (SDPA) and optionally supports the flash-attn package for packed sequence processing on Ampere+ GPUs.
## Usage
Use this environment for all x-transformers workflows: autoregressive language modeling, encoder-decoder sequence-to-sequence, non-autoregressive masked generation, and DPO preference alignment. Every Implementation page in the wiki requires this environment as a prerequisite.
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | macOS and Windows supported for development; Linux recommended for training |
| Hardware | NVIDIA GPU (recommended) | CPU fallback supported; SM80+ (Ampere/Hopper) required for flash-attn packed sequences |
| VRAM | 8GB+ recommended | Depends on model size; small demo models (dim=512, depth=6) fit in 4GB |
| Python | >= 3.9 | Specified in pyproject.toml |
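The hardware rows above can be checked programmatically. A small sketch using only PyTorch and `packaging` (the helper name `check_environment` is illustrative, not a library function):

```python
import torch
from packaging import version

def check_environment():
    # Mirrors the requirements table: PyTorch >= 2.0, optional CUDA,
    # and SM80+ (Ampere/Hopper) for flash-attn packed sequences
    info = {
        'torch_ok': version.parse(torch.__version__) >= version.parse('2.0.0'),
        'cuda': torch.cuda.is_available(),
        'flash_pack_seq_capable': False,
    }
    if info['cuda']:
        major, _minor = torch.cuda.get_device_capability()
        info['flash_pack_seq_capable'] = major >= 8
    return info

print(check_environment())
```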
## Dependencies

### Core Packages
- `torch` >= 2.0
- `einx` >= 0.3.0
- `einops` >= 0.8.0
- `loguru`
- `packaging` >= 21.0
### Optional Packages
- `flash-attn` >= 2.0 (for packed sequence flash attention with block masking)
- `adam-atan2-pytorch` >= 0.2.2 (for example training scripts)
- `lion-pytorch` (alternative optimizer for examples)
- `tqdm` (progress bars for examples)
- `pytest` (for running tests)
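Because `flash-attn` is optional, downstream code often guards its import (a common pattern shown for illustration; the library itself raises a hard error instead, as shown under Code Evidence):

```python
# Guarded optional import: fall back gracefully when flash-attn is absent
try:
    from flash_attn import flash_attn_varlen_func
    HAS_FLASH_ATTN = True
except ImportError:
    flash_attn_varlen_func = None
    HAS_FLASH_ATTN = False
```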
## Credentials
No credentials are required. Optional integrations:
- `WANDB_API_KEY`: Weights & Biases API key for experiment tracking (used in train_enwik8.py example)
## Quick Install

```sh
# Install core package
pip install x-transformers

# Install with example script dependencies
pip install "x-transformers[examples]"

# Install with flash attention packed sequence support (requires CUDA)
pip install "x-transformers[flash-pack-seq]"

# Install everything
pip install "x-transformers[examples,flash-pack-seq]"
```
## Code Evidence
PyTorch 2.0 version check for flash attention from `attend.py:305-306`:

```python
torch_version = version.parse(torch.__version__)
assert not (flash and torch_version < version.parse('2.0.0')), 'in order to use flash attention, you must be using pytorch 2.0 or above'
```
GPU capability check for flash-attn packed sequences from `attend.py:315-316`:

```python
major, minor = torch.cuda.get_device_capability()
assert major >= 8, f"block masking with Flash Attention requires SM80+ (Ampere or newer) GPUs, but your GPU has SM{major}{minor}."
```
flash-attn package import with helpful error from `attend.py:310-314`:

```python
try:
    from flash_attn import flash_attn_varlen_func
    self.flash_attn_varlen_func = flash_attn_varlen_func
except ImportError:
    raise ImportError("block masking with Flash Attention requires the flash-attn package. Please install it with `pip install flash-attn`.")
```
PyTorch 2.3+ API adaptation for SDP backends from `attend.py:318-333`:

```python
# torch 2.3 uses new backend and context manager
if torch_version >= version.parse('2.3'):
    from torch.nn.attention import SDPBackend

    str_to_backend = dict(
        enable_flash = SDPBackend.FLASH_ATTENTION,
        enable_mem_efficient = SDPBackend.EFFICIENT_ATTENTION,
        enable_math = SDPBackend.MATH,
        enable_cudnn = SDPBackend.CUDNN_ATTENTION
    )

    sdpa_backends = [str_to_backend[enable_str] for enable_str, enable in sdp_kwargs.items() if enable]

    self.sdp_context_manager = partial(torch.nn.attention.sdpa_kernel, sdpa_backends)
else:
    self.sdp_context_manager = partial(torch.backends.cuda.sdp_kernel, **sdp_kwargs)
```
CUDA device requirement in training script from `train_enwik8.py:62`:

```python
model = AutoregressiveWrapper(model).cuda()
```
Device detection fallback from `train_copy.py:15`:

```python
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
```
## Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `in order to use flash attention, you must be using pytorch 2.0 or above` | PyTorch version < 2.0 with `attn_flash=True` | Upgrade PyTorch: `pip install "torch>=2.0"` |
| `block masking with Flash Attention requires SM80+ (Ampere or newer) GPUs` | GPU compute capability < 8.0 with flash pack seq enabled | Use a newer GPU (A100, RTX 3090+) or disable flash pack seq |
| `block masking with Flash Attention requires the flash-attn package` | flash-attn not installed when using packed sequence block masking | `pip install flash-attn` |
| `ImportError: No module named 'einx'` | Missing core dependency | `pip install x-transformers` (installs all core deps) |
## Compatibility Notes
- PyTorch 2.3+: Uses new `torch.nn.attention.SDPBackend` API for SDP kernel selection. Older versions use `torch.backends.cuda.sdp_kernel`.
- ONNX Export: Set `onnxable=True` on the Attend class to use an alternative causal mask implementation that avoids `.triu()`, which the ONNX CPU runtime does not support.
- Mixed Precision (AMP): Rotary and polar positional embeddings explicitly disable autocast (`@autocast('cuda', enabled=False)`) to maintain FP32 precision for frequency calculations.
- CPU Training: Supported for small models. Training scripts like `train_copy.py` auto-detect CUDA availability.
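The AMP note above can be illustrated with a sketch of a rotary-style frequency table kept in FP32 regardless of autocast (the function name and shapes are illustrative, not the library's actual helper; only the `@autocast('cuda', enabled=False)` pattern mirrors the library):

```python
import torch
from torch.amp import autocast

@autocast('cuda', enabled = False)
def rotary_freqs(seq_len, dim_head, theta = 10000.):
    # Frequencies stay fp32 even when called inside an autocast region,
    # keeping the downstream sin/cos table numerically precise
    inv_freq = 1. / (theta ** (torch.arange(0, dim_head, 2).float() / dim_head))
    t = torch.arange(seq_len).float()
    return torch.einsum('i,j->ij', t, inv_freq)
```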
## Related Pages
- Implementation:Lucidrains_X_transformers_TransformerWrapper_Decoder_Init
- Implementation:Lucidrains_X_transformers_AutoregressiveWrapper_Init
- Implementation:Lucidrains_X_transformers_AutoregressiveWrapper_Forward
- Implementation:Lucidrains_X_transformers_AutoregressiveWrapper_Generate
- Implementation:Lucidrains_X_transformers_XTransformer_Init
- Implementation:Lucidrains_X_transformers_XTransformer_Forward
- Implementation:Lucidrains_X_transformers_XTransformer_Generate
- Implementation:Lucidrains_X_transformers_TransformerWrapper_Encoder_Init
- Implementation:Lucidrains_X_transformers_NonAutoregressiveWrapper_Init
- Implementation:Lucidrains_X_transformers_NonAutoregressiveWrapper_Forward
- Implementation:Lucidrains_X_transformers_NonAutoregressiveWrapper_Generate
- Implementation:Lucidrains_X_transformers_DPO_Init
- Implementation:Lucidrains_X_transformers_DPO_Forward
- Implementation:Lucidrains_X_transformers_DPO_Policy_Model_Evaluation
- Implementation:Lucidrains_X_transformers_BeliefStateWrapper
- Implementation:Lucidrains_X_transformers_ContinuousTransformerWrapper
- Implementation:Lucidrains_X_transformers_FreeTransformer
- Implementation:Lucidrains_X_transformers_XValTransformerWrapper
- Implementation:Lucidrains_X_transformers_EntropyBasedTokenizer
- Implementation:Lucidrains_X_transformers_GPTVAE
- Implementation:Lucidrains_X_transformers_MultiInputTransformerWrapper
- Implementation:Lucidrains_X_transformers_NeoMLP
- Implementation:Lucidrains_X_transformers_UniversalPretrainWrapper
- Implementation:Lucidrains_X_transformers_XLAutoregressiveWrapper