# Environment: Lucidrains x-transformers PyTorch CUDA
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning |
| Last Updated | 2026-02-08 18:00 GMT |
## Overview
PyTorch 2.0+ environment with CUDA support for GPU-accelerated transformer training and inference.
## Description
This environment provides the core runtime for all x-transformers model training and inference. It requires PyTorch 2.0 or newer as the base deep learning framework. CUDA-capable GPUs are strongly recommended for practical training speeds, though CPU execution is supported for small-scale experiments. The library leverages PyTorch's native scaled dot-product attention (SDPA) and optionally supports the flash-attn package for packed sequence processing on Ampere+ GPUs.
## Usage
Use this environment for all x-transformers workflows: autoregressive language modeling, encoder-decoder sequence-to-sequence, non-autoregressive masked generation, and DPO preference alignment. Every Implementation page in the wiki requires this environment as a prerequisite.
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | macOS and Windows supported for development; Linux recommended for training |
| Hardware | NVIDIA GPU (recommended) | CPU fallback supported; SM80+ (Ampere/Hopper) required for flash-attn packed sequences |
| VRAM | 8GB+ recommended | Depends on model size; small demo models (dim=512, depth=6) fit in 4GB |
| Python | >= 3.9 | Specified in pyproject.toml |
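The hardware rows above can be checked programmatically. A small sketch using only PyTorch and `packaging` (the helper name `check_environment` is illustrative, not a library function):

```python
import torch
from packaging import version

def check_environment():
    # Mirrors the requirements table: PyTorch >= 2.0, optional CUDA,
    # and SM80+ (Ampere/Hopper) for flash-attn packed sequences
    info = {
        'torch_ok': version.parse(torch.__version__) >= version.parse('2.0.0'),
        'cuda': torch.cuda.is_available(),
        'flash_pack_seq_capable': False,
    }
    if info['cuda']:
        major, _minor = torch.cuda.get_device_capability()
        info['flash_pack_seq_capable'] = major >= 8
    return info

print(check_environment())
```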
## Dependencies

### Core Packages
- `torch` >= 2.0
- `einx` >= 0.3.0
- `einops` >= 0.8.0
- `loguru`
- `packaging` >= 21.0
### Optional Packages
- `flash-attn` >= 2.0 (for packed sequence flash attention with block masking)
- `adam-atan2-pytorch` >= 0.2.2 (for example training scripts)
- `lion-pytorch` (alternative optimizer for examples)
- `tqdm` (progress bars for examples)
- `pytest` (for running tests)
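Because `flash-attn` is optional, downstream code often guards its import (a common pattern shown for illustration; the library itself raises a hard error instead, as shown under Code Evidence):

```python
# Guarded optional import: fall back gracefully when flash-attn is absent
try:
    from flash_attn import flash_attn_varlen_func
    HAS_FLASH_ATTN = True
except ImportError:
    flash_attn_varlen_func = None
    HAS_FLASH_ATTN = False
```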
## Credentials
No credentials are required. Optional integrations:
- `WANDB_API_KEY`: Weights & Biases API key for experiment tracking (used in train_enwik8.py example)
## Quick Install

```sh
# Install core package
pip install x-transformers

# Install with example script dependencies
pip install "x-transformers[examples]"

# Install with flash attention packed sequence support (requires CUDA)
pip install "x-transformers[flash-pack-seq]"

# Install everything
pip install "x-transformers[examples,flash-pack-seq]"
```
## Code Evidence
PyTorch 2.0 version check for flash attention from `attend.py:305-306`:

```python
torch_version = version.parse(torch.__version__)
assert not (flash and torch_version < version.parse('2.0.0')), 'in order to use flash attention, you must be using pytorch 2.0 or above'
```
GPU capability check for flash-attn packed sequences from `attend.py:315-316`:

```python
major, minor = torch.cuda.get_device_capability()
assert major >= 8, f"block masking with Flash Attention requires SM80+ (Ampere or newer) GPUs, but your GPU has SM{major}{minor}."
```
flash-attn package import with helpful error from `attend.py:310-314`:

```python
try:
    from flash_attn import flash_attn_varlen_func
    self.flash_attn_varlen_func = flash_attn_varlen_func
except ImportError:
    raise ImportError("block masking with Flash Attention requires the flash-attn package. Please install it with `pip install flash-attn`.")
```
PyTorch 2.3+ API adaptation for SDP backends from `attend.py:318-333`:

```python
# torch 2.3 uses new backend and context manager
if torch_version >= version.parse('2.3'):
    from torch.nn.attention import SDPBackend

    str_to_backend = dict(
        enable_flash = SDPBackend.FLASH_ATTENTION,
        enable_mem_efficient = SDPBackend.EFFICIENT_ATTENTION,
        enable_math = SDPBackend.MATH,
        enable_cudnn = SDPBackend.CUDNN_ATTENTION
    )

    sdpa_backends = [str_to_backend[enable_str] for enable_str, enable in sdp_kwargs.items() if enable]

    self.sdp_context_manager = partial(torch.nn.attention.sdpa_kernel, sdpa_backends)
else:
    self.sdp_context_manager = partial(torch.backends.cuda.sdp_kernel, **sdp_kwargs)
```
CUDA device requirement in training script from `train_enwik8.py:62`:

```python
model = AutoregressiveWrapper(model).cuda()
```
Device detection fallback from `train_copy.py:15`:

```python
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
```
## Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `in order to use flash attention, you must be using pytorch 2.0 or above` | PyTorch version < 2.0 with `attn_flash=True` | Upgrade PyTorch: `pip install "torch>=2.0"` |
| `block masking with Flash Attention requires SM80+ (Ampere or newer) GPUs` | GPU compute capability < 8.0 with flash pack seq enabled | Use a newer GPU (A100, RTX 3090+) or disable flash pack seq |
| `block masking with Flash Attention requires the flash-attn package` | flash-attn not installed when using packed sequence block masking | `pip install flash-attn` |
| `ImportError: No module named 'einx'` | Missing core dependency | `pip install x-transformers` (installs all core deps) |
## Compatibility Notes
- PyTorch 2.3+: Uses new `torch.nn.attention.SDPBackend` API for SDP kernel selection. Older versions use `torch.backends.cuda.sdp_kernel`.
- ONNX Export: Set `onnxable=True` on the Attend class to use an alternative causal mask implementation that avoids `.triu()`, which the ONNX CPU runtime does not support.
- Mixed Precision (AMP): Rotary and polar positional embeddings explicitly disable autocast (`@autocast('cuda', enabled=False)`) to maintain FP32 precision for frequency calculations.
- CPU Training: Supported for small models. Training scripts like `train_copy.py` auto-detect CUDA availability.
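The AMP note above can be illustrated with a sketch of a rotary-style frequency table kept in FP32 regardless of autocast (the function name and shapes are illustrative, not the library's actual helper; only the `@autocast('cuda', enabled=False)` pattern mirrors the library):

```python
import torch
from torch.amp import autocast

@autocast('cuda', enabled = False)
def rotary_freqs(seq_len, dim_head, theta = 10000.):
    # Frequencies stay fp32 even when called inside an autocast region,
    # keeping the downstream sin/cos table numerically precise
    inv_freq = 1. / (theta ** (torch.arange(0, dim_head, 2).float() / dim_head))
    t = torch.arange(seq_len).float()
    return torch.einsum('i,j->ij', t, inv_freq)
```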
## Related Pages
- Implementation:Lucidrains_X_transformers_TransformerWrapper_Decoder_Init
- Implementation:Lucidrains_X_transformers_AutoregressiveWrapper_Init
- Implementation:Lucidrains_X_transformers_AutoregressiveWrapper_Forward
- Implementation:Lucidrains_X_transformers_AutoregressiveWrapper_Generate
- Implementation:Lucidrains_X_transformers_XTransformer_Init
- Implementation:Lucidrains_X_transformers_XTransformer_Forward
- Implementation:Lucidrains_X_transformers_XTransformer_Generate
- Implementation:Lucidrains_X_transformers_TransformerWrapper_Encoder_Init
- Implementation:Lucidrains_X_transformers_NonAutoregressiveWrapper_Init
- Implementation:Lucidrains_X_transformers_NonAutoregressiveWrapper_Forward
- Implementation:Lucidrains_X_transformers_NonAutoregressiveWrapper_Generate
- Implementation:Lucidrains_X_transformers_DPO_Init
- Implementation:Lucidrains_X_transformers_DPO_Forward
- Implementation:Lucidrains_X_transformers_DPO_Policy_Model_Evaluation
- Implementation:Lucidrains_X_transformers_BeliefStateWrapper
- Implementation:Lucidrains_X_transformers_ContinuousTransformerWrapper
- Implementation:Lucidrains_X_transformers_FreeTransformer
- Implementation:Lucidrains_X_transformers_XValTransformerWrapper
- Implementation:Lucidrains_X_transformers_EntropyBasedTokenizer
- Implementation:Lucidrains_X_transformers_GPTVAE
- Implementation:Lucidrains_X_transformers_MultiInputTransformerWrapper
- Implementation:Lucidrains_X_transformers_NeoMLP
- Implementation:Lucidrains_X_transformers_UniversalPretrainWrapper
- Implementation:Lucidrains_X_transformers_XLAutoregressiveWrapper