Implementation: XTransformer `__init__` (lucidrains/x-transformers)
Metadata
| Field | Value |
|---|---|
| Repository | x-transformers |
| Domains | NLP, Model_Architecture |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
A concrete interface from the x-transformers library for configuring encoder-decoder sequence-to-sequence transformer models.
Description
XTransformer combines an encoder `TransformerWrapper` and a decoder `TransformerWrapper` (wrapped with `AutoregressiveWrapper`) into a single module. It accepts all configuration via prefixed keyword arguments: `enc_*` for encoder settings and `dec_*` for decoder settings.
Key behaviors of `__init__`:
- The shared `dim` parameter sets the model dimension for both encoder and decoder.
- All keyword arguments prefixed with `enc_` are extracted and forwarded to the encoder `TransformerWrapper` and its inner `Encoder` (`AttentionLayers` with `causal=False`).
- All keyword arguments prefixed with `dec_` are extracted and forwarded to the decoder `TransformerWrapper` and its inner `Decoder` (`AttentionLayers` with `causal=True`, `cross_attend=True`).
- The encoder is internally configured with `return_only_embed=True`, so it outputs hidden states rather than logits.
- The decoder is wrapped in `AutoregressiveWrapper` for automatic input/target splitting and loss computation.
- `tie_token_emb` -- when `True`, the encoder and decoder share the same token embedding matrix. Useful when source and target vocabularies are identical (e.g., copy tasks, monolingual summarization).
- `cross_attn_tokens_dropout` -- applies dropout to cross-attention tokens during training as a regularization strategy: a fraction of encoder hidden states is randomly dropped before being passed to the decoder cross-attention layers.
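The prefix-based routing of keyword arguments can be sketched as follows. `group_by_prefix` is a hypothetical helper written for illustration; it mirrors the behavior described above but is not the library's exact internal API:

```python
def group_by_prefix(prefix, kwargs):
    # Hypothetical helper: split kwargs into (prefix-stripped matches, rest).
    # Illustrates how XTransformer could route enc_*/dec_* settings;
    # not the library's exact internal function.
    matched = {k[len(prefix):]: v for k, v in kwargs.items() if k.startswith(prefix)}
    rest = {k: v for k, v in kwargs.items() if not k.startswith(prefix)}
    return matched, rest

kwargs = dict(enc_depth = 3, enc_heads = 8, dec_depth = 3, dec_max_seq_len = 65)
enc_kwargs, kwargs = group_by_prefix('enc_', kwargs)   # -> {'depth': 3, 'heads': 8}
dec_kwargs, kwargs = group_by_prefix('dec_', kwargs)   # -> {'depth': 3, 'max_seq_len': 65}
```

The stripped names (`depth`, `heads`, ...) are then forwarded to the respective `TransformerWrapper`, which is why the same setting can be given independently for each side.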
Usage
Import XTransformer when building sequence-to-sequence models. Configure encoder and decoder separately via prefixed parameters. Use for machine translation, summarization, copy tasks, or any input-to-output sequence transduction.
Code Reference
| Field | Value |
|---|---|
| Repository | x-transformers |
| File | x_transformers/x_transformers.py |
| Lines | L3830-3873 |
Signature:
```python
class XTransformer(Module):
    def __init__(
        self,
        *,
        dim,
        tie_token_emb = False,
        ignore_index = -100,
        pad_value = 0,
        cross_attn_tokens_dropout = 0.,
        **kwargs  # enc_* and dec_* prefixed params
    ):
```
Import:
```python
from x_transformers import XTransformer
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| `dim` | int | Yes | Shared model dimension for encoder and decoder |
| `tie_token_emb` | bool | No | Tie encoder/decoder token embeddings (default `False`) |
| `enc_num_tokens` | int | Yes | Encoder vocabulary size |
| `enc_depth` | int | Yes | Number of encoder layers |
| `enc_heads` | int | Yes | Number of encoder attention heads |
| `enc_max_seq_len` | int | Yes | Maximum encoder sequence length |
| `dec_num_tokens` | int | Yes | Decoder vocabulary size |
| `dec_depth` | int | Yes | Number of decoder layers |
| `dec_heads` | int | Yes | Number of decoder attention heads |
| `dec_max_seq_len` | int | Yes | Maximum decoder sequence length |
| `cross_attn_tokens_dropout` | float | No | Dropout rate for cross-attention tokens (default `0.`) |
Outputs
| Name | Type | Description |
|---|---|---|
| model | `XTransformer` | Module exposing `.encoder` (`TransformerWrapper`) and `.decoder` (`AutoregressiveWrapper`) |
Usage Examples
Copy Task (from train_copy.py)
```python
from x_transformers import XTransformer

model = XTransformer(
    dim = 128,
    tie_token_emb = True,
    return_tgt_loss = True,
    enc_num_tokens = 18,
    enc_depth = 3,
    enc_heads = 8,
    enc_max_seq_len = 32,
    dec_num_tokens = 18,
    dec_depth = 3,
    dec_heads = 8,
    dec_max_seq_len = 65
).cuda()
Larger Translation Model
```python
model = XTransformer(
    dim = 512,
    enc_num_tokens = 30000,
    enc_depth = 6,
    enc_heads = 8,
    enc_max_seq_len = 512,
    dec_num_tokens = 30000,
    dec_depth = 6,
    dec_heads = 8,
    dec_max_seq_len = 512,
    tie_token_emb = True,
    cross_attn_tokens_dropout = 0.1
)
```