Implementation:AUTOMATIC1111 Stable diffusion webui XLMRoBERTa Encoder
| Knowledge Sources | |
|---|---|
| Domains | Text Encoding, Transformer Models |
| Last Updated | 2025-05-15 00:00 GMT |
Overview
Implements XLM-RoBERTa-based text encoder models with linear projection transformations, used as the text conditioning backbone for multilingual Stable Diffusion models.
Description
This module defines configuration and model classes for text encoding using XLM-RoBERTa, a multilingual transformer model. It provides two configuration classes:
- BertSeriesConfig - Extends BertConfig with additional parameters for projection dimension, pooler function type, and encoder learning control.
- RobertaSeriesConfig - Extends XLMRobertaConfig with the same projection parameters, using CLS pooling by default.
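The two pooler_fn values name common pooling strategies. The sketch below illustrates what "cls" and "average" conventionally mean, using plain Python lists on a toy sequence. The helpers cls_pool and average_pool are hypothetical and not part of modules/xlmr.py, and whether the module acts on pooler_fn="average" is not stated here.

```python
from typing import List

Vector = List[float]


def cls_pool(sequence: List[Vector]) -> Vector:
    """Return the vector at position 0 (the <s>/CLS token)."""
    return list(sequence[0])


def average_pool(sequence: List[Vector], mask: List[int]) -> Vector:
    """Mean of the vectors whose mask entry is 1 (padding is ignored)."""
    kept = [vec for vec, m in zip(sequence, mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy sequence: 3 tokens with 2 features each; the last token is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

print(cls_pool(tokens))            # [1.0, 2.0]
print(average_pool(tokens, mask))  # [2.0, 3.0]
```

CLS pooling depends on the model having learned to summarize the sequence in its first token, while average pooling needs the attention mask so that padding does not dilute the mean.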
The main model class BertSeriesModelWithTransformation extends BertPreTrainedModel and wraps an XLMRobertaModel backbone. When no config is provided, it builds a default XLM-RoBERTa-Large configuration (hidden size 1024, 24 layers, 16 attention heads, vocabulary size 250002). On top of the backbone, the model adds a layer normalization (pre_LN) applied to the last hidden state, a linear transformation projecting from the hidden size (1024) to a configurable projection dimension, and a CLS-token pooler. The built-in default configuration uses a projection dimension of 768, whereas the configuration classes themselves default project_dim to 512.
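The post-backbone steps described above (layer normalization, linear projection, CLS pooling) can be sketched on toy dimensions with plain Python lists. This is an illustration only, showing one plausible ordering of the steps. The helpers layer_norm and linear and the tiny 4-to-2 sizes are assumptions standing in for the real 1024-to-768 torch layers, and the learned scale and shift of layer normalization are omitted.

```python
import math
from typing import List

Vector = List[float]


def layer_norm(vec: Vector, eps: float = 1e-5) -> Vector:
    """Normalize one token vector to zero mean and (roughly) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def linear(vec: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Compute weight @ vec + bias, with weight shaped [out_dim][in_dim]."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


# Toy sizes: hidden size 4, projection dimension 2, sequence length 3.
hidden_states = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0], [2.0, 2.0, 2.0, 6.0]]
weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # picks features 0 and 3
bias = [0.0, 0.0]

normalized = [layer_norm(tok) for tok in hidden_states]
projection_state = [linear(tok, weight, bias) for tok in normalized]  # [seq_len][project_dim]
pooled = projection_state[0]  # CLS pooling: projection of the first token

print(len(projection_state), len(projection_state[0]))  # 3 2
```

The same weight matrix is applied to every token position, which is why the sequence length is preserved and only the feature dimension changes.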
The encode(c) method tokenizes input text using the XLM-RoBERTa tokenizer (max length 77, padded), runs the forward pass, and returns the projection state. The forward() method processes input through the RoBERTa backbone with all hidden states enabled, applies layer normalization and linear projection, and returns a dictionary with pooler output, last hidden state, all hidden states, attentions, projection state, and raw sequence output.
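The fixed-length tokenization that encode() relies on can be pictured with the sketch below. It is not the tokenizer's implementation: pad_to_max_length is a hypothetical helper that operates on already-computed token IDs, and the special-token IDs (0 for `<s>`, 2 for `</s>`, 1 for `<pad>`) are XLM-RoBERTa's usual values, assumed here.

```python
from typing import Dict, List

BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # assumed XLM-RoBERTa special-token IDs


def pad_to_max_length(content_ids: List[int], max_length: int = 77) -> Dict[str, List[int]]:
    """Wrap IDs in <s> ... </s>, truncate to max_length, then right-pad."""
    content = content_ids[: max_length - 2]  # leave room for <s> and </s>
    input_ids = [BOS_ID] + content + [EOS_ID]
    attention_mask = [1] * len(input_ids)
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * padding,
        "attention_mask": attention_mask + [0] * padding,
    }


short_out = pad_to_max_length([100, 200, 300])
long_out = pad_to_max_length(list(range(1000, 1200)))

print(len(short_out["input_ids"]), sum(short_out["attention_mask"]))  # 77 5
print(len(long_out["input_ids"]), sum(long_out["attention_mask"]))    # 77 77
```

Because every prompt is padded or truncated to the same length, the projection state always has a sequence dimension of 77, regardless of how long the input text is.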
RobertaSeriesModelWithTransformation is a thin subclass that overrides only two class attributes: it sets base_model_prefix to "roberta" and config_class to RobertaSeriesConfig.
Usage
Use this module as the text encoder for Stable Diffusion models that require multilingual text conditioning via XLM-RoBERTa, such as certain community-trained multilingual models.
Code Reference
Source Location
- Repository: AUTOMATIC1111_Stable_diffusion_webui
- File: modules/xlmr.py
- Lines: 1-140
Signature
class BertSeriesConfig(BertConfig):
    def __init__(self, ..., project_dim=512, pooler_fn="average", learn_encoder=False, ...)

class RobertaSeriesConfig(XLMRobertaConfig):
    def __init__(self, ..., project_dim=512, pooler_fn='cls', learn_encoder=False, ...)

class BertSeriesModelWithTransformation(BertPreTrainedModel):
    def __init__(self, config=None, **kargs)
    def encode(self, c: str) -> Tensor
    def forward(self, input_ids, attention_mask, ...) -> dict

class RobertaSeriesModelWithTransformation(BertSeriesModelWithTransformation):
    base_model_prefix = 'roberta'
    config_class = RobertaSeriesConfig
Import
from modules.xlmr import BertSeriesModelWithTransformation, RobertaSeriesModelWithTransformation
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| c | str | Yes | Text string to encode via the encode() method. |
| input_ids | Tensor | No | Tokenized input IDs of shape [batch, seq_len] for forward(). |
| attention_mask | Tensor | No | Attention mask tensor of shape [batch, seq_len] for forward(). |
| token_type_ids | Tensor | No | Token type IDs for forward(). |
| position_ids | Tensor | No | Position IDs for forward(). |
| head_mask | Tensor | No | Head mask for forward(). |
| inputs_embeds | Tensor | No | Pre-computed input embeddings for forward(). |
| output_attentions | bool | No | Whether to return attention weights. |
| return_dict | bool | No | Whether to return a dict; defaults to config value. |
| output_hidden_states | bool | No | Whether to return all hidden states. |
Outputs
| Name | Type | Description |
|---|---|---|
| projection_state | Tensor | Linearly projected hidden states of shape [batch, seq_len, project_dim]. |
| pooler_output | Tensor | Pooled (CLS token) and projected output. |
| last_hidden_state | Tensor | Last layer hidden states from the RoBERTa backbone. |
| hidden_states | tuple[Tensor] | All layer hidden states. |
| attentions | tuple[Tensor] | Attention weights from all layers. |
| sequence_out | Tensor | Raw sequence output from the last layer. |
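As a quick reference for the table above, the sketch below computes the tensor shapes one would expect from the default configuration described earlier (hidden size 1024, projection dimension 768). expected_shapes is a hypothetical helper, not part of the module, and the pooler_output shape assumes the pooled vector is projected to project_dim as the table states.

```python
from typing import Dict, Tuple


def expected_shapes(
    batch: int,
    seq_len: int,
    hidden_size: int = 1024,
    project_dim: int = 768,
) -> Dict[str, Tuple[int, ...]]:
    """Expected shapes of the tensor-valued entries in the output dict.

    hidden_states and attentions are tuples of tensors and are omitted.
    """
    return {
        "projection_state": (batch, seq_len, project_dim),
        "pooler_output": (batch, project_dim),
        "last_hidden_state": (batch, seq_len, hidden_size),
        "sequence_out": (batch, seq_len, hidden_size),
    }


shapes = expected_shapes(batch=1, seq_len=77)
print(shapes["projection_state"])  # (1, 77, 768)
```

Only projection_state and pooler_output live in the projected space; last_hidden_state and sequence_out keep the backbone's hidden size.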
Usage Examples
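import torch
from transformers import XLMRobertaTokenizer
from modules.xlmr import BertSeriesModelWithTransformation

# Instantiate the encoder with its built-in XLM-RoBERTa-Large configuration.
# Weights are randomly initialised until a checkpoint is loaded.
device = "cuda" if torch.cuda.is_available() else "cpu"
encoder = BertSeriesModelWithTransformation().to(device).eval()

# Encode text to get conditioning embeddings
with torch.no_grad():
    embeddings = encoder.encode("a painting of a sunset over mountains")
print(embeddings.shape)  # torch.Size([1, 77, 768])

# Direct forward pass with tokenized inputs
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-large")
batch = tokenizer("a painting of a sunset over mountains", truncation=True,
                  max_length=77, padding="max_length", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = encoder(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
projection = outputs["projection_state"]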