Implementation:AUTOMATIC1111 Stable diffusion webui XLMRoBERTa Encoder

Knowledge Sources	AUTOMATIC1111_Stable_diffusion_webui
Domains	Text Encoding, Transformer Models
Last Updated	2025-05-15 00:00 GMT

Overview

Implements XLM-RoBERTa-based text encoder models with linear projection transformations, used as the text conditioning backbone for multilingual Stable Diffusion models.

Description

This module defines configuration and model classes for text encoding using XLM-RoBERTa, a multilingual transformer model. It provides two configuration classes:

BertSeriesConfig - Extends BertConfig with three extra parameters: project_dim (projection dimension, default 512), pooler_fn (pooling function, default "average"), and learn_encoder (default False).
RobertaSeriesConfig - Extends XLMRobertaConfig with the same three parameters, but defaults pooler_fn to 'cls'.

The main model class BertSeriesModelWithTransformation extends BertPreTrainedModel and wraps an XLMRobertaModel backbone. When no config is provided, it builds a default XLM-RoBERTa-Large configuration (hidden size 1024, 24 layers, 16 attention heads, vocabulary size 250002). On top of the backbone, the model adds a layer normalization (pre_LN) applied to the last hidden state, a linear transformation from the hidden size (1024) to a configurable projection dimension, and a CLS-token pooler. The projection dimension is 768 in this fallback configuration, whereas the two config classes default project_dim to 512.
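The shape bookkeeping of the head described above can be sketched without any dependencies. This is a minimal illustration, not the module's code: plain Python lists stand in for tensors, the sizes are shrunk to toy values, the weights are made up, and the helper names (layer_norm, linear, project) are hypothetical. Whether normalization precedes the per-token projection in the actual module should be checked against modules/xlmr.py; the output shapes are the same either way.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def linear(vec, weight, bias):
    """Apply y = W x + b, where weight has shape [out_dim][in_dim]."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def project(hidden, weight, bias):
    """Return (projection_state, pooler_output) for one sequence of token vectors."""
    normalized = [layer_norm(tok) for tok in hidden]
    # Per-token projection: [seq_len, hidden_size] -> [seq_len, project_dim].
    projection_state = [linear(tok, weight, bias) for tok in normalized]
    # CLS pooling: take the first token, then project it -> [project_dim].
    pooler_output = linear(normalized[0], weight, bias)
    return projection_state, pooler_output

# Toy sizes (assumed for illustration): seq_len=3, hidden_size=4, project_dim=2.
hidden = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 1.5, 1.5], [2.0, 0.0, 2.0, 0.0]]
weight = [[0.1, 0.2, 0.3, 0.4], [1.0, 0.0, -1.0, 0.0]]
bias = [0.0, 0.5]

projection_state, pooler_output = project(hidden, weight, bias)
print(len(projection_state), len(projection_state[0]))  # 3 2
print(len(pooler_output))  # 2
```

With the real dimensions, the same flow maps a [77, 1024] sequence to a [77, 768] projection state and a single 768-element pooled vector.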

The encode(c) method tokenizes the input text with the XLM-RoBERTa tokenizer (truncated and padded to a fixed length of 77 tokens), runs the forward pass, and returns the projection state. The forward() method runs the input through the RoBERTa backbone with all hidden states enabled, applies layer normalization and linear projection, and returns a dictionary with the keys pooler_output, last_hidden_state, hidden_states, attentions, projection_state, and sequence_out.
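The fixed-length tokenization step can be illustrated on its own. The sketch below is a simplified stand-in for what a tokenizer does when asked to truncate and pad to a maximum length; it is not the tokenizer's implementation. The function name pad_to_max_length is hypothetical, the token IDs are made up, and the pad ID of 1 is an assumption based on XLM-RoBERTa's usual vocabulary layout.

```python
MAX_LENGTH = 77   # fixed sequence length used by encode(), per the description above
PAD_TOKEN_ID = 1  # assumption: XLM-RoBERTa's usual <pad> ID

def pad_to_max_length(token_ids, max_length=MAX_LENGTH, pad_id=PAD_TOKEN_ID):
    """Truncate or right-pad token IDs; return (input_ids, attention_mask)."""
    # Simplification: a real tokenizer keeps the closing special token when
    # truncating, whereas this sketch simply cuts the list.
    kept = token_ids[:max_length]
    padding = max_length - len(kept)
    input_ids = kept + [pad_id] * padding
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * padding
    return input_ids, attention_mask

# Made-up token IDs standing in for a short prompt.
ids, mask = pad_to_max_length([0, 10, 11, 12, 2])
print(len(ids), sum(mask))  # 77 5
```

Because every prompt is brought to the same length, the projection state has a fixed sequence dimension of 77 regardless of how long the input text is.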

RobertaSeriesModelWithTransformation is a thin subclass that changes only two class attributes: it sets base_model_prefix to "roberta" and config_class to RobertaSeriesConfig.
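This kind of subclass works because the base class reads its class-level attributes through the instance or class, so overriding them is enough to change behavior. The sketch below shows the pattern with hypothetical stand-in classes (ToyConfig, ToyRobertaConfig, ToyModel, ToyRobertaModel) instead of the real transformers base classes. Only the default values for project_dim, pooler_fn, and learn_encoder are taken from the signatures documented on this page; the fallback logic in ToyModel.__init__ is a simplification and not how the real constructor picks its default config.

```python
class ToyConfig:
    """Stand-in for BertSeriesConfig: stores the extra projection settings."""
    def __init__(self, project_dim=512, pooler_fn="average", learn_encoder=False):
        self.project_dim = project_dim
        self.pooler_fn = pooler_fn
        self.learn_encoder = learn_encoder

class ToyRobertaConfig:
    """Stand-in for RobertaSeriesConfig: same fields, CLS pooling by default."""
    def __init__(self, project_dim=512, pooler_fn="cls", learn_encoder=False):
        self.project_dim = project_dim
        self.pooler_fn = pooler_fn
        self.learn_encoder = learn_encoder

class ToyModel:
    """Stand-in for the base model class."""
    base_model_prefix = "bert"
    config_class = ToyConfig

    def __init__(self, config=None):
        # Fall back to whichever config class the (sub)class declares.
        self.config = config if config is not None else self.config_class()

class ToyRobertaModel(ToyModel):
    """Only the class attributes change; all behavior is inherited."""
    base_model_prefix = "roberta"
    config_class = ToyRobertaConfig

model = ToyRobertaModel()
print(model.base_model_prefix, model.config.pooler_fn, model.config.project_dim)
# roberta cls 512
```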

Usage

Use this module as the text encoder for Stable Diffusion models that require multilingual text conditioning via XLM-RoBERTa, such as certain community-trained multilingual models.

Code Reference

Source Location

Repository: AUTOMATIC1111_Stable_diffusion_webui
File: modules/xlmr.py
Lines: 1-140

Signature

class BertSeriesConfig(BertConfig):
    def __init__(self, ..., project_dim=512, pooler_fn="average", learn_encoder=False, ...)

class RobertaSeriesConfig(XLMRobertaConfig):
    def __init__(self, ..., project_dim=512, pooler_fn='cls', learn_encoder=False, ...)

class BertSeriesModelWithTransformation(BertPreTrainedModel):
    def __init__(self, config=None, **kargs)
    def encode(self, c: str) -> Tensor
    def forward(self, input_ids, attention_mask, ...) -> dict

class RobertaSeriesModelWithTransformation(BertSeriesModelWithTransformation):
    base_model_prefix = 'roberta'
    config_class = RobertaSeriesConfig

Import

from modules.xlmr import BertSeriesModelWithTransformation, RobertaSeriesModelWithTransformation

I/O Contract

Inputs

Name	Type	Required	Description
c	str	Yes	Text string to encode via the encode() method.
input_ids	Tensor	No	Tokenized input IDs of shape [batch, seq_len] for forward().
attention_mask	Tensor	No	Attention mask tensor of shape [batch, seq_len] for forward().
token_type_ids	Tensor	No	Token type IDs for forward().
position_ids	Tensor	No	Position IDs for forward().
head_mask	Tensor	No	Head mask for forward().
inputs_embeds	Tensor	No	Pre-computed input embeddings for forward().
output_attentions	bool	No	Whether to return attention weights.
return_dict	bool	No	Whether to return a dict; defaults to config value.
output_hidden_states	bool	No	Whether to return all hidden states. As noted in the Description, forward() runs the backbone with hidden states enabled.

Outputs

Name	Type	Description
projection_state	Tensor	Linearly projected hidden states of shape [batch, seq_len, project_dim].
pooler_output	Tensor	Pooled (CLS token) and projected output.
last_hidden_state	Tensor	Last layer hidden states from the RoBERTa backbone.
hidden_states	tuple[Tensor]	All layer hidden states.
attentions	tuple[Tensor]	Attention weights from all layers.
sequence_out	Tensor	Raw sequence output from the last layer.

Usage Examples

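import torch
from transformers import XLMRobertaTokenizer

from modules.xlmr import BertSeriesModelWithTransformation

# Build the encoder with the built-in default config.
# Note: this does not load trained weights; those come from a model checkpoint.
device = "cuda" if torch.cuda.is_available() else "cpu"
encoder = BertSeriesModelWithTransformation().to(device).eval()

# Encode text to get conditioning embeddings
with torch.no_grad():
    embeddings = encoder.encode("a painting of a sunset over mountains")
print(embeddings.shape)  # torch.Size([1, 77, 768])

# Direct forward pass with tokenized inputs
# (the tokenizer checkpoint name is an assumption; match it to your model)
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-large")
tokens = tokenizer("a painting of a sunset over mountains", truncation=True,
                   max_length=77, padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(input_ids=tokens["input_ids"].to(device),
                      attention_mask=tokens["attention_mask"].to(device))
projection = outputs['projection_state']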

Related Pages

Principle:AUTOMATIC1111_Stable_diffusion_webui_Text_Encoding
