Implementation:AUTOMATIC1111 Stable diffusion webui XLMRoBERTa M18 Encoder

Knowledge Sources	AUTOMATIC1111_Stable_diffusion_webui
Domains	Text Encoding, Transformer Models
Last Updated	2025-05-15 00:00 GMT

Overview

Implements a variant of the XLM-RoBERTa text encoder (M18 version) that uses a pre-transformation layer to project the second-to-last hidden state for text conditioning in multilingual Stable Diffusion models.

Description

This module is a variant of the standard xlmr module, differing in how hidden states are projected for use as text conditioning. It shares the same BertSeriesConfig and RobertaSeriesConfig configuration classes with the base xlmr module.

The key differences in the BertSeriesModelWithTransformation class are:

The default projection dimension is 1024 (instead of 768).
A has_pre_transformation flag is set to True, enabling an additional transformation_pre linear layer alongside the main transformation layer.
The pre_LN layer normalization and transformation_pre projection are applied to the second-to-last hidden state (hidden_states[-2]) rather than the last hidden state.
The CLS pooler and post-pooler projection used in the base xlmr module are not applied. Instead, the model returns a projection_state that comes from the pre-transformation path (second-to-last layer) when has_pre_transformation is True, and from the standard transformation path (last hidden state) otherwise.
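
The flag-dependent projection described above can be sketched without any dependencies. This is only an illustration of the arithmetic, not the module's code: the real class uses PyTorch layers, and the helper names, the params dictionary layout, the eps value, the toy sizes, and the absence of layer normalization on the fallback path are all assumptions made for this sketch. With Hugging Face-style outputs, hidden_states holds the embedding output plus one entry per transformer layer, so index -2 is the output of the penultimate layer.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance.
    Learned scale and shift are omitted; eps is an assumed value."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def linear(vec, weight, bias):
    """Apply y = W x + b, with weight stored as [out_features][in_features]."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]

def project(hidden_states, params, has_pre_transformation=True):
    """Pick the source layer and projection according to the flag.

    hidden_states: list of layers, each a list of per-token feature vectors.
    params: hypothetical dict mapping "transformation" and
            "transformation_pre" to (weight, bias) pairs.
    """
    if has_pre_transformation:
        weight, bias = params["transformation_pre"]
        # Second-to-last layer, layer-normalized before projection.
        return [linear(layer_norm(tok), weight, bias) for tok in hidden_states[-2]]
    weight, bias = params["transformation"]
    # Fallback path: project the last hidden state (no layer norm assumed here).
    return [linear(tok, weight, bias) for tok in hidden_states[-1]]

# Toy sizes: 3 layers, 2 tokens, hidden size 4, projection size 2.
hidden_states = [
    [[float(layer + tok + i) for i in range(4)] for tok in range(2)]
    for layer in range(3)
]
# A projection that simply keeps the first two features.
pick_first_two = ([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]], [0.0, 0.0])
params = {"transformation": pick_first_two, "transformation_pre": pick_first_two}

pre = project(hidden_states, params, has_pre_transformation=True)
post = project(hidden_states, params, has_pre_transformation=False)
```

In the sketch, pre is computed from the middle layer of the toy stack after normalization, while post reads the final layer unchanged, which mirrors the two paths listed above.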

The encode() method and tokenization logic remain identical to the base version. The forward() method returns a dictionary with projection_state, last_hidden_state, hidden_states, and attentions (omitting pooler_output and sequence_out compared to the base version).
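
The usage example on this page reports a sequence length of 77, which implies that prompts are truncated or padded to a fixed length before the forward pass. A minimal sketch of that step follows. The real module delegates this to its tokenizer; the helper name, the max_length of 77, and the special-token IDs (bos=0, pad=1, eos=2, the XLM-RoBERTa defaults) are assumptions used for illustration.

```python
def pad_to_fixed_length(token_ids, max_length=77, bos_id=0, eos_id=2, pad_id=1):
    """Wrap content tokens in bos/eos, truncate, and right-pad to max_length.

    Returns (input_ids, attention_mask) as plain Python lists.
    """
    content = list(token_ids)[: max_length - 2]  # leave room for bos and eos
    input_ids = [bos_id] + content + [eos_id]
    attention_mask = [1] * len(input_ids)        # 1 marks real tokens
    padding = max_length - len(input_ids)
    input_ids += [pad_id] * padding
    attention_mask += [0] * padding              # 0 marks padding
    return input_ids, attention_mask

# Hypothetical content token IDs for a short prompt.
ids, mask = pad_to_fixed_length([100, 200, 300])
```

The attention mask built here corresponds to the attention_mask argument of forward(): padded positions are marked 0 so the backbone ignores them.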

Usage

Use this module for Stable Diffusion models that were trained with the M18 variant of the XLM-RoBERTa text encoder, which uses penultimate layer features for conditioning rather than the final layer.

Code Reference

Source Location

Repository: AUTOMATIC1111_Stable_diffusion_webui
File: modules/xlmr_m18.py
Lines: 1-166

Signature

class BertSeriesConfig(BertConfig):
    def __init__(self, ..., project_dim=512, pooler_fn="average", learn_encoder=False, ...)

class RobertaSeriesConfig(XLMRobertaConfig):
    def __init__(self, ..., project_dim=512, pooler_fn='cls', learn_encoder=False, ...)

class BertSeriesModelWithTransformation(BertPreTrainedModel):
    has_pre_transformation: bool = True
    def __init__(self, config=None, **kargs)
    def encode(self, c: str) -> Tensor
    def forward(self, input_ids, attention_mask, ...) -> dict

class RobertaSeriesModelWithTransformation(BertSeriesModelWithTransformation):
    base_model_prefix = 'roberta'
    config_class = RobertaSeriesConfig

Import

from modules.xlmr_m18 import BertSeriesModelWithTransformation, RobertaSeriesModelWithTransformation

I/O Contract

Inputs

Name	Type	Required	Description
c	str	Yes	Text string to encode via the encode() method.
input_ids	Tensor	No	Tokenized input IDs of shape [batch, seq_len] for forward().
attention_mask	Tensor	No	Attention mask tensor of shape [batch, seq_len] for forward().
token_type_ids	Tensor	No	Token type IDs for forward().
position_ids	Tensor	No	Position IDs for forward().
head_mask	Tensor	No	Head mask for forward().
inputs_embeds	Tensor	No	Pre-computed input embeddings for forward().
output_attentions	bool	No	Whether to return attention weights.
return_dict	bool	No	Whether to return a dict; defaults to config value.
output_hidden_states	bool	No	Whether to return all hidden states.

Outputs

Name	Type	Description
projection_state	Tensor	Projected hidden states from the second-to-last layer, shape [batch, seq_len, 1024].
last_hidden_state	Tensor	Last layer hidden states from the RoBERTa backbone.
hidden_states	tuple[Tensor]	All layer hidden states from the RoBERTa backbone.
attentions	tuple[Tensor]	Attention weights from all layers.
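
Because the output keys differ from the base xlmr module, they can serve as a quick sanity check on which variant produced a result. The sketch below relies only on the key names documented on this page; the helper name and the toy dictionaries are hypothetical.

```python
# Keys documented for the M18 variant's forward() output.
M18_KEYS = {"projection_state", "last_hidden_state", "hidden_states", "attentions"}
# Keys documented as present only in the base xlmr variant.
BASE_ONLY_KEYS = {"pooler_output", "sequence_out"}

def looks_like_m18_output(outputs):
    """Return True if a forward() result has the M18 keys and no base-only keys."""
    keys = set(outputs)
    return M18_KEYS <= keys and not (BASE_ONLY_KEYS & keys)

# Toy stand-ins for real forward() results (values are placeholders).
m18_like = dict.fromkeys(M18_KEYS)
base_like = dict.fromkeys(M18_KEYS | BASE_ONLY_KEYS)
```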

Usage Examples

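from modules.xlmr_m18 import BertSeriesModelWithTransformation

# Instantiate the M18 variant encoder with its built-in default config.
# Constructing the class this way does not load trained weights.
encoder = BertSeriesModelWithTransformation()
encoder = encoder.to("cuda")

# Encode text - uses penultimate layer features
prompt = "a digital painting of a fantasy landscape"
embeddings = encoder.encode(prompt)
print(embeddings.shape)  # [1, 77, 1024]

# Tokenize manually for a direct forward pass.
# Assumption: the encoder exposes its tokenizer as encoder.tokenizer,
# and 77 is the sequence length implied by the shape above.
tokens = encoder.tokenizer(prompt, truncation=True, max_length=77,
                           padding="max_length", return_tensors="pt")
token_ids = tokens["input_ids"].to("cuda")
mask = tokens["attention_mask"].to("cuda")

# Direct forward pass
outputs = encoder(input_ids=token_ids, attention_mask=mask)
# projection_state comes from hidden_states[-2] via pre_LN + transformation_pre
conditioning = outputs['projection_state']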

Related Pages

Principle:AUTOMATIC1111_Stable_diffusion_webui_Text_Encoding
