Implementation:AUTOMATIC1111 Stable diffusion webui XLMRoBERTa M18 Encoder
| Knowledge Sources | |
|---|---|
| Domains | Text Encoding, Transformer Models |
| Last Updated | 2025-05-15 00:00 GMT |
Overview
Implements a variant of the XLM-RoBERTa text encoder (M18 version) that uses a pre-transformation layer to project the second-to-last hidden state for text conditioning in multilingual Stable Diffusion models.
Description
This module is a variant of the standard xlmr module, differing in how hidden states are projected for use as text conditioning. It defines the same BertSeriesConfig and RobertaSeriesConfig configuration classes as the base xlmr module.
The key differences in the BertSeriesModelWithTransformation class are:
- The default projection dimension is 1024 (instead of 768).
- A has_pre_transformation flag is set to True, enabling an additional transformation_pre linear layer alongside the main transformation layer.
- The pre_LN layer normalization and transformation_pre projection are applied to the second-to-last hidden state (hidden_states[-2]) rather than the last hidden state.
- The CLS pooler and post-pooler projection used in the base xlmr module are not applied; instead, the model returns a projection_state derived from either the pre-transformation path (second-to-last layer) or the standard transformation path (last hidden state), depending on the has_pre_transformation flag.
The encode() method and tokenization logic remain identical to the base version. The forward() method returns a dictionary with projection_state, last_hidden_state, hidden_states, and attentions (omitting pooler_output and sequence_out compared to the base version).
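The two projection paths described above can be sketched with a toy stand-in. This is a minimal illustration, not the module's code: the class name ToyProjectionHead, the small dimensions, and the random tensors standing in for backbone hidden states are all assumptions. Only the layer names (pre_LN, transformation_pre, transformation) and the hidden_states[-2] selection are taken from the description.

```python
import torch
import torch.nn as nn


class ToyProjectionHead(nn.Module):
    """Hypothetical stand-in for the projection logic described above."""

    def __init__(self, hidden_size=32, project_dim=16, has_pre_transformation=True):
        super().__init__()
        self.has_pre_transformation = has_pre_transformation
        # Standard path: projects the last hidden state.
        self.transformation = nn.Linear(hidden_size, project_dim)
        if has_pre_transformation:
            # Pre-transformation path: layer norm + projection of the penultimate layer.
            self.pre_LN = nn.LayerNorm(hidden_size)
            self.transformation_pre = nn.Linear(hidden_size, project_dim)

    def forward(self, hidden_states):
        # hidden_states: tuple of [batch, seq_len, hidden_size] tensors, one per layer.
        if self.has_pre_transformation:
            penultimate = hidden_states[-2]
            return self.transformation_pre(self.pre_LN(penultimate))
        return self.transformation(hidden_states[-1])


torch.manual_seed(0)
# Random tensors standing in for the backbone's per-layer outputs (assumed sizes).
fake_hidden_states = tuple(torch.randn(2, 5, 32) for _ in range(4))
head = ToyProjectionHead()
projection_state = head(fake_hidden_states)
print(projection_state.shape)  # torch.Size([2, 5, 16])
```

With the flag enabled, the result depends only on the penultimate entry of the tuple, so replacing the final layer's tensor leaves the projection unchanged.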
Usage
Use this module for Stable Diffusion models that were trained with the M18 variant of the XLM-RoBERTa text encoder, which uses penultimate layer features for conditioning rather than the final layer.
Code Reference
Source Location
- Repository: AUTOMATIC1111_Stable_diffusion_webui
- File: modules/xlmr_m18.py
- Lines: 1-166
Signature
class BertSeriesConfig(BertConfig):
    def __init__(self, ..., project_dim=512, pooler_fn="average", learn_encoder=False, ...)
class RobertaSeriesConfig(XLMRobertaConfig):
    def __init__(self, ..., project_dim=512, pooler_fn='cls', learn_encoder=False, ...)
class BertSeriesModelWithTransformation(BertPreTrainedModel):
    has_pre_transformation: bool = True
    def __init__(self, config=None, **kargs)
    def encode(self, c: str) -> Tensor
    def forward(self, input_ids, attention_mask, ...) -> dict
class RobertaSeriesModelWithTransformation(BertSeriesModelWithTransformation):
    base_model_prefix = 'roberta'
    config_class = RobertaSeriesConfig
Import
from modules.xlmr_m18 import BertSeriesModelWithTransformation, RobertaSeriesModelWithTransformation
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| c | str | Yes | Text string to encode via the encode() method. |
| input_ids | Tensor | No | Tokenized input IDs of shape [batch, seq_len] for forward(). |
| attention_mask | Tensor | No | Attention mask tensor of shape [batch, seq_len] for forward(). |
| token_type_ids | Tensor | No | Token type IDs for forward(). |
| position_ids | Tensor | No | Position IDs for forward(). |
| head_mask | Tensor | No | Head mask for forward(). |
| inputs_embeds | Tensor | No | Pre-computed input embeddings for forward(). |
| output_attentions | bool | No | Whether to return attention weights. |
| return_dict | bool | No | Whether to return a dict; defaults to config value. |
| output_hidden_states | bool | No | Whether to return all hidden states. |
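
For forward(), input_ids and attention_mask must share the shape [batch, seq_len]. The sketch below shows how one fixed-length pair could be built from raw token IDs. pad_to_length is a hypothetical helper, and both the length of 77 and the padding ID of 1 are assumptions (the former inferred from the output shape in the usage example, the latter from the usual XLM-RoBERTa padding token), not values confirmed by this page. In practice the tokenizer produces both tensors.

```python
def pad_to_length(token_ids, max_length=77, pad_id=1):
    """Truncate or right-pad token IDs and build the matching attention mask."""
    # Plain truncation; a real tokenizer would also preserve the closing special token.
    kept = list(token_ids[:max_length])
    mask = [1] * len(kept)
    padding = max_length - len(kept)
    # Padded positions get pad_id in the IDs and 0 in the mask.
    return kept + [pad_id] * padding, mask + [0] * padding


# Illustrative token IDs, not real tokenizer output.
input_ids, attention_mask = pad_to_length([0, 1234, 5678, 2])
print(len(input_ids), sum(attention_mask))  # 77 4
```
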
Outputs
| Name | Type | Description |
|---|---|---|
| projection_state | Tensor | Projected hidden states from the second-to-last layer, shape [batch, seq_len, 1024]. |
| last_hidden_state | Tensor | Last layer hidden states from the RoBERTa backbone. |
| hidden_states | tuple[Tensor] | All layer hidden states from the RoBERTa backbone. |
| attentions | tuple[Tensor] | Attention weights from all layers. |
Usage Examples
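from modules.xlmr_m18 import BertSeriesModelWithTransformation
# Load the M18 variant encoder with its built-in default config
# (weights are randomly initialized until a checkpoint is loaded)
encoder = BertSeriesModelWithTransformation()
encoder = encoder.to("cuda")
# Encode text - uses penultimate layer features
embeddings = encoder.encode("a digital painting of a fantasy landscape")
print(embeddings.shape)  # torch.Size([1, 77, 1024])
# Direct forward pass: tokenize first to obtain input_ids and attention_mask
# (assumes the encoder exposes its tokenizer as encoder.tokenizer)
tokens = encoder.tokenizer(
    "a digital painting of a fantasy landscape", padding="max_length",
    max_length=77, truncation=True, return_tensors="pt",
).to("cuda")
outputs = encoder(input_ids=tokens["input_ids"], attention_mask=tokens["attention_mask"])
# projection_state comes from hidden_states[-2] via pre_LN + transformation_pre
conditioning = outputs['projection_state']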