Implementation:AUTOMATIC1111 Stable diffusion webui XLMRoBERTa M18 Encoder

From Leeroopedia


Knowledge Sources
Domains: Text Encoding, Transformer Models
Last Updated: 2025-05-15 00:00 GMT

Overview

Implements a variant of the XLM-RoBERTa text encoder (M18 version) that uses a pre-transformation layer to project the second-to-last hidden state for text conditioning in multilingual Stable Diffusion models.

Description

This module is a variant of the standard xlmr module, differing in how hidden states are projected for use as text conditioning. It shares the same BertSeriesConfig and RobertaSeriesConfig configuration classes with the base xlmr module.

The key differences in the BertSeriesModelWithTransformation class are:

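  • The default projection dimension is 1024 (instead of 768 in the base xlmr module) when the model is constructed without an explicit config. The config classes themselves default project_dim to 512, as shown in the signatures below.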
  • A has_pre_transformation flag is set to True, enabling an additional transformation_pre linear layer alongside the main transformation layer.
  • The pre_LN layer normalization and transformation_pre projection are applied to the second-to-last hidden state (hidden_states[-2]) rather than the last hidden state.
  • The CLS pooler and post-pooler projection of the base xlmr module are not applied. Instead, the model returns a projection_state computed from the pre-transformation path (second-to-last layer) when has_pre_transformation is True, or from the standard transformation path (last hidden state) otherwise.
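
The selection logic in the bullets above can be illustrated with a small, dependency-free sketch. It is not the module's code: the real layers are PyTorch modules with learned parameters, whereas layer_norm, linear, and select_projection here are hypothetical helpers operating on plain lists, and the layer norm omits the learned scale and shift.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def linear(vec, weight, bias):
    """Apply y = W x + b to one token vector; weight has shape [out_dim][in_dim]."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def select_projection(hidden_states, pre, post, has_pre_transformation=True):
    """Return per-token projections, choosing the path by the flag.

    hidden_states: list of layers, each a list of token vectors.
    pre / post: hypothetical (weight, bias) pairs standing in for
    transformation_pre and transformation.
    """
    if has_pre_transformation:
        # Penultimate layer -> layer norm -> pre-transformation projection.
        return [linear(layer_norm(tok), *pre) for tok in hidden_states[-2]]
    # Otherwise project the last layer directly.
    return [linear(tok, *post) for tok in hidden_states[-1]]


# Toy data: 3 layers, 2 tokens per layer, hidden size 2, identity projections.
identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
layers = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 3.0], [2.0, 6.0]],  # penultimate layer
    [[5.0, 5.0], [7.0, 9.0]],  # last layer
]
m18_state = select_projection(layers, identity, identity, has_pre_transformation=True)
last_state = select_projection(layers, identity, identity, has_pre_transformation=False)
```

With the flag set, the result is built from the normalized penultimate layer; with it cleared, the last layer passes through the standard projection unchanged by any normalization.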

The encode() method and tokenization logic remain identical to the base version. The forward() method returns a dictionary with projection_state, last_hidden_state, hidden_states, and attentions (omitting pooler_output and sequence_out compared to the base version).
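
Code that handles both encoder variants may want to check which output layout it received. The following is a minimal sketch of such a guard, based only on the key names listed above; get_conditioning and the two key tuples are hypothetical names, not part of the module.

```python
# Keys the M18 forward() is described as returning.
M18_KEYS = ("projection_state", "last_hidden_state", "hidden_states", "attentions")
# Keys described as present only in the base xlmr output.
BASE_ONLY_KEYS = ("pooler_output", "sequence_out")


def get_conditioning(outputs):
    """Return projection_state after checking the dict matches the M18 layout."""
    missing = [key for key in M18_KEYS if key not in outputs]
    if missing:
        raise KeyError(f"missing expected keys: {missing}")
    unexpected = [key for key in BASE_ONLY_KEYS if key in outputs]
    if unexpected:
        raise ValueError(f"looks like base xlmr output, found: {unexpected}")
    return outputs["projection_state"]


# Placeholder values stand in for tensors in this illustration.
outputs = {
    "projection_state": [[0.1, 0.2]],
    "last_hidden_state": [[0.3, 0.4]],
    "hidden_states": (),
    "attentions": (),
}
conditioning = get_conditioning(outputs)
```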

Usage

Use this module for Stable Diffusion models that were trained with the M18 variant of the XLM-RoBERTa text encoder, which uses penultimate layer features for conditioning rather than the final layer.

Code Reference

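Source Location

modules/xlmr_m18.py in the AUTOMATIC1111 stable-diffusion-webui repository (path inferred from the import statement below)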

Signature

class BertSeriesConfig(BertConfig):
    def __init__(self, ..., project_dim=512, pooler_fn="average", learn_encoder=False, ...)

class RobertaSeriesConfig(XLMRobertaConfig):
    def __init__(self, ..., project_dim=512, pooler_fn='cls', learn_encoder=False, ...)

class BertSeriesModelWithTransformation(BertPreTrainedModel):
    has_pre_transformation: bool = True
    def __init__(self, config=None, **kargs)
    def encode(self, c: str) -> Tensor
    def forward(self, input_ids, attention_mask, ...) -> dict

class RobertaSeriesModelWithTransformation(BertSeriesModelWithTransformation):
    base_model_prefix = 'roberta'
    config_class = RobertaSeriesConfig

Import

from modules.xlmr_m18 import BertSeriesModelWithTransformation, RobertaSeriesModelWithTransformation

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| c | str | Yes | Text string to encode via the encode() method. |
| input_ids | Tensor | No | Tokenized input IDs of shape [batch, seq_len] for forward(). |
| attention_mask | Tensor | No | Attention mask tensor of shape [batch, seq_len] for forward(). |
| token_type_ids | Tensor | No | Token type IDs for forward(). |
| position_ids | Tensor | No | Position IDs for forward(). |
| head_mask | Tensor | No | Head mask for forward(). |
| inputs_embeds | Tensor | No | Pre-computed input embeddings for forward(). |
| output_attentions | bool | No | Whether to return attention weights. |
| return_dict | bool | No | Whether to return a dict; defaults to config value. |
| output_hidden_states | bool | No | Whether to return all hidden states. |

Outputs

| Name | Type | Description |
|---|---|---|
| projection_state | Tensor | Projected hidden states from the second-to-last layer, shape [batch, seq_len, 1024]. |
| last_hidden_state | Tensor | Last layer hidden states from the RoBERTa backbone. |
| hidden_states | tuple[Tensor] | All layer hidden states from the RoBERTa backbone. |
| attentions | tuple[Tensor] | Attention weights from all layers. |

Usage Examples

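import torch

from modules.xlmr_m18 import BertSeriesModelWithTransformation

# Build the M18 variant encoder with its default configuration.
# Checkpoint weights are loaded separately; this call alone does not load them.
device = "cuda" if torch.cuda.is_available() else "cpu"
encoder = BertSeriesModelWithTransformation().to(device).eval()

prompt = "a digital painting of a fantasy landscape"

# Encode text - uses penultimate layer features
with torch.no_grad():
    embeddings = encoder.encode(prompt)
print(embeddings.shape)  # [1, 77, 1024]

# Direct forward pass. The tokenizer attribute and its arguments are assumptions
# about the module, chosen to match the shape printed above.
tokens = encoder.tokenizer(prompt, padding="max_length", max_length=77,
                           truncation=True, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = encoder(input_ids=tokens["input_ids"], attention_mask=tokens["attention_mask"])
# projection_state comes from hidden_states[-2] via pre_LN + transformation_pre
conditioning = outputs['projection_state']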
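
The [1, 77, 1024] shape in the example above suggests that prompts are padded or truncated to a fixed length before encoding. A simplified sketch of that step follows; the maximum length of 77 and the pad ID of 1 are assumptions rather than values confirmed on this page, and pad_or_truncate is a hypothetical helper, not the tokenizer's API.

```python
def pad_or_truncate(token_ids, max_length=77, pad_id=1):
    """Return (ids, attention_mask), both exactly max_length long.

    Simplified: a real tokenizer also preserves special tokens when truncating.
    max_length and pad_id are assumed values for illustration.
    """
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)  # 1 marks real tokens
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


# A short prompt is padded; an over-long one is cut to max_length.
short_ids, short_mask = pad_or_truncate([0, 510, 620, 2])
long_ids, long_mask = pad_or_truncate(list(range(100)))
```

Padding every prompt to the same length keeps seq_len constant across a batch, which is why the conditioning tensor has a fixed second dimension regardless of prompt length.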
