Implementation:AUTOMATIC1111 Stable diffusion webui XLMRoBERTa Encoder

From Leeroopedia
Revision as of 14:05, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/AUTOMATIC1111_Stable_diffusion_webui_XLMRoBERTa_Encoder.md)

Knowledge Sources
Domains: Text Encoding, Transformer Models
Last Updated: 2025-05-15 00:00 GMT

Overview

Implements XLM-RoBERTa-based text encoder models with linear projection transformations, used as the text conditioning backbone for multilingual Stable Diffusion models.

Description

This module defines configuration and model classes for text encoding using XLM-RoBERTa, a multilingual transformer model. It provides two configuration classes:

  • BertSeriesConfig - Extends BertConfig with additional parameters for projection dimension, pooler function type, and encoder learning control.
  • RobertaSeriesConfig - Extends XLMRobertaConfig with the same projection parameters, using CLS pooling by default.
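The extension pattern used by the two configuration classes can be sketched without the transformers dependency. In the sketch below, `StandInConfig` and the `Sketch...` class names are hypothetical placeholders for `BertConfig` / `XLMRobertaConfig` and the real classes; only the three added parameters and their defaults are taken from this page.

```python
class StandInConfig:
    """Hypothetical placeholder for a transformers config base class."""

    def __init__(self, **kwargs):
        # Store every remaining keyword argument as an attribute.
        for name, value in kwargs.items():
            setattr(self, name, value)


class SketchBertSeriesConfig(StandInConfig):
    def __init__(self, project_dim=512, pooler_fn="average",
                 learn_encoder=False, **kwargs):
        super().__init__(**kwargs)
        self.project_dim = project_dim      # output width of the projection
        self.pooler_fn = pooler_fn          # name of the pooling strategy
        self.learn_encoder = learn_encoder  # encoder learning control flag


class SketchRobertaSeriesConfig(StandInConfig):
    def __init__(self, project_dim=512, pooler_fn="cls",
                 learn_encoder=False, **kwargs):
        super().__init__(**kwargs)
        self.project_dim = project_dim
        self.pooler_fn = pooler_fn          # CLS pooling by default
        self.learn_encoder = learn_encoder


bert_cfg = SketchBertSeriesConfig()
roberta_cfg = SketchRobertaSeriesConfig(project_dim=768, hidden_size=1024)
print(bert_cfg.pooler_fn, roberta_cfg.pooler_fn, roberta_cfg.project_dim)
```

The two sketch classes differ only in their `pooler_fn` default, which mirrors the difference between the two configuration classes described above.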

The main model class, BertSeriesModelWithTransformation, extends BertPreTrainedModel and wraps an XLMRobertaModel backbone. When no config is provided, it builds a default XLM-RoBERTa-Large configuration (hidden size 1024, 24 layers, 16 attention heads, vocabulary size 250002) with a projection dimension of 768. This differs from the project_dim=512 default of the two configuration classes, which applies only when one of those configs is constructed explicitly. On top of the backbone, the model adds three components: a layer normalization (pre_LN) applied to the last hidden state, a linear transformation that projects from the hidden size to project_dim, and a CLS-token pooler that selects the first token of the sequence.
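To make the three added components concrete, here is a dependency-free sketch on toy sizes. It is not the module's code: the function names are hypothetical, the layer normalization omits learned scale and shift, the `eps` value is illustrative, and applying normalization before the projection for both outputs is an assumption about the ordering.

```python
import math
import random


def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def linear(vec, weight, bias):
    """Apply y = W x + b, with weight stored as [out_dim][in_dim]."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def project(hidden, weight, bias):
    """hidden is [seq_len][hidden_size] for a single sequence."""
    normed = [layer_norm(tok) for tok in hidden]
    # Per-token projection: [seq_len][hidden_size] -> [seq_len][project_dim]
    projection_state = [linear(tok, weight, bias) for tok in normed]
    # CLS pooling: take the first token, then project it.
    pooler_output = linear(normed[0], weight, bias)
    return projection_state, pooler_output


# Toy sizes standing in for hidden_size=1024 and project_dim=768.
HIDDEN, PROJ, SEQ = 8, 6, 4
rng = random.Random(0)
hidden = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(SEQ)]
weight = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(PROJ)]
bias = [0.0] * PROJ

projection_state, pooler_output = project(hidden, weight, bias)
print(len(projection_state), len(projection_state[0]), len(pooler_output))
```

With the real default sizes, the same shape logic would turn a [77, 1024] sequence into a [77, 768] projection state and a 768-wide pooled vector.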

The encode(c) method tokenizes the input text with the XLM-RoBERTa tokenizer, truncating or padding it to a fixed length of 77 tokens, runs the forward pass, and returns the projection state. The forward() method runs the input through the RoBERTa backbone with all hidden states enabled, applies the layer normalization and linear projection, and returns a dictionary with six entries: pooler_output, last_hidden_state, hidden_states, attentions, projection_state, and sequence_out.
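The fixed-length step in encode() can be illustrated with a small stand-alone sketch. The helper name `pad_or_truncate`, the example token IDs, and the `pad_id=1` default are assumptions for illustration; the real tokenizer also handles special tokens during truncation, which this simplification does not.

```python
def pad_or_truncate(token_ids, max_length=77, pad_id=1):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]             # drop anything past the limit
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad       # right-pad with the pad token
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence is padded; a long one is truncated.
short_ids, short_mask = pad_or_truncate([0, 5, 6, 7, 2])
long_ids, long_mask = pad_or_truncate(list(range(100)))
print(len(short_ids), sum(short_mask), len(long_ids), sum(long_mask))
```

Because every prompt ends up with the same length, the projection state always has 77 positions along the sequence axis, regardless of prompt length.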

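RobertaSeriesModelWithTransformation is a thin subclass of BertSeriesModelWithTransformation. It adds no new behavior and only overrides two class attributes: it sets the base model prefix to "roberta" and uses RobertaSeriesConfig as its config class.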
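The attribute-override pattern can be shown in isolation. All class names and the base-class attribute values below are hypothetical stand-ins; only the "roberta" prefix and the idea of swapping the config class come from this page.

```python
class SketchRobertaSeriesConfig:
    """Hypothetical stand-in for the real config class."""


class SketchBaseModel:
    base_model_prefix = "base"   # hypothetical placeholder value
    config_class = dict          # hypothetical placeholder config type

    def describe(self):
        return (f"{type(self).__name__}: prefix={self.base_model_prefix}, "
                f"config={self.config_class.__name__}")


class SketchRobertaModel(SketchBaseModel):
    # Only class attributes change; every method is inherited unchanged.
    base_model_prefix = "roberta"
    config_class = SketchRobertaSeriesConfig


print(SketchRobertaModel().describe())
```

A subclass of this kind lets the same model code be registered under a different prefix and config type without duplicating any logic.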

Usage

Use this module as the text encoder for Stable Diffusion models that require multilingual text conditioning via XLM-RoBERTa, such as certain community-trained multilingual models.

Code Reference

Source Location

Signature

class BertSeriesConfig(BertConfig):
    def __init__(self, ..., project_dim=512, pooler_fn="average", learn_encoder=False, ...)

class RobertaSeriesConfig(XLMRobertaConfig):
    def __init__(self, ..., project_dim=512, pooler_fn='cls', learn_encoder=False, ...)

class BertSeriesModelWithTransformation(BertPreTrainedModel):
    def __init__(self, config=None, **kargs)
    def encode(self, c: str) -> Tensor
    def forward(self, input_ids, attention_mask, ...) -> dict

class RobertaSeriesModelWithTransformation(BertSeriesModelWithTransformation):
    base_model_prefix = 'roberta'
    config_class = RobertaSeriesConfig

Import

from modules.xlmr import BertSeriesModelWithTransformation, RobertaSeriesModelWithTransformation

I/O Contract

Inputs

Name                  Type    Required  Description
c                     str     Yes       Text string to encode via the encode() method.
input_ids             Tensor  No        Tokenized input IDs of shape [batch, seq_len] for forward().
attention_mask        Tensor  No        Attention mask tensor of shape [batch, seq_len] for forward().
token_type_ids        Tensor  No        Token type IDs for forward().
position_ids          Tensor  No        Position IDs for forward().
head_mask             Tensor  No        Head mask for forward().
inputs_embeds         Tensor  No        Pre-computed input embeddings for forward().
output_attentions     bool    No        Whether to return attention weights.
return_dict           bool    No        Whether to return a dict; defaults to the config value.
output_hidden_states  bool    No        Whether to return all hidden states.

Outputs

Name               Type           Description
projection_state   Tensor         Linearly projected hidden states of shape [batch, seq_len, project_dim].
pooler_output      Tensor         Pooled (CLS token) and projected output.
last_hidden_state  Tensor         Last layer hidden states from the RoBERTa backbone.
hidden_states      tuple[Tensor]  All layer hidden states.
attentions         tuple[Tensor]  Attention weights from all layers.
sequence_out       Tensor         Raw sequence output from the last layer.
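Since forward() returns a plain dictionary, calling code may want to confirm that the documented entries are present before indexing into it. The helper below is a hypothetical sketch, not part of the module; the key names are the ones listed in the outputs table.

```python
EXPECTED_KEYS = (
    "pooler_output",
    "last_hidden_state",
    "hidden_states",
    "attentions",
    "projection_state",
    "sequence_out",
)


def missing_output_keys(outputs):
    """Return the documented keys that are absent from an outputs dict."""
    return [key for key in EXPECTED_KEYS if key not in outputs]


# Stand-in for a real forward() result, with one key removed on purpose.
fake_outputs = {key: None for key in EXPECTED_KEYS}
del fake_outputs["attentions"]
print(missing_output_keys(fake_outputs))
```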

Usage Examples

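import torch
from transformers import XLMRobertaTokenizer
from modules.xlmr import BertSeriesModelWithTransformation

# Build the encoder with its default XLM-RoBERTa-Large configuration.
# This constructs the architecture; trained weights are assumed to be
# loaded separately, e.g. from a checkpoint.
device = "cuda" if torch.cuda.is_available() else "cpu"
encoder = BertSeriesModelWithTransformation().to(device).eval()

# Encode text to get conditioning embeddings
prompt = "a painting of a sunset over mountains"
with torch.no_grad():
    embeddings = encoder.encode(prompt)
print(embeddings.shape)  # [1, 77, 768]

# Direct forward pass with tokenized inputs.
# Assumption: this tokenizer matches the one encode() uses internally.
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-large")
tokens = tokenizer(prompt, truncation=True, max_length=77,
                   padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(input_ids=tokens["input_ids"].to(device),
                      attention_mask=tokens["attention_mask"].to(device))
projection = outputs['projection_state']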
