

Principle:FlagOpen FlagEmbedding RetroMAE Pretraining

From Leeroopedia


Knowledge Sources
Domains: Machine Learning, Self-Supervised Learning, Text Embeddings, Autoencoders
Last Updated: 2026-02-09 00:00 GMT

Overview

Masked auto-encoder pretraining for text encoders that uses asymmetric encoder-decoder architectures to learn robust sentence representations through reconstruction objectives.

Description

RetroMAE (Retrieval-oriented Masked Auto-Encoder) is a self-supervised pretraining approach designed for text encoders used in retrieval. Like a standard masked language model it hides tokens and predicts them, but it splits the work between two networks: an encoder compresses a masked copy of the input into a single sentence embedding, and a decoder must reconstruct the original sequence from that embedding plus a second, independently masked copy of the input. The key design choice is asymmetry. In the original RetroMAE setup the encoder is a full-scale BERT-style transformer whose input is masked at a moderate ratio (15-30%), while the decoder is deliberately weak (a single transformer layer) and its input is masked aggressively (50-70%). Because the decoder sees so little of the text and has so little capacity, reconstruction succeeds only if the sentence embedding carries the semantic content of the input, which pushes the encoder toward embeddings suited to retrieval. The pretrained encoder can then be fine-tuned with contrastive learning for downstream retrieval tasks.
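The two masked views used during pretraining can be sketched in plain Python. This is a minimal illustration, not the FlagEmbedding implementation: the helper names mask_tokens and asymmetric_views are hypothetical, whitespace tokens stand in for subword IDs, and the ratios (0.3 for the encoder input, 0.5 for the decoder input) are assumptions chosen from the ranges reported in the RetroMAE paper.

```python
import random

MASK = "[MASK]"  # placeholder token; real tokenizers use a mask token ID

def mask_tokens(tokens, ratio, rng):
    """Replace about `ratio` of the tokens with MASK.

    Returns the masked copy and the sorted list of masked positions.
    """
    n_mask = max(1, round(len(tokens) * ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for i in positions:
        masked[i] = MASK
    return masked, positions

def asymmetric_views(tokens, enc_ratio=0.3, dec_ratio=0.5, seed=0):
    """Sample two independent masks: a moderate one for the encoder input
    and a more aggressive one for the decoder input (assumed ratios)."""
    rng = random.Random(seed)
    enc_input, enc_pos = mask_tokens(tokens, enc_ratio, rng)
    dec_input, dec_pos = mask_tokens(tokens, dec_ratio, rng)
    return enc_input, enc_pos, dec_input, dec_pos

tokens = "retrieval models map queries and passages into one vector space".split()
enc_input, enc_pos, dec_input, dec_pos = asymmetric_views(tokens)
print("encoder input:", enc_input)
print("decoder input:", dec_input)
```

The two masks are drawn independently, so a token hidden from the decoder may or may not be visible to the encoder. Whatever the decoder cannot see has to be recovered through the sentence embedding.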

Usage

Use this principle when:

  • Pretraining text encoders for retrieval tasks from scratch
  • Adapting encoder models to be more retrieval-oriented
  • Building domain-specific embedding models with unlabeled data
  • Creating foundational representations before contrastive fine-tuning

Theoretical Basis

The RetroMAE pretraining framework consists of:

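  1. Masking Strategy:
    • Two masked copies of the input X are sampled independently: X_enc for the encoder and X_dec for the decoder
    • Encoder input: moderate masking, 15-30% of tokens
    • Decoder input: aggressive masking, 50-70% of tokens
  2. Asymmetric Architecture:
    • Encoder: Full-scale BERT-style transformer (e.g. 12 layers)
    • Decoder: A single transformer layer, kept shallow so that reconstruction depends on the encoder's sentence embedding
    • Enhanced decoding (as described in the RetroMAE paper):
      • Query stream built from the sentence embedding plus positional embeddings
      • Context stream built from the sentence embedding plus token embeddings
      • Position-specific attention mask, so each token is reconstructed from its own sampled context; decoding is non-autoregressive
  3. Training Objective:
    • Reconstruction loss: L_dec = -Σ_i log P(x_i | h_enc, X_dec)
    • Where h_enc = Encoder(X_enc) is the sentence representation (the final [CLS] hidden state)
    • The encoder's own masked-language-modeling loss on X_enc is added to L_dec
    • Forces the encoder to preserve semantic content, since the decoder alone cannot recover the aggressively masked input
  4. Advantages for Retrieval:
    • Encoder learns to capture global semantics, not only local patterns
    • Sentence embedding must stay informative even when most of the decoder input is hidden
    • Reported to give a better initialization than plain MLM for contrastive fine-tuning
  5. Two-stage Training:
    • Stage 1: RetroMAE pretraining on a large unlabeled corpus
    • Stage 2: Contrastive fine-tuning on labeled retrieval data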
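
The reconstruction loss in the Training Objective item above is a sum of negative log-probabilities over the positions being reconstructed. The sketch below works through that arithmetic on toy numbers. The function name reconstruction_loss and the probability tables are hypothetical; the probabilities are assumed to be decoder outputs already conditioned on h_enc and the masked decoder input, and a real implementation would compute the same quantity with a framework's cross-entropy loss over vocabulary logits.

```python
import math

def reconstruction_loss(probs, targets, positions):
    """Sum of -log P(x_i | ...) over the given positions.

    probs[i] maps each candidate token to the (assumed) decoder probability
    at position i; targets[i] is the original token at position i.
    """
    return -sum(math.log(probs[i][targets[i]]) for i in positions)

# Original tokens the decoder should recover.
targets = ["dense", "retrieval", "works"]

# Hypothetical decoder output distributions, one per position.
dec_probs = [
    {"dense": 0.5, "sparse": 0.5},
    {"retrieval": 0.25, "ranking": 0.75},
    {"works": 1.0},
]

# Suppose positions 0 and 1 were masked in the decoder input.
loss = reconstruction_loss(dec_probs, targets, positions=[0, 1])
# -(log 0.5 + log 0.25) = log 2 + log 4 = log 8
print(round(loss, 4))  # 2.0794
```

A confident, correct prediction contributes close to zero, while a low probability on the true token dominates the sum. The gradient of this loss therefore flows back into the encoder through h_enc whenever the decoder cannot resolve a position from its own visible context.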
