Principle:Lucidrains X transformers Transformer As MLP

Knowledge Sources	NeoMLP LVSM
Domains	Deep_Learning, Model_Architecture, Neural_Networks
Last Updated	2026-02-08 18:00 GMT

Overview

Technique that replaces a traditional MLP with transformer self-attention over node embeddings, treating each input, hidden, and output dimension as a node in a fully connected graph.

Description

The Transformer-as-MLP principle reconceptualizes a multi-layer perceptron as a message-passing network. Instead of fixed weight matrices connecting layers, each input, hidden, and output dimension is represented as a learnable "node" embedding. Each continuous input value is encoded via random Fourier features, and the encoding is added to the embedding of its input node. All nodes are concatenated into a single token sequence and processed by a transformer encoder, where self-attention lets every node communicate with every other node, which is equivalent to message passing on a fully connected graph. The output is read from the output node embeddings. The rigid linear-then-nonlinear structure of an MLP is thereby replaced with flexible, learned interactions between nodes.
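The input-encoding step described above can be sketched in plain Python. This is a minimal illustration, not the x-transformers implementation: the helper names (make_frequencies, fourier_encode, build_input_nodes), the Gaussian sampling of frequencies, and the choice to make the encoding width equal to the embedding width (so that no projection layer is needed) are all assumptions made for the example.

```python
import math
import random


def make_frequencies(num_freqs, scale=1.0, seed=0):
    """Sample fixed random frequencies (assumed Gaussian for this sketch)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, scale) for _ in range(num_freqs)]


def fourier_encode(value, freqs):
    """Map one scalar to [sin(2*pi*f*v) for each f] + [cos(2*pi*f*v) for each f]."""
    angles = [2.0 * math.pi * f * value for f in freqs]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


def build_input_nodes(x, node_embed, freqs):
    """One token per input dimension: node embedding plus the encoding of its value."""
    tokens = []
    for value, embed in zip(x, node_embed):
        enc = fourier_encode(value, freqs)
        tokens.append([e + c for e, c in zip(embed, enc)])
    return tokens


dim_in, num_freqs = 3, 4
dim_model = 2 * num_freqs  # sin and cos halves; hypothetical sizing

freqs = make_frequencies(num_freqs)

# Stand-in for learnable node embeddings (random here, trained in practice).
rng = random.Random(1)
node_embed = [[rng.gauss(0.0, 0.02) for _ in range(dim_model)] for _ in range(dim_in)]

# Three scalar inputs become three tokens of width dim_model.
tokens = build_input_nodes([0.5, -1.0, 2.0], node_embed, freqs)
```

Each scalar input ends up as its own token, so an input vector of size dim_in contributes dim_in tokens to the sequence rather than a single vector.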

Usage

Use this principle when experimenting with alternatives to standard MLPs, particularly in settings where the input-output mapping may benefit from attention-based routing. The approach is relevant for novel view synthesis, neural radiance fields, and other function approximation tasks where flexible inter-dimensional communication is desirable.

Theoretical Basis

Pseudo-code Logic:

# Abstract algorithm (NOT real implementation)
# Represent each dimension as a node
# fourier_encode acts on each scalar x[i] separately, giving one vector per input node
input_nodes = learnable_embed[0:dim_in] + fourier_encode(x)
hidden_nodes = learnable_embed[dim_in:dim_in+dim_hidden]
output_nodes = learnable_embed[-dim_out:]

# Self-attention as message passing on fully connected graph
all_nodes = concat(input_nodes, hidden_nodes, output_nodes)
all_nodes = transformer_encoder(all_nodes)

# Read output from output nodes
output = project(all_nodes[-dim_out:])

The key insight is that self-attention naturally implements a learned, data-dependent routing between nodes, replacing the fixed connectivity of weight matrices.
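The data-dependent routing can be illustrated with a small plain-Python sketch of scaled dot-product attention weights. It is a simplification under stated assumptions: a single head, identity query/key/value maps instead of learned projections, and hypothetical helper names (softmax, attention_weights, attend). It is meant to show the routing behaviour only, not a full transformer layer.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(nodes):
    """Scaled dot-product weights between all node pairs (identity Q/K maps assumed)."""
    scale = 1.0 / math.sqrt(len(nodes[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in nodes])
        for q in nodes
    ]


def attend(nodes):
    """Each node's update is a weighted average over every node (identity V map assumed)."""
    weights = attention_weights(nodes)
    dim = len(nodes[0])
    return [
        [sum(w * v[d] for w, v in zip(row, nodes)) for d in range(dim)]
        for row in weights
    ]


# Two node sets that differ only in the first node's value.
nodes_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nodes_b = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights_a = attention_weights(nodes_a)
weights_b = attention_weights(nodes_b)
updated_a = attend(nodes_a)
```

Every entry of the weight matrix is strictly positive, so each node receives a message from every other node, and the weights differ between nodes_a and nodes_b because they are computed from the node values themselves. The weight matrices of a standard MLP, by contrast, are the same for every input once training is done.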

Related Pages

Implementation:Lucidrains_X_transformers_NeoMLP
