Principle:Lucidrains X transformers Transformer As MLP
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Architecture, Neural_Networks |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
A technique that replaces the fixed weight matrices of a traditional MLP with transformer self-attention over node embeddings, treating every input, hidden, and output dimension as a node in a fully connected graph.
Description
The Transformer-as-MLP principle reconceptualizes a multi-layer perceptron as a message-passing network. Instead of fixed weight matrices connecting layers, each input, hidden, and output dimension is represented as a learnable "node" embedding. Each continuous input value is encoded via random Fourier features, and the resulting vector is added to the corresponding input node embedding. All nodes are concatenated into a single token sequence and processed by a transformer encoder, where self-attention lets every node communicate with every other node, so the nodes form a fully connected graph. After the final layer, the output is read from the output node embeddings. This replaces the rigid linear-then-nonlinear structure of MLPs with flexible learned interactions.
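The input encoding step described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the helper `fourier_encode`, the cosine-with-random-phase form of the features, the embedding width `d`, and the initialisation scales are all hypothetical choices made for the example.

```python
import numpy as np

def fourier_encode(x, freqs, phases):
    """Encode each scalar in x as a d-dimensional vector of cosine features.

    x:      (dim_in,) continuous input values
    freqs:  (d,) fixed random frequencies (assumed not trained)
    phases: (d,) fixed random phase offsets (assumed not trained)
    returns (dim_in, d)
    """
    return np.cos(x[:, None] * freqs[None, :] + phases[None, :])

rng = np.random.default_rng(0)
dim_in, d = 3, 8  # hypothetical sizes

freqs = rng.normal(size=d)
phases = rng.uniform(0.0, 2.0 * np.pi, size=d)

# Stand-in for the learnable input node embeddings (one row per input dimension).
input_embed = rng.normal(scale=0.02, size=(dim_in, d))

x = np.array([0.5, -1.0, 2.0])
input_nodes = input_embed + fourier_encode(x, freqs, phases)
print(input_nodes.shape)  # (3, 8)
```

In this sketch the Fourier features carry the value of each input, while the added node embedding carries its identity: two input dimensions holding the same value get the same features but still end up as different tokens.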
Usage
Use this principle when experimenting with alternatives to standard MLPs, particularly in settings where the input-output mapping may benefit from attention-based routing. The approach is relevant for novel view synthesis, neural radiance fields, and any function approximation task where flexible inter-dimensional communication is desirable.
Theoretical Basis
Pseudo-code Logic:
# Abstract algorithm (NOT real implementation)
# x holds dim_in scalar inputs; every node is a vector of width d
# learnable_embed has shape (dim_in + dim_hidden + dim_out, d)
# Represent each dimension as a node
input_nodes = learnable_embed[0:dim_in] + fourier_encode(x)  # (dim_in, d)
hidden_nodes = learnable_embed[dim_in:dim_in+dim_hidden]     # (dim_hidden, d)
output_nodes = learnable_embed[-dim_out:]                    # (dim_out, d)
# Self-attention as message passing on fully connected graph
all_nodes = concat(input_nodes, hidden_nodes, output_nodes)
all_nodes = transformer_encoder(all_nodes)
# Read output from output nodes
output = project(all_nodes[-dim_out:])                       # (dim_out,)
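The key insight is that self-attention naturally implements learned, data-dependent routing between nodes: the attention weights are recomputed for every input, whereas the weight matrices of an MLP apply the same connectivity to every input once training is done.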
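To make the routing idea concrete, here is a small runnable NumPy sketch of the whole forward pass. It is a deliberately reduced illustration, not the reference implementation: it assumes a single attention layer with one head, no layer normalisation or feed-forward block, cosine Fourier features without phases, and a per-node dot product as the output projection. The names `init_params` and `forward`, the width `d`, and all initialisation scales are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def init_params(dim_in, dim_hidden, dim_out, d, rng):
    n = dim_in + dim_hidden + dim_out  # total number of nodes
    s = 1.0 / np.sqrt(d)
    return {
        "embed": rng.normal(scale=0.02, size=(n, d)),   # one embedding per node
        "freqs": rng.normal(size=d),                     # fixed random frequencies
        "wq": rng.normal(scale=s, size=(d, d)),
        "wk": rng.normal(scale=s, size=(d, d)),
        "wv": rng.normal(scale=s, size=(d, d)),
        "w_out": rng.normal(scale=s, size=(dim_out, d)), # one readout vector per output node
    }

def forward(x, p):
    dim_in = x.shape[0]
    dim_out = p["w_out"].shape[0]

    # Input nodes = node embedding + Fourier features of the input value.
    nodes = p["embed"].copy()
    nodes[:dim_in] += np.cos(x[:, None] * p["freqs"][None, :])

    # Self-attention: every node attends to every node (fully connected graph).
    q, k, v = nodes @ p["wq"], nodes @ p["wk"], nodes @ p["wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n, n), depends on x
    nodes = nodes + attn @ v                        # residual update

    # Read one scalar from each output node.
    out = np.sum(nodes[-dim_out:] * p["w_out"], axis=-1)
    return out, attn

rng = np.random.default_rng(0)
params = init_params(dim_in=3, dim_hidden=4, dim_out=2, d=16, rng=rng)

y_a, attn_a = forward(np.array([0.5, -1.0, 2.0]), params)
y_b, attn_b = forward(np.array([1.5, 0.3, -0.7]), params)
print(y_a.shape, attn_a.shape)  # (2,) (9, 9)
```

With the parameters held fixed, `attn_a` and `attn_b` differ because the attention matrix is a function of the input. That input dependence is what distinguishes this routing from the static connectivity of an MLP weight matrix.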