Principle:Lucidrains X transformers Multi Stream Input Fusion
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Multi_Modal, Model_Architecture |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Technique that processes multiple named token input streams through a shared transformer by summing their embeddings, enabling multi-modal or multi-type sequence processing.
Description
Multi-Stream Input Fusion is an architecture pattern in which multiple parallel token inputs, each with its own vocabulary and embedding table, are combined into a single representation by summing their embeddings. Because the sum is element-wise, all streams must be aligned position by position and share the same sequence length and embedding dimension. BERT uses the same approach to combine token, segment, and position embeddings. Each input stream contributes additively to the final embedding, which is then processed by shared attention layers. The output can be projected back into a separate logit space for each input stream. The pattern generalizes to any number of named input types, enabling flexible multi-modal or multi-annotation architectures.
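The fusion step on its own can be sketched in a few lines of PyTorch. This is a minimal illustration, not the library's implementation: the stream names (`token`, `segment`), vocabulary sizes, and dimensions are hypothetical values chosen for the example.

```python
import torch
import torch.nn as nn

# Hypothetical stream names and vocabulary sizes, chosen only for illustration.
vocab_sizes = {"token": 1000, "segment": 2}
dim = 16

# One embedding table per named stream, all mapping into the same dimension.
embedding_table = nn.ModuleDict(
    {name: nn.Embedding(size, dim) for name, size in vocab_sizes.items()}
)

batch, seq_len = 2, 5
named_inputs = {
    name: torch.randint(0, size, (batch, seq_len))
    for name, size in vocab_sizes.items()
}

# The sum is element-wise, so every stream must have the same (batch, seq_len) shape.
combined = sum(embedding_table[name](ids) for name, ids in named_inputs.items())
print(combined.shape)  # torch.Size([2, 5, 16])
```

Summing keeps the sequence length and model width fixed no matter how many streams are added, which is why the shared transformer needs no changes when a new stream is introduced.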
Usage
Use this principle when designing transformer architectures that need to process multiple types of input tokens simultaneously at each position, such as token + type IDs (BERT-style), text + image patch tokens, or any multi-annotation scenario where each position has multiple categorical attributes.
Theoretical Basis
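The combined embedding at position $t$ is the sum of one embedding lookup per stream plus the positional embedding:

$$e_t = \sum_{k \in \mathcal{K}} E_k\big[x_t^{(k)}\big] + p_t$$

where $\mathcal{K}$ is the set of stream names, $E_k$ is the embedding table for stream $k$, $x_t^{(k)}$ is that stream's token ID at position $t$, and $p_t$ is the positional embedding.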
Pseudo-code Logic:
# Abstract algorithm (NOT real implementation)
combined_embedding = 0
for name, token_ids in named_inputs.items():
    combined_embedding += embedding_table[name](token_ids)
combined_embedding += positional_embedding
hidden = transformer(combined_embedding)
# Separate output heads
logits = {name: output_head[name](hidden) for name in named_inputs}
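
The pseudo-code above can be turned into a small runnable PyTorch sketch. It is an illustration under stated assumptions rather than the library's actual implementation: the class name `MultiStreamFusion`, the learned absolute positional embedding, the use of `nn.TransformerEncoder` as the shared stack, and all sizes are hypothetical choices made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiStreamFusion(nn.Module):
    """Hypothetical module: sum per-stream embeddings, run a shared
    encoder, and emit one set of logits per stream."""

    def __init__(self, vocab_sizes, dim=32, depth=2, heads=4, max_seq_len=64):
        super().__init__()
        self.embedding_table = nn.ModuleDict(
            {name: nn.Embedding(size, dim) for name, size in vocab_sizes.items()}
        )
        # Learned absolute positions (an assumption; other schemes also work).
        self.positional_embedding = nn.Embedding(max_seq_len, dim)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        # One output head per stream, each projecting to that stream's vocabulary.
        self.output_head = nn.ModuleDict(
            {name: nn.Linear(dim, size) for name, size in vocab_sizes.items()}
        )

    def forward(self, named_inputs):
        first = next(iter(named_inputs.values()))
        positions = torch.arange(first.shape[1], device=first.device)
        # (seq_len, dim); broadcasts over the batch dimension when summed.
        combined = self.positional_embedding(positions)
        for name, token_ids in named_inputs.items():
            combined = combined + self.embedding_table[name](token_ids)
        hidden = self.transformer(combined)
        return {name: self.output_head[name](hidden) for name in named_inputs}


vocab_sizes = {"token": 100, "segment": 2}  # hypothetical streams
model = MultiStreamFusion(vocab_sizes)
named_inputs = {
    name: torch.randint(0, size, (2, 10)) for name, size in vocab_sizes.items()
}
logits = model(named_inputs)

# One cross-entropy term per stream; the inputs double as targets here
# purely to show the shapes involved.
loss = sum(
    F.cross_entropy(logits[name].transpose(1, 2), named_inputs[name])
    for name in vocab_sizes
)
```

Each entry of `logits` has shape `(batch, seq_len, vocab_size)` for its own stream, so per-stream losses can be weighted or dropped independently.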