Principle:Pyro ppl Pyro Biological Sequence Models
| Knowledge Sources | |
|---|---|
| Domains | Bioinformatics, Sequence Analysis, Evolutionary Biology |
| Last Updated | 2026-02-09 09:00 GMT |
Overview
Biological sequence models, including profile HMMs and MuE (mutational emission) models, provide probabilistic frameworks for modeling protein and nucleotide sequences, enabling alignment, homology detection, and fitness prediction.
Description
Biological sequences (proteins, DNA, RNA) are the fundamental data of molecular biology. Probabilistic models of these sequences enable:
- Sequence alignment: Finding the best correspondence between related sequences.
- Homology detection: Determining whether two sequences share a common ancestor.
- Fitness prediction: Predicting the functional effect of mutations.
- Sequence generation: Sampling new sequences with desired properties.
Profile Hidden Markov Models (Profile HMMs): The standard model for protein families. A profile HMM represents a multiple sequence alignment as a left-to-right HMM with three types of states at each position:
- Match states: Emit an amino acid at this position (the "consensus" position).
- Insert states: Emit extra amino acids between consensus positions.
- Delete states: Skip this consensus position (silent states).
The transition and emission probabilities are position-specific, capturing the conservation pattern of the protein family. Profile HMMs are the foundation of the HMMER search software and the Pfam protein family database.
MuE (mutational emission) model: A generative model that factorizes the process of generating biological sequences into:
- Ancestral sequence: A reference sequence from which variants derive.
- Alignment: How the variant aligns to the ancestor (insertions, deletions).
- Mutations: Substitutions relative to the aligned ancestral positions.
The MuE model provides a principled likelihood function for sets of related sequences, including unaligned ones, enabling Bayesian inference of evolutionary parameters and prediction of mutational effects. It separates the effects of alignment (indels) from substitution (point mutations), so that substitution parameters are estimated without being confounded by indel variation.
Usage
Use biological sequence models when:
- Aligning protein or DNA sequences to a reference or to each other.
- Building probabilistic models of protein families for homology search.
- Predicting the fitness effect of mutations in proteins.
- Generating new protein sequences for directed evolution or protein design.
- Analyzing evolutionary relationships between sequences.
Theoretical Basis
Profile HMM architecture:
# For a profile of length L (consensus positions):
# States at position j: M_j (match), I_j (insert), D_j (delete)
# Plus: begin (B) and end (E) states
# Transition probabilities (position-specific):
# M_j -> M_{j+1}: t_MM(j) (match to match)
# M_j -> I_j: t_MI(j) (match to insert)
# M_j -> D_{j+1}: t_MD(j) (match to delete)
# I_j -> M_{j+1}: t_IM(j) (insert to match)
# I_j -> I_j: t_II(j) (insert to insert)
# D_j -> M_{j+1}: t_DM(j) (delete to match)
# D_j -> D_{j+1}: t_DD(j) (delete to delete)
# Emission probabilities:
# e_M(a | j): probability of amino acid a at match state j
# e_I(a | j): probability of amino acid a at insert state j
# Delete states are silent (no emission)
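The state graph above can be made concrete with a small sketch. The function name `build_profile_transitions` and the numeric values are hypothetical, and for brevity the same probabilities are reused at every position, whereas a real profile stores position-specific values. Two boundary conventions are assumptions of this sketch, not something the architecture listing specifies: the begin state B plays the role of M_0, and at the last position the match-to-delete mass is folded into the exit to E because no D_{L+1} exists.

```python
def build_profile_transitions(L, t_MM=0.8, t_MI=0.1, t_MD=0.1,
                              t_IM=0.7, t_II=0.3, t_DM=0.9, t_DD=0.1):
    """Return {state: {next_state: prob}} for a length-L profile HMM.

    Toy assumption: one set of values is shared by all positions.
    """
    def match(j):
        # M_0 acts as the begin state B and M_{L+1} as the end state E.
        if j == 0:
            return "B"
        return "E" if j == L + 1 else f"M{j}"

    trans = {}
    for j in range(L + 1):
        last = (j == L)  # no D_{L+1}: fold that mass into the exit to E
        trans[match(j)] = {match(j + 1): t_MM + (t_MD if last else 0.0),
                           f"I{j}": t_MI}
        if not last:
            trans[match(j)][f"D{j + 1}"] = t_MD
        trans[f"I{j}"] = {match(j + 1): t_IM, f"I{j}": t_II}
        if j >= 1:
            trans[f"D{j}"] = ({"E": 1.0} if last else
                              {match(j + 1): t_DM, f"D{j + 1}": t_DD})
    return trans


transitions = build_profile_transitions(L=3)
for state, row in transitions.items():
    # Every state's outgoing probabilities must form a distribution.
    assert abs(sum(row.values()) - 1.0) < 1e-12, state
    print(state, row)
```

Only the seven transition types listed above are created, so an insert state never leads directly to a delete state or vice versa.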
Sequence likelihood under Profile HMM:
# For observed sequence x = (x_1, ..., x_n):
# p(x | profile) = sum over all state paths pi:
# p(pi) * product_{t: pi_t emits} e(x_{emit(t)} | pi_t)
# Computed efficiently via forward algorithm:
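# alpha_t(s) = p(x_1..x_t, path is in state s after emitting t symbols)
# alpha_t(M_j) = e_M(x_t | j) * [alpha_{t-1}(M_{j-1}) * t_MM(j-1)
# + alpha_{t-1}(I_{j-1}) * t_IM(j-1)
# + alpha_{t-1}(D_{j-1}) * t_DM(j-1)]
# alpha_t(I_j) = e_I(x_t | j) * [alpha_{t-1}(M_j) * t_MI(j)
# + alpha_{t-1}(I_j) * t_II(j)]
# alpha_t(D_j) = alpha_t(M_{j-1}) * t_MD(j-1)
# + alpha_t(D_{j-1}) * t_DD(j-1) # D is silent, so t does not advance
# Cost: O(n * L) time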
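A minimal pure-Python sketch of the forward recursion follows, using the standard textbook form in which every term feeding an emitting state (including the delete term) is read from the previous emission index, while delete states are updated within the same index. The helper `make_toy_profile` and all of its numbers are hypothetical, chosen only so the example runs; as an assumption of this sketch, the begin state is stored at index 0 and the last position sends its delete mass to the end state.

```python
def make_toy_profile(L, alphabet="ACGT"):
    """Hypothetical toy parameters; index j = 0 is the begin state B."""
    K = len(alphabet)
    prof = {"L": L, "alphabet": alphabet}
    # Transitions out of position j (0..L). At j = L there is no D_{L+1},
    # so that mass is folded into the exit to the end state E.
    prof["t_MM"] = [0.8] * L + [0.9]
    prof["t_MI"] = [0.1] * (L + 1)
    prof["t_MD"] = [0.1] * L + [0.0]
    prof["t_IM"] = [0.7] * (L + 1)
    prof["t_II"] = [0.3] * (L + 1)
    prof["t_DM"] = [0.9] * L + [1.0]  # entry 0 is unused (there is no D_0)
    prof["t_DD"] = [0.1] * L + [0.0]
    # Match state j favours one letter; insert states emit uniformly.
    prof["e_M"] = [None] + [
        {a: (0.7 if a == alphabet[(j - 1) % K] else 0.3 / (K - 1))
         for a in alphabet}
        for j in range(1, L + 1)]
    prof["e_I"] = [{a: 1.0 / K for a in alphabet} for _ in range(L + 1)]
    return prof


def forward_likelihood(x, prof):
    """p(x | profile), summing over all state paths in O(n * L) time."""
    n, L = len(x), prof["L"]
    # fM[i][j]: probability of emitting x[:i] and being in M_j (same for I, D).
    fM = [[0.0] * (L + 1) for _ in range(n + 1)]
    fI = [[0.0] * (L + 1) for _ in range(n + 1)]
    fD = [[0.0] * (L + 1) for _ in range(n + 1)]
    fM[0][0] = 1.0  # start in B having emitted nothing
    for i in range(n + 1):
        for j in range(L + 1):
            if i >= 1 and j >= 1:
                # Emitting states read from row i - 1, including the D term.
                fM[i][j] = prof["e_M"][j][x[i - 1]] * (
                    fM[i - 1][j - 1] * prof["t_MM"][j - 1]
                    + fI[i - 1][j - 1] * prof["t_IM"][j - 1]
                    + fD[i - 1][j - 1] * prof["t_DM"][j - 1])
            if i >= 1:
                fI[i][j] = prof["e_I"][j][x[i - 1]] * (
                    fM[i - 1][j] * prof["t_MI"][j]
                    + fI[i - 1][j] * prof["t_II"][j])
            if j >= 1:
                # Delete states are silent, so they stay in row i.
                fD[i][j] = (fM[i][j - 1] * prof["t_MD"][j - 1]
                            + fD[i][j - 1] * prof["t_DD"][j - 1])
    # Terminate by moving from the last position into the end state E.
    return (fM[n][L] * prof["t_MM"][L] + fI[n][L] * prof["t_IM"][L]
            + fD[n][L] * prof["t_DM"][L])


profile = make_toy_profile(L=3)
print(forward_likelihood("ACG", profile))   # matches the favoured letters
print(forward_likelihood("TTT", profile))   # lower: disfavoured at each match
```

For real sequences the products underflow quickly, so a practical implementation would run the same recursion in log space with a log-sum-exp.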
MuE model:
# Generative process for a sequence x:
# 1. Ancestral sequence: a = (a_1, ..., a_L) (reference)
# 2. Alignment: A ~ AlignmentDistribution(L, n)
# A is a binary matrix mapping ancestor positions to sequence positions
# 3. For each sequence position t:
# if aligned to ancestor position j:
# x_t ~ MutationModel(a_j) # substitution distribution
# else:
# x_t ~ InsertionModel() # insertion distribution
# Factored likelihood:
# p(x | a) = sum_A p(A) * product_t p(x_t | a, A)
# The alignment A is marginalized out by dynamic programming,
# analogous to the forward algorithm but over alignment-specific structure
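The generative steps above can be sketched as a small sampler. This is a simplified illustration, not the MuE implementation: the function name `sample_variant` is hypothetical, the insertion, deletion, and substitution rates are shared across positions (an assumption; the model described here is position-specific), substitutions and insertions are drawn uniformly, and the alignment is returned as a list of ancestor indices rather than a binary matrix.

```python
import random

ALPHABET = "ACGT"  # toy nucleotide alphabet


def sample_variant(ancestor, p_insert=0.05, p_delete=0.05, p_mutate=0.1,
                   rng=None):
    """Sample one variant of `ancestor` with indels and substitutions.

    Returns (sequence, alignment), where alignment[t] is the ancestor index
    that position t derives from, or None for an inserted letter.
    """
    rng = rng or random.Random()
    seq, alignment = [], []

    def maybe_insert():
        # Geometric number of inserted letters, each drawn uniformly.
        while rng.random() < p_insert:
            seq.append(rng.choice(ALPHABET))
            alignment.append(None)

    maybe_insert()  # insertions before the first ancestral position
    for j, letter in enumerate(ancestor):
        if rng.random() >= p_delete:  # position j is kept (aligned)
            if rng.random() < p_mutate:
                # Substitution: replace with a different letter.
                letter = rng.choice([c for c in ALPHABET if c != letter])
            seq.append(letter)
            alignment.append(j)
        maybe_insert()  # insertions after ancestral position j
    return "".join(seq), alignment


variant, alignment = sample_variant("ACGTACGT", rng=random.Random(0))
print(variant)
print(alignment)
```

Because the alignment is returned alongside the sequence, it is easy to see which differences from the ancestor come from indels and which come from substitutions. Inference runs the other way: only the sequence is observed, and the alignment is summed out.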
Mutational effect prediction:
# Given a trained model, score a substitution at position j from residue u to v:
# delta_fitness(j, u->v) = log p(x_mutant) - log p(x_wildtype)
# The log-likelihood ratio serves as a proxy for the fitness effect.
# This can be decomposed:
# = log e_M(v | j) - log e_M(u | j) (local substitution effect)
# + coupling terms (context-dependent effects, present only if the model couples positions)
# The MuE model provides a principled decomposition of mutational effects
# into alignment-related and substitution-related components
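The local substitution term can be illustrated with a short sketch. It assumes a gap-free toy alignment and a site-independent model, so coupling and alignment-related terms are ignored entirely. The function names, the pseudocount, and the four example sequences are made up for illustration, and positions are 0-based in the code.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def match_emissions(aligned_seqs, pseudocount=1.0):
    """Estimate e_M(. | j) from the columns of a gap-free toy alignment."""
    length = len(aligned_seqs[0])
    total = len(aligned_seqs) + pseudocount * len(ALPHABET)
    emissions = []
    for j in range(length):
        counts = Counter(seq[j] for seq in aligned_seqs)
        # The pseudocount keeps unseen residues at nonzero probability.
        emissions.append({a: (counts[a] + pseudocount) / total
                          for a in ALPHABET})
    return emissions


def substitution_effect(emissions, j, wild, mutant):
    """Local term: log e_M(mutant | j) - log e_M(wild | j)."""
    return math.log(emissions[j][mutant]) - math.log(emissions[j][wild])


family = ["MKVL", "MKIL", "MRVL", "MKVL"]  # made-up example sequences
e_M = match_emissions(family)
print(substitution_effect(e_M, 0, "M", "A"))  # conserved site: strongly negative
print(substitution_effect(e_M, 2, "V", "I"))  # variable site: mildly negative
```

A negative score means the mutant residue is less probable than the wild type at that position under the fitted emissions. Whether that tracks measured fitness depends on the protein family and the assay.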