Implementation:Pyro ppl Pyro ProfileHMM
| Property | Value |
|---|---|
| Implementation Type | Pattern Doc |
| Source File | examples/contrib/mue/ProfileHMM.py |
| Module | pyro.contrib.mue |
| Pyro Features | pyro.contrib.mue.models.ProfileHMM, pyro.contrib.mue.dataloaders.BiosequenceDataset, SVI, MultiStepLR scheduler |
| References | Durbin et al. (1998), "Biological sequence analysis"; Weinstein & Marks (2021) |
Overview
This file provides a training script for the Profile HMM, a standard probabilistic model for biological sequence families. The Profile HMM corresponds to a constant (delta function) distribution over the latent sequence combined with a MuE (mutational emission) observation distribution, which makes it a special case of the FactorMuE model with no latent factors.
Unlike the FactorMuE, the Profile HMM does not learn a latent representation. Instead, it directly models:
- Consensus sequence positions with emission probabilities
- Insertion states allowing extra characters between consensus positions
- Deletion states allowing consensus positions to be skipped
The model handles variable-length sequences without requiring a pre-computed multiple sequence alignment, learning the alignment implicitly through the MuE observation distribution.
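The three kinds of states listed above can be illustrated with a toy sampler. This is a minimal sketch in plain Python, not the parameterization used by pyro.contrib.mue; the function name `sample_profile_hmm` and the probabilities `p_insert`, `p_delete`, and `p_sub` are hypothetical and chosen only for illustration.

```python
import random


def sample_profile_hmm(consensus, alphabet, p_insert=0.1, p_delete=0.1,
                       p_sub=0.05, rng=None):
    """Draw one sequence by walking the consensus positions left to right."""
    if rng is None:
        rng = random.Random(0)
    out = []
    for ch in consensus:
        # Insertion state: emit extra characters before this consensus position.
        while rng.random() < p_insert:
            out.append(rng.choice(alphabet))
        # Deletion state: skip this consensus position entirely.
        if rng.random() < p_delete:
            continue
        # Match state: emit the consensus character, or occasionally another one.
        out.append(rng.choice(alphabet) if rng.random() < p_sub else ch)
    return "".join(out)


# Sequences of different lengths arise from the same consensus.
rng = random.Random(1)
samples = [sample_profile_hmm("ACGTACGT", "ACGT", rng=rng) for _ in range(5)]
```

Because insertions lengthen a sample and deletions shorten it, sequences drawn from one consensus differ in length, which is why no pre-computed alignment is needed.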
Code Reference
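import json
import torch
from torch.optim import Adam

from pyro.contrib.mue.dataloaders import BiosequenceDataset
from pyro.contrib.mue.models import ProfileHMM
from pyro.optim import MultiStepLR

def main(args):
    # Load the FASTA dataset onto the chosen device (simplified device selection).
    device = torch.device("cuda" if args.cuda else "cpu")
    dataset = BiosequenceDataset(args.file, "fasta", args.alphabet,
                                 include_stop=args.include_stop, device=device)
    # Default the latent (consensus) length to 1.1x the longest sequence.
    latent_seq_length = args.latent_seq_length
    if latent_seq_length is None:
        latent_seq_length = int(dataset.max_length * 1.1)
    model = ProfileHMM(
        latent_seq_length, dataset.alphabet_length,
        prior_scale=args.prior_scale,
        indel_prior_bias=args.indel_prior_bias,
        cuda=args.cuda,
    )
    # Fit with SVI; the learning rate changes at the JSON-encoded milestones.
    scheduler = MultiStepLR({"optimizer": Adam, "optim_args": {"lr": args.learning_rate},
                             "milestones": json.loads(args.milestones)})
    losses = model.fit_svi(dataset, args.n_epochs, args.batch_size, scheduler, args.jit)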
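The effect of the `milestones` setting can be sketched without running the training loop. The snippet below is a plain-Python illustration that assumes the scheduler is stepped once per epoch and that PyTorch's default decay factor `gamma=0.1` applies (the excerpt above does not pass `gamma`); the helper `lr_at_epoch` is hypothetical and not part of Pyro or PyTorch.

```python
import json
from bisect import bisect_right


def lr_at_epoch(base_lr, milestones, epoch, gamma=0.1):
    """Learning rate after `epoch` scheduler steps under a multi-step decay."""
    # Each milestone already reached multiplies the rate by gamma once.
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


# Milestones arrive on the command line as a JSON list, e.g. "[5, 10]".
milestones = json.loads("[5, 10]")
schedule = [lr_at_epoch(0.01, milestones, epoch) for epoch in range(12)]
```

With these illustrative values the rate stays at 0.01 for epochs 0 to 4, drops by a factor of ten at epoch 5, and drops again at epoch 10.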
I/O Contract
| Parameter | Type | Description |
|---|---|---|
| -f / --file | str | Input FASTA file path |
| -a / --alphabet | str | Alphabet type: "amino-acid", "dna", or custom |
| -M / --latent-seq-length | int | Latent (consensus) sequence length (default: 1.1x max length) |
| --prior-scale | float | Prior scale for all parameters (default: 1.0) |
| --indel-prior-bias | float | Indel prior bias (default: 10.0) |
| --split | float | Train/test split fraction (default: 0.2) |
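For orientation, the parameters in the table could be declared with `argparse` roughly as follows. This is a sketch built only from the table above, not the script's actual parser: `build_parser` is a hypothetical helper, flags not listed in the table are omitted, and leaving `--file` and `--alphabet` without defaults is an assumption.

```python
import argparse


def build_parser():
    """Hypothetical parser covering only the parameters listed in the table."""
    parser = argparse.ArgumentParser(description="Profile HMM training (sketch)")
    parser.add_argument("-f", "--file", type=str, help="Input FASTA file path")
    parser.add_argument("-a", "--alphabet", type=str,
                        help='Alphabet type: "amino-acid", "dna", or custom')
    # None means "derive from the data": 1.1x the longest sequence.
    parser.add_argument("-M", "--latent-seq-length", type=int, default=None,
                        help="Latent (consensus) sequence length")
    parser.add_argument("--prior-scale", type=float, default=1.0,
                        help="Prior scale for all parameters")
    parser.add_argument("--indel-prior-bias", type=float, default=10.0,
                        help="Indel prior bias")
    parser.add_argument("--split", type=float, default=0.2,
                        help="Train/test split fraction")
    return parser


# argparse maps dashes to underscores, so -M becomes args.latent_seq_length.
args = build_parser().parse_args(["-f", "seqs.fasta", "-M", "174"])
```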
Output:
- Training and test log-probability and perplexity
- Loss curve plot, insertion/deletion probability plots
- Saved parameter store and evaluation results
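The relationship between the reported log-probability and perplexity can be sketched as below. This assumes perplexity is the exponential of the mean negative log-probability per residue, averaged over sequences; the exact normalization used by the script is not stated here, and `per_residue_perplexity` is a hypothetical helper.

```python
import math


def per_residue_perplexity(log_probs, lengths):
    """exp of the mean (over sequences) negative log-probability per residue."""
    per_seq = [-lp / n for lp, n in zip(log_probs, lengths)]
    return math.exp(sum(per_seq) / len(per_seq))


# Sanity check: a model that is uniform over a 4-letter alphabet assigns each
# residue probability 1/4, so its perplexity equals the alphabet size.
lengths = [10, 25]
log_probs = [n * math.log(1 / 4) for n in lengths]
uniform_perplexity = per_residue_perplexity(log_probs, lengths)
```

Under this definition a lower perplexity means the model concentrates more probability on the observed residues, with the alphabet size as the uninformed baseline.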
Usage Examples
# Train ProfileHMM on protein data
# python ProfileHMM.py -f ve6_full.fasta -b 10 -M 174 --indel-prior-bias 10. \
# -e 15 -lr 0.01 --jit --cuda
# Quick test with generated data
# python ProfileHMM.py --test --small -e 5
Related Pages
- Pyro_ppl_Pyro_FactorMuE - More complex FactorMuE model with latent PCA factors