Implementation:Pyro ppl Pyro ProfileHMM

Property	Value
Implementation Type	Pattern Doc
Source File	`examples/contrib/mue/ProfileHMM.py`
Module	pyro.contrib.mue
Pyro Features	`pyro.contrib.mue.models.ProfileHMM`, `pyro.contrib.mue.dataloaders.BiosequenceDataset`, SVI, `MultiStepLR` scheduler
References	Durbin et al. (1998), "Biological sequence analysis"; Weinstein & Marks (2021)

Overview

This script trains the Profile HMM, a standard probabilistic model for families of biological sequences. In the MuE (mutational emission) framework, the Profile HMM corresponds to a constant (delta function) distribution over the latent sequence combined with a MuE observation distribution, which makes it a special case of the FactorMuE model with no latent factors.

Unlike the FactorMuE, the Profile HMM does not learn a latent representation. Instead, it directly models:

Consensus sequence positions with emission probabilities
Insertion states allowing extra characters between consensus positions
Deletion states allowing consensus positions to be skipped

The model handles variable-length sequences without requiring a pre-computed multiple sequence alignment, learning the alignment implicitly through the MuE observation distribution.
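
The three kinds of states listed above can be illustrated with a toy generative sampler. This is a minimal standard-library sketch, not the `pyro.contrib.mue` implementation: the function name `sample_profile_hmm`, the single shared insertion and deletion probabilities, and the uniform insertion emissions are all assumptions made for brevity.

```python
import random


def sample_profile_hmm(consensus_probs, alphabet, p_insert=0.1, p_delete=0.1, rng=None):
    """Draw one sequence from a toy profile HMM (illustrative only).

    consensus_probs holds one list of emission weights over `alphabet`
    for each consensus position.
    """
    rng = rng or random.Random()
    out = []
    for probs in consensus_probs:
        # Insertion state: emit extra characters before this position.
        while rng.random() < p_insert:
            out.append(rng.choice(alphabet))
        # Deletion state: skip this consensus position entirely.
        if rng.random() < p_delete:
            continue
        # Match state: emit from this position's own distribution.
        out.append(rng.choices(alphabet, weights=probs, k=1)[0])
    return "".join(out)


alphabet = "ACGT"
# Hypothetical 3-position consensus that strongly favours "ACG".
consensus = [
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
]
samples = [sample_profile_hmm(consensus, alphabet, rng=random.Random(i)) for i in range(5)]
```

Because insertions and deletions change the output length, the sampled sequences vary in length even though the consensus is fixed, which is why no pre-computed alignment is needed. The toy omits insertions after the final consensus position.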

Code Reference

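import json

import torch
from torch.optim import Adam

from pyro.contrib.mue.dataloaders import BiosequenceDataset
from pyro.contrib.mue.models import ProfileHMM
from pyro.optim import MultiStepLR


def main(args):
    # Simplified device choice: use the GPU only when CUDA is requested.
    device = torch.device("cuda" if args.cuda else "cpu")
    dataset = BiosequenceDataset(args.file, "fasta", args.alphabet,
                                  include_stop=args.include_stop, device=device)

    # Default the latent length to 1.1 times the longest sequence.
    latent_seq_length = args.latent_seq_length
    if latent_seq_length is None:
        latent_seq_length = int(dataset.max_length * 1.1)

    model = ProfileHMM(
        latent_seq_length, dataset.alphabet_length,
        prior_scale=args.prior_scale,
        indel_prior_bias=args.indel_prior_bias,
        cuda=args.cuda,
    )

    # Step-decay learning rate; milestones arrive as a JSON list string.
    scheduler = MultiStepLR({"optimizer": Adam, "optim_args": {"lr": args.learning_rate},
                              "milestones": json.loads(args.milestones)})
    n_epochs = args.n_epochs  # set with -e; the attribute name is assumed here
    losses = model.fit_svi(dataset, n_epochs, args.batch_size, scheduler, args.jit)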
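
The `milestones` value is passed as a JSON string and decoded with `json.loads`. The sketch below shows, in plain Python, the step schedule that a multi-step scheduler applies: the learning rate is multiplied by a decay factor each time a milestone is passed. The helper name `lr_at_epoch`, the milestone string, and the decay factor of 0.5 are illustrative assumptions, not values taken from the script.

```python
import json
from bisect import bisect_right


def lr_at_epoch(base_lr, milestones, gamma, epoch):
    """Closed-form multi-step schedule: decay by `gamma` at each milestone."""
    # bisect_right counts how many milestones are <= epoch.
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


milestones = json.loads("[5, 10]")  # hypothetical command-line string
schedule = [lr_at_epoch(0.01, milestones, 0.5, epoch) for epoch in range(12)]
```

With these assumed values the rate stays at 0.01 for epochs 0 to 4, halves at epoch 5, and halves again at epoch 10.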

I/O Contract

Parameter	Type	Description
`-f / --file`	`str`	Input FASTA file path
`-a / --alphabet`	`str`	Alphabet type: "amino-acid", "dna", or custom
`-M / --latent-seq-length`	`int`	Latent (consensus) sequence length (default: 1.1 times the longest sequence in the dataset, truncated to an integer)
`--prior-scale`	`float`	Prior scale for all parameters (default: 1.0)
`--indel-prior-bias`	`float`	Indel prior bias (default: 10.0)
`--split`	`float`	Fraction of the dataset held out for testing (default: 0.2)
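
The options in the table could be declared with `argparse` roughly as follows. This is a sketch under stated assumptions rather than the script's actual parser: `build_parser` and `resolve_latent_length` are hypothetical helper names, only the options listed in the table are included, and defaults the table does not state are left as `None`.

```python
import argparse


def build_parser():
    """Declare only the options listed in the table above (sketch)."""
    parser = argparse.ArgumentParser(description="Profile HMM options (sketch)")
    parser.add_argument("-f", "--file", type=str, default=None,
                        help="Input FASTA file path")
    parser.add_argument("-a", "--alphabet", type=str, default=None,
                        help='Alphabet: "amino-acid", "dna", or custom')
    parser.add_argument("-M", "--latent-seq-length", type=int, default=None,
                        help="Latent (consensus) sequence length")
    parser.add_argument("--prior-scale", type=float, default=1.0,
                        help="Prior scale for all parameters")
    parser.add_argument("--indel-prior-bias", type=float, default=10.0,
                        help="Indel prior bias")
    parser.add_argument("--split", type=float, default=0.2,
                        help="Train/test split fraction")
    return parser


def resolve_latent_length(args, max_length):
    """Apply the documented default of 1.1 times the maximum length."""
    if args.latent_seq_length is None:
        return int(max_length * 1.1)
    return args.latent_seq_length


args = build_parser().parse_args(["-f", "example.fasta", "-a", "dna"])
latent_length = resolve_latent_length(args, max_length=100)
```

Leaving `-M` unset and resolving it after the dataset is loaded keeps the default tied to the data rather than to a fixed number.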

Output:

Training and test log-probability and perplexity
Loss curve plot, insertion/deletion probability plots
Saved parameter store and evaluation results
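
For readers unfamiliar with the perplexity figure, the sketch below shows one common per-character definition: the exponential of the negative mean log-probability per character. The function name `per_character_perplexity` is hypothetical, and the script's own evaluation routine may normalise differently (for example, by averaging per sequence first).

```python
import math


def per_character_perplexity(log_probs, lengths):
    """exp of the negative total log-probability divided by total characters."""
    total_log_prob = sum(log_probs)
    total_chars = sum(lengths)
    return math.exp(-total_log_prob / total_chars)


# A model that assigns probability 1/4 to every character of two DNA
# sequences (lengths 10 and 6) is exactly as uncertain as a 4-way guess.
lengths = [10, 6]
log_probs = [n * math.log(0.25) for n in lengths]
perplexity = per_character_perplexity(log_probs, lengths)
```

Lower perplexity on the held-out test set indicates that the fitted model predicts unseen sequences better.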

Usage Examples

# Train ProfileHMM on protein data
# python ProfileHMM.py -f ve6_full.fasta -b 10 -M 174 --indel-prior-bias 10. \
#     -e 15 -lr 0.01 --jit --cuda

# Quick test with generated data
# python ProfileHMM.py --test --small -e 5

Related Pages

Pyro_ppl_Pyro_FactorMuE - More complex FactorMuE model with latent PCA factors
