Implementation:Pyro ppl Pyro FactorMuE

Property	Value
Implementation Type	Pattern Doc
Source File	`examples/contrib/mue/FactorMuE.py`
Module	pyro.contrib.mue
Pyro Features	`pyro.contrib.mue.models.FactorMuE`, `pyro.contrib.mue.dataloaders.BiosequenceDataset`, SVI, `MultiStepLR` scheduler, JIT compilation
Paper	Weinstein & Marks (2021), "A structured observation distribution for generative biological sequence prediction and forecasting"

Overview

This file provides a training script for the FactorMuE model, a probabilistic PCA model with a mutational emission (MuE) observation distribution. It is a generative model for variable-length biological sequences (e.g., proteins) that does not require a multiple sequence alignment as a preprocessing step.

The FactorMuE model:

Learns a latent representation of sequences in a low-dimensional space (z-dim)
Identifies principal components of sequence variation
Accounts for alignment uncertainty through the MuE observation model
Supports indel modeling (insertions and deletions) with configurable priors
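
As a rough illustration of the first two points, the sketch below shows a probabilistic-PCA-style decoder in plain Python: a latent vector is mapped linearly to per-position letter logits, and a softmax turns them into probabilities. The names `decode_latent`, `weights`, and `offsets` are hypothetical, and the sketch deliberately leaves out the MuE layer (indels and substitutions), so it should be read as a conceptual aid rather than the library's implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_latent(z, weights, offsets):
    """Map a latent vector to per-position letter probabilities.

    z:       latent vector of length K (K plays the role of z-dim)
    weights: weights[k][m][d], loading of latent dim k on position m, letter d
    offsets: offsets[m][d], baseline logit for letter d at position m
    (Hypothetical names; the MuE observation layer is not modeled here.)
    """
    probs = []
    for m, row in enumerate(offsets):
        logits = [
            row[d] + sum(z[k] * weights[k][m][d] for k in range(len(z)))
            for d in range(len(row))
        ]
        probs.append(softmax(logits))
    return probs


# Toy example: K=2 latent dims, M=3 positions, D=4 letters.
K, M, D = 2, 3, 4
offsets = [[0.0] * D for _ in range(M)]
weights = [[[0.0] * D for _ in range(M)] for _ in range(K)]
weights[0][1][2] = 3.0  # latent dim 0 favors letter 2 at position 1

at_origin = decode_latent([0.0, 0.0], weights, offsets)
shifted = decode_latent([1.0, 0.0], weights, offsets)
```

Moving along the first latent axis changes the letter distribution only at the position that axis loads on, which is the sense in which each latent dimension acts as a principal component of sequence variation.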

The script handles data loading from FASTA files, model construction, SVI-based training with learning rate scheduling, evaluation (log-probability and perplexity), and latent space visualization.
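
To make the data-loading step concrete, here is a minimal FASTA reader in plain Python. It is only a sketch of the file format: `read_fasta` is a hypothetical helper, not part of `BiosequenceDataset`, and the `max_length` computed here ignores any stop symbol the real dataset class may append.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input with sequences of different lengths (no alignment needed).
toy_fasta = io.StringIO(">seq1\nMKV\nLA\n>seq2\nMK\n")
records = read_fasta(toy_fasta)
max_length = max(len(seq) for _, seq in records)
```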

Code Reference

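import json

from torch.optim import Adam

from pyro.contrib.mue.dataloaders import BiosequenceDataset
from pyro.contrib.mue.models import FactorMuE
from pyro.optim import MultiStepLR


def main(args):
    # Abridged excerpt: `device`, `dataset_train`, `dataset_test`, and
    # `n_epochs` are defined in parts of the script not shown here.

    # Load sequences from a FASTA file.
    dataset = BiosequenceDataset(args.file, "fasta", args.alphabet,
                                  include_stop=args.include_stop, device=device)

    # Construct the model from the dataset dimensions and CLI options.
    model = FactorMuE(
        dataset.max_length, dataset.alphabet_length, args.z_dim,
        batch_size=args.batch_size,
        latent_seq_length=args.latent_seq_length,
        indel_factor_dependence=args.indel_factor,
        indel_prior_scale=args.indel_prior_scale,
        indel_prior_bias=args.indel_prior_bias,
        z_prior_distribution=args.z_prior,
    )

    # Fit with SVI under a multi-step learning rate schedule.
    scheduler = MultiStepLR({"optimizer": Adam, "optim_args": {"lr": args.learning_rate},
                              "milestones": json.loads(args.milestones)})
    losses = model.fit_svi(dataset_train, n_epochs, args.anneal, args.batch_size,
                            scheduler, args.jit)

    # Evaluate log-probability and perplexity, then embed every sequence.
    train_lp, test_lp, train_perplex, test_perplex = model.evaluate(
        dataset_train, dataset_test, args.jit)
    z_locs, z_scales = model.embed(dataset)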
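
The effect of the `milestones` argument can be sketched in plain Python. The snippet assumes standard multi-step decay, where the learning rate is multiplied by `gamma` each time the step count reaches a milestone, and that the scheduler is stepped once per epoch. `multistep_lr` is a hypothetical helper, and the `gamma` and milestone values are illustrative rather than taken from the script.

```python
import json


def multistep_lr(base_lr, milestones, epoch, gamma=0.1):
    """Learning rate after `epoch` scheduler steps under multi-step decay."""
    # Count the milestones already reached; each one applies a factor gamma.
    n_decays = sum(1 for m in milestones if m <= epoch)
    return base_lr * gamma ** n_decays


# Milestones arrive as a JSON string on the command line.
milestones = json.loads("[5, 10]")
schedule = [multistep_lr(0.01, milestones, e, gamma=0.5) for e in range(12)]
```

With these illustrative values the rate stays at its base value until the first milestone, then halves at each milestone.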

I/O Contract

Parameter	Type	Description
`-f / --file`	`str`	Input FASTA file path
`-a / --alphabet`	`str`	Alphabet type: "amino-acid", "dna", or custom
`-zdim / --z-dim`	`int`	Latent space dimensionality (default: 2)
`-M / --latent-seq-length`	`int`	Latent sequence length
`-b / --batch-size`	`int`	Batch size (default: 10)
`--anneal`	`float`	Number of epochs over which to anneal beta, the weight on the prior KL term
`--split`	`float`	Fraction of the dataset held out for testing (default: 0.2)
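
Two of these options are easiest to understand numerically. The sketch below assumes a linear annealing schedule, with beta rising from 0 to 1 over the annealing period and no annealing when the period is 0, and a held-out set whose size is rounded up. `kl_weight` and `split_sizes` are hypothetical helpers; the script's exact schedule and rounding may differ.

```python
import math


def kl_weight(epochs_elapsed, anneal_epochs):
    """Assumed linear beta schedule: 0 at the start, 1 once annealing ends."""
    if anneal_epochs <= 0:
        return 1.0  # annealing disabled: full KL weight from the start
    return min(epochs_elapsed / anneal_epochs, 1.0)


def split_sizes(n_sequences, split):
    """Return (n_train, n_test) for a held-out fraction `split`."""
    n_test = math.ceil(split * n_sequences)  # rounding up is an assumption
    return n_sequences - n_test, n_test


# Beta at the start of each epoch when annealing over 5 epochs.
betas = [kl_weight(epoch, 5) for epoch in range(8)]
n_train, n_test = split_sizes(10, 0.25)
```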

Output:

Training and test log-probability and perplexity
Latent space embeddings (z_locs, z_scales)
Loss curve plot, latent space scatter plot, indel probability plots
Saved parameter store and evaluation results
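
For readers unfamiliar with perplexity, the following sketch shows one common definition: the exponentiated negative mean per-residue log-probability. `per_residue_perplexity` is a hypothetical helper; the library computes its figures from ELBO-based estimates, so its numbers may be defined somewhat differently.

```python
import math


def per_residue_perplexity(log_probs, lengths):
    """Perplexity from per-sequence log-probabilities and sequence lengths."""
    # Normalize each sequence's log-probability by its length, then average.
    per_residue = [lp / n for lp, n in zip(log_probs, lengths)]
    return math.exp(-sum(per_residue) / len(per_residue))


# Sanity check: a model that assigns probability 1/20 to every residue.
lengths = [5, 3]
log_probs = [n * math.log(1 / 20) for n in lengths]
uniform_perplexity = per_residue_perplexity(log_probs, lengths)
```

Under this definition, a model that is uniform over a 20-letter alphabet has a perplexity of about 20, so lower values indicate a model that predicts residues better than chance.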

Usage Examples

# Train FactorMuE on protein sequence data
python FactorMuE.py -f ve6_full.fasta --z-dim 2 -b 10 -M 174 -D 25 \
    --indel-prior-bias 10. --anneal 5 -e 15 -lr 0.01 --z-prior Laplace --jit --cuda

# Quick test with generated data
python FactorMuE.py --test --small -e 5

Related Pages

Pyro_ppl_Pyro_ProfileHMM - Simpler Profile HMM model using the same MuE framework
