Implementation:Pyro ppl Pyro FactorMuE
| Property | Value |
|---|---|
| Implementation Type | Pattern Doc |
| Source File | examples/contrib/mue/FactorMuE.py |
| Module | pyro.contrib.mue |
| Pyro Features | pyro.contrib.mue.models.FactorMuE, pyro.contrib.mue.dataloaders.BiosequenceDataset, SVI, MultiStepLR scheduler, JIT compilation |
| Paper | Weinstein & Marks (2021), "A structured observation distribution for generative biological sequence prediction and forecasting" |
Overview
This file provides a training script for the FactorMuE model, a probabilistic PCA model with a MuE (mutational emission) observation distribution. It is a generative model for variable-length biological sequences (e.g., proteins) and does not require a multiple sequence alignment as a preprocessing step.
The FactorMuE model:
- Learns a latent representation of sequences in a low-dimensional space (z-dim)
- Identifies principal components of sequence variation
- Accounts for alignment uncertainty through the MuE observation model
- Supports indel modeling (insertions and deletions) with configurable priors
The script handles data loading from FASTA files, model construction, SVI-based training with learning rate scheduling, evaluation (log-probability and perplexity), and latent space visualization.
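To make the "variable-length sequences from a FASTA file" input concrete, the following minimal sketch reads FASTA records using only the standard library. It is not the implementation of BiosequenceDataset; the function name `read_fasta` and the toy records are hypothetical, and the sketch only illustrates that unaligned sequences of different lengths are accepted as-is.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical toy input; a real run would open the file given to --file.
toy = StringIO(">seq1\nMKV\nLA\n>seq2\nMK\n")
records = list(read_fasta(toy))
lengths = [len(seq) for _, seq in records]
max_length = max(lengths)  # longest sequence; no alignment or padding applied here
```

The sequences keep their original lengths (5 and 2 residues in the toy input), which is the kind of unaligned data the model is described as handling.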
Code Reference
# Abridged excerpt: device, dataset_train, dataset_test, and n_epochs
# are defined elsewhere in the script.
def main(args):
    # Load sequences from a FASTA file.
    dataset = BiosequenceDataset(args.file, "fasta", args.alphabet,
                                 include_stop=args.include_stop, device=device)

    # Construct the model.
    model = FactorMuE(
        dataset.max_length, dataset.alphabet_length, args.z_dim,
        batch_size=args.batch_size,
        latent_seq_length=args.latent_seq_length,
        indel_factor_dependence=args.indel_factor,
        indel_prior_scale=args.indel_prior_scale,
        indel_prior_bias=args.indel_prior_bias,
        z_prior_distribution=args.z_prior,
    )

    # Fit with SVI, decaying the learning rate at the given milestones.
    scheduler = MultiStepLR({"optimizer": Adam, "optim_args": {"lr": args.learning_rate},
                             "milestones": json.loads(args.milestones)})
    losses = model.fit_svi(dataset_train, n_epochs, args.anneal, args.batch_size,
                           scheduler, args.jit)

    # Evaluate log-probability and perplexity, then embed sequences in latent space.
    train_lp, test_lp, train_perplex, test_perplex = model.evaluate(
        dataset_train, dataset_test, args.jit)
    z_locs, z_scales = model.embed(dataset)
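The scheduler above receives its milestones as a JSON string that is decoded with `json.loads`. As a rough illustration of what a multi-step schedule does, the sketch below computes the learning rate per epoch in plain Python. It assumes the scheduler is stepped once per epoch and that the rate is multiplied by a decay factor `gamma` at each milestone; the milestone string, the value of `gamma`, and the helper name `lr_at_epoch` are hypothetical and not taken from the script.

```python
import json
from bisect import bisect_right


def lr_at_epoch(epoch, base_lr, milestones, gamma):
    """Learning rate in effect after `epoch` scheduler steps under multi-step decay."""
    # Count how many milestones have been reached so far.
    n_decays = bisect_right(sorted(milestones), epoch)
    return base_lr * gamma ** n_decays


milestones = json.loads("[5, 10]")  # hypothetical value for a milestones argument
schedule = [lr_at_epoch(e, 0.01, milestones, gamma=0.5) for e in range(15)]
```

With these assumed values the rate stays at 0.01 for epochs 0 to 4, halves at epoch 5, and halves again at epoch 10.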
I/O Contract
| Parameter | Type | Description |
|---|---|---|
| -f / --file | str | Input FASTA file path |
| -a / --alphabet | str | Alphabet type: "amino-acid", "dna", or custom |
| -zdim / --z-dim | int | Latent space dimensionality (default: 2) |
| -M / --latent-seq-length | int | Latent sequence length |
| -b / --batch-size | int | Batch size (default: 10) |
| --anneal | float | Number of epochs to anneal beta over |
| --split | float | Train/test split fraction (default: 0.2) |
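The flags in the table could be declared with `argparse` roughly as follows. This is a sketch of the documented interface only, not a copy of the script's parser: the defaults for --file, --alphabet, --latent-seq-length, and --anneal are assumptions (the table gives defaults only for --z-dim, --batch-size, and --split), and the script accepts further flags that are not listed here.

```python
import argparse


def build_parser():
    """Parser covering only the flags listed in the I/O contract table."""
    parser = argparse.ArgumentParser(description="Sketch of the documented FactorMuE flags")
    parser.add_argument("-f", "--file", type=str, default=None,
                        help="Input FASTA file path")  # default is an assumption
    parser.add_argument("-a", "--alphabet", type=str, default="amino-acid",
                        help="Alphabet type")  # default is an assumption
    parser.add_argument("-zdim", "--z-dim", type=int, default=2,
                        help="Latent space dimensionality")
    parser.add_argument("-M", "--latent-seq-length", type=int, default=None,
                        help="Latent sequence length")  # default is an assumption
    parser.add_argument("-b", "--batch-size", type=int, default=10,
                        help="Batch size")
    parser.add_argument("--anneal", type=float, default=0.0,
                        help="Number of epochs to anneal beta over")  # default is an assumption
    parser.add_argument("--split", type=float, default=0.2,
                        help="Train/test split fraction")
    return parser


args = build_parser().parse_args(
    ["-f", "ve6_full.fasta", "--z-dim", "2", "-b", "10", "-M", "174", "--anneal", "5"]
)
```

Note that `argparse` converts dashes in long option names to underscores, so `--z-dim` is read back as `args.z_dim` and `--latent-seq-length` as `args.latent_seq_length`, matching the attribute names used in the code reference.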
Output:
- Training and test log-probability and perplexity
- Latent space embeddings (z_locs, z_scales)
- Loss curve plot, latent space scatter plot, indel probability plots
- Saved parameter store and evaluation results
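For readers unfamiliar with perplexity as a sequence-model metric, the sketch below shows one common convention: the exponential of the negative mean per-residue log-probability. Whether `model.evaluate` uses exactly this normalization is not stated on this page, so treat the formula, the helper name `per_residue_perplexity`, and the toy numbers as assumptions for illustration.

```python
import math


def per_residue_perplexity(log_probs, lengths):
    """exp of the negative mean per-residue log-probability across sequences."""
    if not log_probs or len(log_probs) != len(lengths):
        raise ValueError("log_probs and lengths must be non-empty and of equal size")
    # Normalize each sequence's log-probability by its own length, then average.
    mean_nll = -sum(lp / n for lp, n in zip(log_probs, lengths)) / len(log_probs)
    return math.exp(mean_nll)


# Hypothetical numbers: every residue is assigned probability 1/20,
# i.e. a uniform guess over a 20-letter amino-acid alphabet.
lengths = [5, 8]
log_probs = [n * math.log(1 / 20) for n in lengths]
perplexity = per_residue_perplexity(log_probs, lengths)
```

Under this convention a uniform guess over 20 amino acids gives a perplexity of about 20, so lower values indicate that the model predicts residues better than chance.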
Usage Examples
# Train FactorMuE on protein sequence data
# python FactorMuE.py -f ve6_full.fasta --z-dim 2 -b 10 -M 174 -D 25 \
# --indel-prior-bias 10. --anneal 5 -e 15 -lr 0.01 --z-prior Laplace --jit --cuda
# Quick test with generated data
# python FactorMuE.py --test --small -e 5
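The first command combines `--anneal 5` with `-e 15`. As a rough picture of what annealing over 5 of 15 epochs could look like, the sketch below assumes that beta is a weight on the regularization (KL) part of the SVI objective and that it is ramped linearly from 0 to 1. Both the linear shape and the per-epoch granularity are assumptions, and `beta_at_epoch` is a hypothetical helper, not a function from pyro.contrib.mue.

```python
def beta_at_epoch(epoch, anneal_epochs):
    """Linear warm-up of the annealing weight from 0 to 1 over `anneal_epochs` epochs."""
    if anneal_epochs <= 0:
        return 1.0  # no annealing: full weight from the start
    return min(epoch / anneal_epochs, 1.0)


# Values mirroring the first command: anneal over 5 epochs, train for 15.
betas = [beta_at_epoch(e, anneal_epochs=5) for e in range(15)]
```

Under these assumptions the weight rises in steps of 0.2 during the first five epochs and stays at 1.0 for the remaining ten.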
Related Pages
- Pyro_ppl_Pyro_ProfileHMM - Simpler Profile HMM model using the same MuE framework