

Implementation:Pyro ppl Pyro FactorMuE

From Leeroopedia
Revision as of 16:23, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Pyro_ppl_Pyro_FactorMuE.md)


| Property | Value |
| --- | --- |
| Implementation Type | Pattern Doc |
| Source File | examples/contrib/mue/FactorMuE.py |
| Module | pyro.contrib.mue |
| Pyro Features | pyro.contrib.mue.models.FactorMuE, pyro.contrib.mue.dataloaders.BiosequenceDataset, SVI, MultiStepLR scheduler, JIT compilation |
| Paper | Weinstein & Marks (2021), "A structured observation distribution for generative biological sequence prediction and forecasting" |

Overview

This file provides a training script for the FactorMuE model, a probabilistic PCA model with a MuE (mutational emission) observation distribution. It is a generative model for variable-length biological sequences (e.g., proteins) that does not require the sequences to be preprocessed into a multiple sequence alignment.

The FactorMuE model:

  • Learns a latent representation of sequences in a low-dimensional space (z-dim)
  • Identifies principal components of sequence variation
  • Accounts for alignment uncertainty through the MuE observation model
  • Supports indel modeling (insertions and deletions) with configurable priors
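
The first two bullets can be illustrated with a minimal, dependency-free sketch of a linear (probabilistic-PCA-style) decoder: a latent vector z is mapped through a weight matrix and an offset to per-position logits, which are then normalised into a distribution over the alphabet. This is an illustration under assumptions, not Pyro's implementation: `decode_latent`, the toy dimensions, and the random weights are hypothetical, and the MuE observation layer that handles insertions and deletions is omitted entirely.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_latent(z, weights, offsets):
    """Map latent vector z to one categorical distribution per position.

    weights[k][m][a] is the loading of latent dimension k on letter a at
    position m; offsets[m][a] is a per-position bias. Both are hypothetical
    stand-ins for learned parameters.
    """
    profile = []
    for m, offset_row in enumerate(offsets):
        logits = [
            offset_row[a] + sum(z[k] * weights[k][m][a] for k in range(len(z)))
            for a in range(len(offset_row))
        ]
        profile.append(softmax(logits))
    return profile


rng = random.Random(0)
z_dim, latent_seq_length, alphabet_length = 2, 3, 4  # toy sizes
weights = [
    [[rng.gauss(0, 1) for _ in range(alphabet_length)]
     for _ in range(latent_seq_length)]
    for _ in range(z_dim)
]
offsets = [[0.0] * alphabet_length for _ in range(latent_seq_length)]

# One point in latent space yields one distribution over letters per position.
profile = decode_latent([0.5, -1.0], weights, offsets)
```

Moving z along one latent axis changes every position's letter distribution at once, which is what lets a two-dimensional embedding summarise variation across whole sequences.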

The script handles data loading from FASTA files, model construction, SVI-based training with learning rate scheduling, evaluation (log-probability and perplexity), and latent space visualization.
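
The data-loading step can be pictured with a small stand-alone sketch: parse FASTA records, then one-hot encode each sequence and zero-pad it to the longest length, so that unaligned sequences of different lengths fit in one array. `parse_fasta` and `one_hot_pad` are hypothetical helpers written for illustration; in the actual script `BiosequenceDataset` does this work, and its internals may differ.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def one_hot_pad(sequences, alphabet):
    """One-hot encode sequences, zero-padding each to the longest length."""
    index = {letter: i for i, letter in enumerate(alphabet)}
    max_length = max(len(s) for s in sequences)
    encoded = []
    for seq in sequences:
        rows = [[0] * len(alphabet) for _ in range(max_length)]
        for pos, letter in enumerate(seq):
            rows[pos][index[letter]] = 1
        encoded.append(rows)
    lengths = [len(s) for s in sequences]
    return encoded, lengths


# Two records of different lengths; the second sequence spans two lines.
fasta_text = ">seq1\nACGT\n>seq2\nAC\nG\n"
records = parse_fasta(fasta_text)
encoded, lengths = one_hot_pad([seq for _, seq in records], "ACGT")
```

Keeping the true lengths alongside the padded array is what allows a model to ignore the padding positions when scoring each sequence.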

Code Reference

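import json

import torch
from torch.optim import Adam

from pyro.contrib.mue.dataloaders import BiosequenceDataset
from pyro.contrib.mue.models import FactorMuE
from pyro.optim import MultiStepLR


def main(args):
    # Condensed excerpt of the script's main(); device selection and the
    # train/test split are simplified here.
    device = torch.device("cuda" if args.cuda else "cpu")
    dataset = BiosequenceDataset(args.file, "fasta", args.alphabet,
                                 include_stop=args.include_stop, device=device)
    # The full script splits the data according to --split; this excerpt
    # trains on everything and keeps no held-out set.
    dataset_train, dataset_test = dataset, None

    model = FactorMuE(
        dataset.max_length, dataset.alphabet_length, args.z_dim,
        batch_size=args.batch_size,
        latent_seq_length=args.latent_seq_length,
        indel_factor_dependence=args.indel_factor,
        indel_prior_scale=args.indel_prior_scale,
        indel_prior_bias=args.indel_prior_bias,
        z_prior_distribution=args.z_prior,
    )

    # Adam with a learning rate that decays at the given milestones.
    scheduler = MultiStepLR({"optimizer": Adam, "optim_args": {"lr": args.learning_rate},
                             "milestones": json.loads(args.milestones)})
    # args.n_epochs is assumed to be the attribute set by the -e flag.
    losses = model.fit_svi(dataset_train, args.n_epochs, args.anneal, args.batch_size,
                           scheduler, args.jit)

    # Log-probability and perplexity on the training and test sets.
    train_lp, test_lp, train_perplex, test_perplex = model.evaluate(
        dataset_train, dataset_test, args.jit)
    # Location and scale of each sequence's latent representation.
    z_locs, z_scales = model.embed(dataset)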
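
Two of the arguments passed to `fit_svi` control schedules over training: `args.anneal` and the `milestones` list. The sketch below shows a common form of each in plain Python: a linear ramp of the KL weight (beta) from 0 to 1, and a step decay of the learning rate at each milestone. Both functions are hypothetical illustrations; the exact ramp used inside `fit_svi`, the decay factor `gamma`, and whether the scheduler steps per epoch or per batch are assumptions, not taken from this page.

```python
import json


def beta_at(epoch, anneal_epochs):
    """KL weight at a (possibly fractional) epoch: linear ramp from 0 to 1."""
    if anneal_epochs <= 0:
        return 1.0  # no annealing: full KL weight from the start
    return min(epoch / anneal_epochs, 1.0)


def lr_at(epoch, base_lr, milestones, gamma=0.1):
    """Step decay: multiply base_lr by gamma once per milestone already reached."""
    reached = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** reached


# Milestones arrive as a JSON string, mirroring json.loads(args.milestones).
milestones = json.loads("[5, 10]")
for epoch in range(0, 15, 5):
    beta = beta_at(epoch, anneal_epochs=5)
    lr = lr_at(epoch, base_lr=0.01, milestones=milestones, gamma=0.5)
    print(f"epoch {epoch}: beta={beta:.2f} lr={lr:.4f}")
```

Ramping beta lets the model fit the data before the prior on the latent variables is fully enforced, while the milestone decay shrinks the step size late in training.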

I/O Contract

| Parameter | Type | Description |
| --- | --- | --- |
| -f / --file | str | Input FASTA file path |
| -a / --alphabet | str | Alphabet type: "amino-acid", "dna", or custom |
| -zdim / --z-dim | int | Latent space dimensionality (default: 2) |
| -M / --latent-seq-length | int | Latent sequence length |
| -b / --batch-size | int | Batch size (default: 10) |
| --anneal | float | Number of epochs over which to anneal beta |
| --split | float | Fraction of the dataset held out for testing (default: 0.2) |
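
The flags in the table can be reproduced with a short `argparse` sketch. It covers only the options listed above, uses the defaults the table states, and leaves the others at `None` as a placeholder. The real script defines more options (for example `-e` and `-lr`, which appear in the usage examples), and any default not shown in the table is an assumption.

```python
import argparse


def build_parser():
    """Parser covering only the flags documented in the I/O table."""
    parser = argparse.ArgumentParser(description="FactorMuE flag sketch")
    parser.add_argument("-f", "--file", type=str, help="Input FASTA file path")
    parser.add_argument("-a", "--alphabet", type=str,
                        help='"amino-acid", "dna", or a custom alphabet')
    parser.add_argument("-zdim", "--z-dim", type=int, default=2,
                        help="Latent space dimensionality")
    parser.add_argument("-M", "--latent-seq-length", type=int,
                        help="Latent sequence length")
    parser.add_argument("-b", "--batch-size", type=int, default=10,
                        help="Batch size")
    parser.add_argument("--anneal", type=float,
                        help="Number of epochs to anneal beta over")
    parser.add_argument("--split", type=float, default=0.2,
                        help="Train/test split fraction")
    return parser


# Dashes in long option names become underscores on the parsed namespace.
args = build_parser().parse_args(
    ["-f", "ve6_full.fasta", "-M", "174", "--anneal", "5"]
)
print(args.z_dim, args.batch_size, args.latent_seq_length, args.split)
```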

Output:

  • Training and test log-probability and perplexity
  • Latent space embeddings (z_locs, z_scales)
  • Loss curve plot, latent space scatter plot, indel probability plots
  • Saved parameter store and evaluation results
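
For the first output, a common way to turn per-sequence log-probabilities into a perplexity is to normalise each by its sequence length, average across sequences, and exponentiate the negative. The sketch below assumes that per-residue definition; `per_residue_perplexity` is a hypothetical helper, and the normalisation used by `model.evaluate` may differ.

```python
import math


def per_residue_perplexity(log_probs, lengths):
    """exp of the negative mean per-residue log-probability across sequences."""
    if len(log_probs) != len(lengths) or not log_probs:
        raise ValueError("need one length per log-probability, and at least one")
    per_residue = [lp / n for lp, n in zip(log_probs, lengths)]
    return math.exp(-sum(per_residue) / len(per_residue))


# A model that gives every residue probability 1/20 (uniform over 20 amino
# acids) should come out with a perplexity of about 20, whatever the lengths.
lengths = [10, 25]
log_probs = [n * math.log(1 / 20) for n in lengths]
perplexity = per_residue_perplexity(log_probs, lengths)
```

Read this way, perplexity is the effective number of equally likely letters the model is choosing between at each position, so lower values indicate a sharper model.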

Usage Examples

# Train FactorMuE on protein sequence data
# python FactorMuE.py -f ve6_full.fasta --z-dim 2 -b 10 -M 174 -D 25 \
#     --indel-prior-bias 10. --anneal 5 -e 15 -lr 0.01 --z-prior Laplace --jit --cuda

# Quick test with generated data
# python FactorMuE.py --test --small -e 5
