Principle:Pyro ppl Pyro Autoregressive Networks

Knowledge Sources	MADE: Masked Autoencoder for Distribution Estimation; Masked Autoregressive Flow for Density Estimation; Improving Variational Inference with Inverse Autoregressive Flow
Domains	Autoregressive Models, Neural Density Estimation, Normalizing Flows
Last Updated	2026-02-09 09:00 GMT

Overview

Autoregressive networks use masked neural architectures to model the joint distribution of a vector as a product of conditionals. The masks act as an architectural constraint that guarantees each output dimension depends only on the preceding input dimensions.

Description

The autoregressive property factorizes a joint distribution into a product of conditionals:

p(x_1, x_2, ..., x_d) = p(x_1) * p(x_2 | x_1) * p(x_3 | x_1, x_2) * ... * p(x_d | x_1, ..., x_{d-1})

This factorization is always valid (chain rule of probability), but becomes a powerful modeling tool when each conditional is parameterized by a neural network.
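As a quick numerical illustration of the chain rule, the sketch below factorizes a small discrete joint distribution into conditionals and multiplies them back together. The joint table is a hypothetical random example over three binary variables, chosen only to make the identity concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint distribution p(x1, x2, x3) over three binary variables.
joint = rng.random((2, 2, 2))
joint /= joint.sum()

# Marginals needed to form the conditionals.
p_x1 = joint.sum(axis=(1, 2))                # p(x1)
p_x12 = joint.sum(axis=2)                    # p(x1, x2)

# Conditionals in the autoregressive order x1, x2, x3.
p_x2_given_x1 = p_x12 / p_x1[:, None]        # p(x2 | x1)
p_x3_given_x12 = joint / p_x12[:, :, None]   # p(x3 | x1, x2)

# Product of conditionals reproduces the joint exactly.
reconstructed = (
    p_x1[:, None, None] * p_x2_given_x1[:, :, None] * p_x3_given_x12
)
print(np.allclose(reconstructed, joint))
```

In a neural autoregressive model, each of these conditional tables is replaced by the output of a network that reads the preceding dimensions.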

MADE (Masked Autoencoder for Distribution Estimation) enforces the autoregressive property through binary masks applied to the weight matrices of a standard feedforward network. Taken together, the masks ensure that the output for dimension i depends only on the inputs for dimensions 1, ..., i-1. The masks are constructed by:

Assigning each hidden unit a random integer in [1, d-1] (its "connectivity number").
Allowing a connection from layer l to layer l+1 only if the connectivity number of the source unit is less than or equal to that of the target unit.
For the output layer, output i can only receive input from hidden units with connectivity number < i.
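The three construction rules above can be sketched in NumPy as follows. This is an illustrative sketch, not Pyro's implementation: `build_made_masks` is a hypothetical helper, and it samples hidden connectivity numbers uniformly without the refinements (such as avoiding unconnected units) that a production implementation may add. Multiplying the masks together counts the paths from each input to each output, which gives a direct check of the autoregressive property.

```python
import numpy as np

def build_made_masks(d, hidden_sizes, seed=0):
    """Return binary masks, one per layer, each of shape (fan_out, fan_in)."""
    rng = np.random.default_rng(seed)
    # Inputs carry connectivity numbers 1..d.
    degrees = [np.arange(1, d + 1)]
    # Each hidden unit gets a random integer in [1, d-1] (upper bound is exclusive).
    for h in hidden_sizes:
        degrees.append(rng.integers(1, d, size=h))
    # Hidden layers: connect when source number <= target number.
    masks = [
        (prev[None, :] <= cur[:, None]).astype(float)
        for prev, cur in zip(degrees[:-1], degrees[1:])
    ]
    # Output layer: strict inequality, so output i never sees input i.
    out_degrees = np.arange(1, d + 1)
    masks.append((degrees[-1][None, :] < out_degrees[:, None]).astype(float))
    return masks

masks = build_made_masks(d=4, hidden_sizes=[8, 8])

# paths[i, k] counts the connected paths from input k to output i.
paths = masks[0]
for m in masks[1:]:
    paths = m @ paths

# Strictly lower triangular: output i is reachable only from inputs k < i.
print(np.allclose(np.triu(paths), 0.0))
```

Because the path matrix has zeros on and above the diagonal, no setting of the weights can make output i depend on input i or any later input.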

This approach is efficient because:

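All conditionals are computed in a single forward pass when x is fully observed, as in training and density evaluation (unlike RNN-based autoregressive models, which need d sequential steps). Sampling still requires d sequential passes.
The masks are precomputed and fixed, so they add only an element-wise multiplication at runtime.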
Multiple mask orderings can be used for ensemble averaging.

DenseNN provides a general-purpose dense (fully connected) neural network with no masking. It is used as a building block where the autoregressive constraint is not needed, for example as the parameter network of coupling transforms or as a conditioning network in conditional flows.

Together, these components form the backbone of normalizing flows (MAF, IAF) and autoregressive density estimators used in variational inference and density estimation.

Usage

Use autoregressive networks when:

Building normalizing flow transforms (MAF, IAF) for flexible variational approximations.
Constructing density estimators that can compute exact log-probabilities.
Implementing conditional distributions where each dimension depends on previous ones.
Training autoregressive models that need an efficient single-pass architecture.

Theoretical Basis

Autoregressive factorization:

# Chain rule:
# p(x) = product_{i=1}^{d} p(x_i | x_{<i})

# Each conditional parameterized by neural network:
# p(x_i | x_{<i}) = f(x_i; theta_i(x_{<i}))

# Example: Gaussian autoregressive model
# mu_i, sigma_i = NN_i(x_1, ..., x_{i-1})
# x_i | x_{<i} ~ Normal(mu_i, sigma_i^2)
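The Gaussian example above can be made concrete with a minimal sketch in which each conditioner is linear rather than a neural network. The strictly lower-triangular matrices `A` and `C` and the helper names are assumptions made for illustration; they stand in for `NN_i` so that row i reads only x_1, ..., x_{i-1}.

```python
import numpy as np

def gaussian_ar_log_prob(x, A, b, C, c):
    """log p(x) = sum_i log Normal(x_i; mu_i, sigma_i^2) with linear conditioners.

    A and C are strictly lower triangular, so row i only reads x_1..x_{i-1}.
    """
    mu = A @ x + b
    log_sigma = C @ x + c
    z = (x - mu) / np.exp(log_sigma)
    return np.sum(-0.5 * z**2 - log_sigma - 0.5 * np.log(2 * np.pi))

def gaussian_ar_sample(A, b, C, c, rng):
    """Ancestral sampling: draw x_1, then x_2 | x_1, and so on."""
    d = len(b)
    x = np.zeros(d)
    for i in range(d):
        mu_i = A[i, :i] @ x[:i] + b[i]
        sigma_i = np.exp(C[i, :i] @ x[:i] + c[i])
        x[i] = mu_i + sigma_i * rng.normal()
    return x

rng = np.random.default_rng(0)
d = 3
A = np.tril(rng.normal(size=(d, d)), k=-1)        # mean weights
C = np.tril(0.1 * rng.normal(size=(d, d)), k=-1)  # log-scale weights
b, c = rng.normal(size=d), np.zeros(d)

x = gaussian_ar_sample(A, b, C, c, rng)
log_p = gaussian_ar_log_prob(x, A, b, C, c)
print(x, log_p)
```

Note the asymmetry: the log-density is computed for all dimensions at once, while sampling has to proceed one dimension at a time.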

MADE masking scheme:

# For a network with layers W_1, W_2, ..., W_L:
# Assign connectivity numbers:
#   m(input_i) = i  for i = 1, ..., d
#   m(hidden_j) = random integer in [1, d-1]  (for each hidden unit j)
#   m(output_i) = i  for i = 1, ..., d

# Mask for layer l (rows index target units, columns index source units):
#   M_l[j, k] = 1 if m(unit_k^{l-1}) <= m(unit_j^l), else 0
#   (for hidden layers)
#   M_L[i, k] = 1 if m(unit_k^{L-1}) < m(output_i), else 0
#   (strict inequality for output layer ensures output_i cannot depend on x_i)

# Effective weight: W_l_masked = W_l * M_l  (element-wise product)

# Result: output_i is a function of (x_1, ..., x_{i-1}) only
# All d outputs computed in a single forward pass
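To see the masking scheme act on an actual forward pass, the following sketch builds a one-hidden-layer masked network in NumPy and inspects its Jacobian by finite differences. The layer sizes, random weights, and tanh nonlinearity are arbitrary choices for illustration and are not taken from any particular library.

```python
import numpy as np

rng = np.random.default_rng(1)
d, h = 4, 16

# Connectivity numbers, following the scheme above.
m_in = np.arange(1, d + 1)
m_hid = rng.integers(1, d, size=h)      # values in [1, d-1]
m_out = np.arange(1, d + 1)

M1 = (m_in[None, :] <= m_hid[:, None]).astype(float)   # (h, d), non-strict
M2 = (m_hid[None, :] < m_out[:, None]).astype(float)   # (d, h), strict

W1, b1 = rng.normal(size=(h, d)), rng.normal(size=h)
W2, b2 = rng.normal(size=(d, h)), rng.normal(size=d)

def made_forward(x):
    """Single forward pass producing one output per input dimension."""
    hidden = np.tanh((W1 * M1) @ x + b1)
    return (W2 * M2) @ hidden + b2

def jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian, J[i, k] = d f_i / d x_k."""
    J = np.zeros((d, d))
    for k in range(d):
        step = np.zeros(d)
        step[k] = eps
        J[:, k] = (f(x + step) - f(x - step)) / (2 * eps)
    return J

J = jacobian(made_forward, rng.normal(size=d))

# Entries on and above the diagonal vanish: output_i ignores x_i, x_{i+1}, ...
print(np.allclose(np.triu(J), 0.0, atol=1e-8))
```

The first output has no incoming hidden units at all, so it reduces to its bias term, which matches p(x_1) being unconditional.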

Application to normalizing flows (MAF):

# Masked Autoregressive Flow (MAF):
# Generative direction z -> x, with parameters conditioned on previous x:
# mu_i, log_sigma_i = MADE(x_1, ..., x_{i-1})
# x_i = mu_i + sigma_i * z_i,  sigma_i = exp(log_sigma_i)

# Sampling (z -> x): sequential, d passes, since x_{<i} must exist before x_i
# Inverse (x -> z, used for density evaluation): single pass, since all of x is known
# z_i = (x_i - mu_i) / sigma_i

# Log-determinant of the Jacobian J = dx/dz (triangular, so determinant = product of diagonal):
# log |det J| = sum_i log sigma_i

# Density: log p(x) = log p_base(z) - sum_i log sigma_i
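A minimal NumPy sketch of the MAF computations, assuming a standard normal base distribution. To keep it short, a strictly lower-triangular linear map stands in for MADE (the `conditioner` function and its weights are hypothetical); any conditioner whose i-th output depends only on x_{<i} would behave the same way. Density evaluation takes one conditioner pass, while sampling needs d of them.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Stand-in for MADE: strictly lower-triangular weights make
# output i depend on x_1..x_{i-1} only.
W_mu = np.tril(rng.normal(size=(d, d)), k=-1)
W_ls = np.tril(0.1 * rng.normal(size=(d, d)), k=-1)

def conditioner(x):
    """Return (mu, log_sigma) for all dimensions in one pass."""
    return np.tanh(W_mu @ x), W_ls @ x

def maf_log_prob(x):
    """Density evaluation: one conditioner pass, since all of x is known."""
    mu, log_sigma = conditioner(x)
    z = (x - mu) * np.exp(-log_sigma)
    log_base = -0.5 * np.sum(z**2) - 0.5 * d * np.log(2 * np.pi)
    return log_base - np.sum(log_sigma), z

def maf_sample(z):
    """Sampling: d sequential passes, because mu_i and sigma_i need x_{<i}."""
    x = np.zeros(d)
    for i in range(d):
        mu, log_sigma = conditioner(x)   # entries of x below index i are final
        x[i] = mu[i] + np.exp(log_sigma[i]) * z[i]
    return x

z = rng.normal(size=d)
x = maf_sample(z)
log_p, z_recovered = maf_log_prob(x)
print(np.allclose(z_recovered, z), log_p)
```

The round trip recovers the original noise, which confirms that the single-pass inverse and the sequential sampler describe the same transform.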

Inverse Autoregressive Flow (IAF):

# Reverse direction: fast sampling, slow density evaluation
# mu_i, log_sigma_i = MADE(z_1, ..., z_{i-1})
# x_i = mu_i + sigma_i * z_i

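# Sampling: single forward pass (fast); log q(x) of the sample just drawn comes for free
# Density evaluation of an arbitrary x: sequential (slow), since z_{<i} must be recovered first
# Complementary to MAF: IAF is preferred for variational inference
# (sampling is in the inner loop), MAF for density estimation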
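The IAF direction can be sketched the same way, again assuming a standard normal base distribution and using a hypothetical lower-triangular linear `conditioner` in place of MADE. Here the conditioner reads z rather than x, so sampling (together with the log-density of the drawn sample) takes a single pass, whereas inverting an arbitrary x requires d sequential passes.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Stand-in for MADE applied to z: output i depends on z_1..z_{i-1} only.
W_mu = np.tril(rng.normal(size=(d, d)), k=-1)
W_ls = np.tril(0.1 * rng.normal(size=(d, d)), k=-1)

def conditioner(z):
    """Return (mu, log_sigma) for all dimensions in one pass."""
    return np.tanh(W_mu @ z), W_ls @ z

def iaf_sample(z):
    """Single pass: all of z is available, so every x_i is computed at once."""
    mu, log_sigma = conditioner(z)
    x = mu + np.exp(log_sigma) * z
    log_base = -0.5 * np.sum(z**2) - 0.5 * d * np.log(2 * np.pi)
    log_q = log_base - np.sum(log_sigma)   # density of this particular sample
    return x, log_q

def iaf_invert(x):
    """Sequential: z_i needs mu_i and sigma_i, which need z_{<i}."""
    z = np.zeros(d)
    for i in range(d):
        mu, log_sigma = conditioner(z)     # entries of z below index i are final
        z[i] = (x[i] - mu[i]) * np.exp(-log_sigma[i])
    return z

z = rng.normal(size=d)
x, log_q = iaf_sample(z)
z_recovered = iaf_invert(x)
print(np.allclose(z_recovered, z), log_q)
```

In variational inference only `iaf_sample` is needed per gradient step, which is why the slow inverse is rarely a bottleneck in that setting.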
