Principle:Pyro ppl Pyro Autoregressive Networks
| Knowledge Sources | |
|---|---|
| Domains | Autoregressive Models, Neural Density Estimation, Normalizing Flows |
| Last Updated | 2026-02-09 09:00 GMT |
Overview
Autoregressive networks use masked neural architectures to model the joint distribution of a vector as a product of conditionals, ensuring each output dimension depends only on previous dimensions through architectural constraints.
Description
The autoregressive property factorizes a joint distribution into a product of conditionals:
p(x_1, x_2, ..., x_d) = p(x_1) * p(x_2 | x_1) * p(x_3 | x_1, x_2) * ... * p(x_d | x_1, ..., x_{d-1})
This factorization is always valid (chain rule of probability), but becomes a powerful modeling tool when each conditional is parameterized by a neural network.
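The chain-rule identity can be checked numerically on a small discrete distribution. The sketch below is illustrative only: the toy joint over three binary variables and the helper names (`marginal`, `conditional`, `chain_rule_product`) are assumptions made for this example and are not part of Pyro.

```python
import itertools
import random

random.seed(0)
d = 3

# Hypothetical toy joint over d binary variables: random positive weights, normalized.
outcomes = list(itertools.product([0, 1], repeat=d))
weights = [random.random() for _ in outcomes]
total = sum(weights)
joint = {x: w / total for x, w in zip(outcomes, weights)}


def marginal(prefix):
    """p(x_1, ..., x_k = prefix), summing the joint over the remaining variables."""
    k = len(prefix)
    return sum(p for x, p in joint.items() if x[:k] == prefix)


def conditional(x, i):
    """p(x_i | x_<i) for the 0-based index i."""
    return marginal(x[: i + 1]) / marginal(x[:i])


def chain_rule_product(x):
    """Product of the d conditionals evaluated at x."""
    prod = 1.0
    for i in range(d):
        prod *= conditional(x, i)
    return prod


# Largest gap between the joint and the product of conditionals over all outcomes.
max_error = max(abs(chain_rule_product(x) - joint[x]) for x in outcomes)
```

The gap is at the level of floating-point rounding for every outcome. The factorization holds for any joint; the modeling choice lies only in how each conditional is parameterized.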
MADE (Masked Autoencoder for Distribution Estimation) enforces the autoregressive property through binary masks applied to the weight matrices of a standard feedforward network. Each mask ensures that the output for dimension i depends only on inputs for dimensions 1, ..., i-1. The masks are constructed by:
- Assigning each input and each output the index of its dimension, and each hidden unit a random integer in [1, d-1] (its "connectivity number").
- Allowing a connection from layer l to layer l+1 only if the connectivity number of the source unit is less than or equal to that of the target unit.
- Allowing a connection into output i only from hidden units whose connectivity number is strictly less than i.
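The construction steps above can be sketched in plain Python. This is a minimal illustration, not Pyro's implementation: `build_made_masks` and `connectivity` are hypothetical helper names, and the natural input ordering 1..d is assumed. Multiplying the masks together counts the input-to-output paths, so a strictly lower-triangular result confirms the autoregressive property.

```python
import random


def build_made_masks(d, hidden_sizes, seed=0):
    """Return (masks, degrees) for a MADE with input ordering 1..d (requires d >= 2).

    masks[l][j][k] == 1 means unit k of layer l may feed unit j of layer l+1.
    """
    rng = random.Random(seed)
    degrees = [list(range(1, d + 1))]               # inputs: number = dimension index
    for h in hidden_sizes:                          # hidden units: random in [1, d-1]
        degrees.append([rng.randint(1, d - 1) for _ in range(h)])
    degrees.append(list(range(1, d + 1)))           # outputs: number = dimension index

    masks = []
    last = len(degrees) - 2                         # index of the output-layer mask
    for l in range(len(degrees) - 1):
        src, dst = degrees[l], degrees[l + 1]
        if l == last:
            mask = [[1 if s < t else 0 for s in src] for t in dst]    # strict
        else:
            mask = [[1 if s <= t else 0 for s in src] for t in dst]   # non-strict
        masks.append(mask)
    return masks, degrees


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def connectivity(masks):
    """Path counts from inputs to outputs: entry [i][j] > 0 means output i can see input j."""
    result = masks[0]
    for m in masks[1:]:
        result = matmul(m, result)
    return result


masks, degrees = build_made_masks(d=4, hidden_sizes=[8, 8])
paths = connectivity(masks)
# Autoregressive property: no path from input j to output i whenever j >= i.
is_autoregressive = all(
    paths[i][j] == 0 for i in range(4) for j in range(4) if j >= i
)
```

The check holds for any seed, because every path must satisfy input number <= hidden numbers < output number.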
This approach is efficient because:
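- When x is fully observed (training and density evaluation), all d conditionals are computed in a single forward pass, unlike RNN-based autoregressive models, which need d sequential steps. Sampling from MADE still takes d passes.
- The masks are precomputed and fixed, so they add only an element-wise multiplication at runtime.
- Several variable orderings and mask samples can be used, and their predictions averaged as an ensemble.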
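DenseNN provides a general-purpose dense (fully connected) neural network used as a building block. Because it has no masks, it does not enforce the autoregressive property by itself; it typically serves as the parameter network in coupling and conditional transforms, while AutoRegressiveNN (Pyro's MADE-style network) plays that role in autoregressive transforms.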
Together, these components form the backbone of normalizing flows (MAF, IAF) and autoregressive density estimators used in variational inference and density estimation.
Usage
Use autoregressive networks when:
- Building normalizing flow transforms (MAF, IAF) for flexible variational approximations.
- Constructing density estimators that can compute exact log-probabilities.
- Implementing conditional distributions where each dimension depends on previous ones.
- Requiring a single-pass architecture for efficient training of autoregressive models.
Theoretical Basis
Autoregressive factorization:
# Chain rule:
# p(x) = product_{i=1}^{d} p(x_i | x_{<i})
# Each conditional parameterized by neural network:
# p(x_i | x_{<i}) = f(x_i; theta_i(x_{<i}))
# Example: Gaussian autoregressive model
# mu_i, sigma_i = NN_i(x_1, ..., x_{i-1})
# x_i | x_{<i} ~ Normal(mu_i, sigma_i^2)
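As a worked instance of the Gaussian autoregressive model above, the following sketch evaluates log p(x) as a sum of conditional Gaussian log-densities. The function `conditional_params` is a hypothetical closed-form stand-in for the per-dimension network NN_i, chosen only so that the example is self-contained.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log-density of Normal(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def conditional_params(prefix):
    """Hypothetical stand-in for NN_i: maps x_<i to (mu_i, sigma_i)."""
    mu = 0.5 * sum(prefix)
    sigma = math.exp(0.1 * len(prefix))    # positive by construction
    return mu, sigma


def log_prob(x):
    """log p(x) = sum_i log p(x_i | x_<i)."""
    total = 0.0
    for i in range(len(x)):
        mu, sigma = conditional_params(x[:i])
        total += normal_logpdf(x[i], mu, sigma)
    return total


x = [0.3, -1.2, 0.7]
lp = log_prob(x)
```

The first dimension has an empty prefix, so its parameters are constants; in a trained model they would be learned biases.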
MADE masking scheme:
# For a network with layers W_1, W_2, ..., W_L:
# Assign connectivity numbers:
# m(input_i) = i for i = 1, ..., d
# m(hidden_j) = random integer in [1, d-1] (for each hidden unit j)
# m(output_i) = i for i = 1, ..., d
# Mask for layer l:
# M_l[j, k] = 1 if m(unit_k^{l-1}) <= m(unit_j^l)
# (for hidden layers)
# M_L[i, k] = 1 if m(unit_k^{L-1}) < m(output_i)
# (strict inequality for output layer ensures x_i cannot depend on x_i)
# Effective weight: W_l_masked = W_l * M_l (element-wise product)
# Result: output_i is a function of (x_1, ..., x_{i-1}) only
# All d outputs computed in a single forward pass
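The masking scheme above can also be checked functionally: perturb one input and see which outputs move. The sketch below assumes a single hidden layer, hand-picked connectivity numbers, random weights, and no biases; all of these are illustrative choices rather than anything prescribed by the source.

```python
import math
import random

d = 3
in_deg = [1, 2, 3]            # m(input_i) = i
hid_deg = [1, 2, 1, 2]        # hand-picked values in [1, d-1]
out_deg = [1, 2, 3]           # m(output_i) = i

# Hidden mask uses <=, output mask uses strict <.
M1 = [[1.0 if s <= t else 0.0 for s in in_deg] for t in hid_deg]
M2 = [[1.0 if s < t else 0.0 for s in hid_deg] for t in out_deg]

rng = random.Random(0)
W1 = [[rng.gauss(0, 1) for _ in in_deg] for _ in hid_deg]
W2 = [[rng.gauss(0, 1) for _ in hid_deg] for _ in out_deg]


def masked_linear(W, M, v):
    """Apply (W * M) @ v, with * the element-wise product."""
    return [
        sum(w * m * x for w, m, x in zip(w_row, m_row, v))
        for w_row, m_row in zip(W, M)
    ]


def made_forward(x):
    """One forward pass producing all d outputs (biases omitted for brevity)."""
    h = [math.tanh(a) for a in masked_linear(W1, M1, x)]
    return masked_linear(W2, M2, h)


x = [0.2, -0.5, 1.0]
base = made_forward(x)

# changed[j][i] is True when perturbing input j moves output i.
changed = []
for j in range(d):
    x_pert = list(x)
    x_pert[j] += 1.0
    out = made_forward(x_pert)
    changed.append([abs(a - b) > 1e-12 for a, b in zip(out, base)])
```

Output i never reacts to input j when j >= i. Without biases the first output is identically zero; with biases it would be a learned constant, which is the correct form for the unconditional p(x_1).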
Application to normalizing flows (MAF):
# Masked Autoregressive Flow (MAF):
# Parameters are conditioned on the data x:
# mu_i, log_sigma_i = MADE(x_1, ..., x_{i-1}), sigma_i = exp(log_sigma_i)
# Sampling direction (z -> x): sequential, d passes, because x_i needs x_{<i}
# x_i = mu_i + sigma_i * z_i
# Density-evaluation direction (x -> z): single pass, because all of x is known
# z_i = (x_i - mu_i) / sigma_i
# Log-determinant of the Jacobian dx/dz (triangular, so determinant = product of diagonal):
# log |det dx/dz| = sum_i log sigma_i
# Density: log p(x) = log p_base(z) - sum_i log sigma_i
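A minimal sketch of the MAF equations follows, with a hand-written `conditioner` standing in for MADE (a hypothetical function, not Pyro's API; it only guarantees that entry i depends on x_{<i}). It shows that mapping x to z needs one conditioner call, while mapping z to x needs d calls.

```python
import math


def conditioner(x):
    """Hypothetical stand-in for MADE: (mu_i, log_sigma_i) depend on x_<i only."""
    mus, log_sigmas = [], []
    for i in range(len(x)):
        s = sum(x[:i])
        mus.append(0.5 * s)
        log_sigmas.append(0.3 * math.tanh(s))
    return mus, log_sigmas


def maf_inverse(x):
    """x -> z with one conditioner call; also returns sum_i log sigma_i."""
    mus, log_sigmas = conditioner(x)
    z = [(xi - m) * math.exp(-ls) for xi, m, ls in zip(x, mus, log_sigmas)]
    return z, sum(log_sigmas)


def maf_forward(z):
    """z -> x sequentially: d conditioner calls, each fixing one more coordinate."""
    x = [0.0] * len(z)
    for i in range(len(z)):
        # Only x[:i] is final at this point, which is all that entry i reads.
        mus, log_sigmas = conditioner(x)
        x[i] = mus[i] + math.exp(log_sigmas[i]) * z[i]
    return x


def standard_normal_logpdf(z):
    """Log-density of a standard normal base distribution."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * zi * zi for zi in z)


def maf_log_prob(x):
    """log p(x) = log p_base(z) - sum_i log sigma_i."""
    z, sum_log_sigma = maf_inverse(x)
    return standard_normal_logpdf(z) - sum_log_sigma


z0 = [0.4, -1.1, 0.8]
x0 = maf_forward(z0)
z_back, _ = maf_inverse(x0)
round_trip_error = max(abs(a - b) for a, b in zip(z0, z_back))
```

The round trip recovers z0 up to floating-point rounding, which confirms that the two directions are mutual inverses under this conditioner.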
Inverse Autoregressive Flow (IAF):
# Reverse direction: fast sampling, slow density evaluation
# mu_i, log_sigma_i = MADE(z_1, ..., z_{i-1})
# x_i = mu_i + sigma_i * z_i
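# Sampling: single forward pass (fast)
# Density evaluation of an arbitrary x: sequential, d passes (slow)
# Density of a freshly drawn sample: available from the same pass, since z is known
# Complementary to MAF: IAF is preferred for variational inference
# (sampling is in the inner loop, and only the density of its own samples is needed),
# MAF for density estimation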
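The IAF direction can be sketched the same way. As before, `conditioner` is a hypothetical stand-in for MADE, here applied to the noise z rather than to the data x. Sampling takes one conditioner call and yields the sample's log-density as a by-product; inverting an arbitrary x takes d calls.

```python
import math


def conditioner(z):
    """Hypothetical stand-in for MADE: (mu_i, log_sigma_i) depend on z_<i only."""
    mus, log_sigmas = [], []
    for i in range(len(z)):
        s = sum(z[:i])
        mus.append(0.5 * s)
        log_sigmas.append(0.3 * math.tanh(s))
    return mus, log_sigmas


def iaf_forward(z):
    """z -> x with one conditioner call; returns x and log|det dx/dz|."""
    mus, log_sigmas = conditioner(z)
    x = [m + math.exp(ls) * zi for zi, m, ls in zip(z, mus, log_sigmas)]
    return x, sum(log_sigmas)


def iaf_inverse(x):
    """x -> z sequentially: z_i needs z_<i, so d conditioner calls."""
    z = [0.0] * len(x)
    for i in range(len(x)):
        mus, log_sigmas = conditioner(z)
        z[i] = (x[i] - mus[i]) * math.exp(-log_sigmas[i])
    return z


def standard_normal_logpdf(z):
    """Log-density of a standard normal base distribution."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * zi * zi for zi in z)


z0 = [0.4, -1.1, 0.8]
x0, log_det = iaf_forward(z0)
# Density of the sample just drawn, with no extra network passes.
log_q = standard_normal_logpdf(z0) - log_det

z_back = iaf_inverse(x0)
round_trip_error = max(abs(a - b) for a, b in zip(z0, z_back))
```

In a variational objective, `x0` and `log_q` are the two quantities needed per sample, and both come from the single fast pass.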