Principle:Pyro ppl Pyro Attend Infer Repeat

Knowledge Sources	Attend, Infer, Repeat: Fast Scene Understanding with Generative Models; Spatial Transformer Networks
Domains	Computer Vision, Object Detection, Generative Models
Last Updated	2026-02-09 09:00 GMT

Overview

Attend, Infer, Repeat (AIR) is a structured deep generative model that decomposes a visual scene into a variable number of objects by iteratively attending to, inferring the identity of, and rendering each object.

Description

Traditional generative models for images treat the entire image as a monolithic entity. AIR instead models images as compositions of individual objects, each described by:

Where: The position and scale of the object (learned via a spatial transformer).
What: The appearance or identity of the object (learned via a VAE-like latent code).
Whether: A binary decision of whether another object exists in the scene (learned via a recurrent attention mechanism).

The model generates an image by iteratively:

Deciding whether to add another object (Bernoulli random variable).
If yes, sampling the object's position and scale (continuous latent variables).
Sampling the object's appearance (continuous latent variables).
Rendering the object onto the canvas using a spatial transformer.

This process continues until the model decides to stop adding objects, yielding a variable-length generative process.
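
The generative loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Pyro's implementation: the names `sample_scene`, `p_present`, `max_steps`, and `what_dim`, the cap on the number of steps, and the latent sizes are all hypothetical, and decoding and rendering are left out.

```python
import random

def sample_scene(rng, p_present=0.5, max_steps=3, what_dim=4):
    """Sample latents for a variable number of objects (illustrative only)."""
    objects = []
    for _ in range(max_steps):
        # Whether: Bernoulli decision to add another object.
        z_pres = 1 if rng.random() < p_present else 0
        if z_pres == 0:
            break  # the model stops adding objects
        # Where: scale and 2-D translation, drawn from a standard normal prior.
        z_where = [rng.gauss(0.0, 1.0) for _ in range(3)]
        # What: appearance code, drawn from a standard normal prior.
        z_what = [rng.gauss(0.0, 1.0) for _ in range(what_dim)]
        objects.append({"z_where": z_where, "z_what": z_what})
    return objects

rng = random.Random(0)
scenes = [sample_scene(rng) for _ in range(1000)]
counts = [len(scene) for scene in scenes]  # object count varies per scene
```

Because the loop exits at the first `z_pres == 0`, the number of objects is a random quantity rather than a fixed hyperparameter; the cap `max_steps` only bounds it.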

The inference network (guide) reverses this process: given an image, it uses a recurrent neural network to:

Attend to a region of the image.
Infer the latent variables (what, where) for the attended object.
Decide whether more objects remain.

AIR is significant because it demonstrates that probabilistic programming can express complex structured models that decompose scenes into objects, learn to count objects, and handle variable numbers of entities -- all trained end-to-end with variational inference.

Usage

Use the AIR model when:

You need unsupervised object discovery in images without bounding box annotations.
The scene contains a variable number of objects from a common class.
You want a generative model that can be used for both generation and inference.
You want the model to learn to count objects as a byproduct of scene understanding.
You are building structured scene representations for downstream reasoning tasks.

Theoretical Basis

Generative model:

# For image x:
# n ~ geometric or bounded discrete  (number of objects)
# canvas = 0                         (start from an empty canvas)
# For i = 1, ..., n:
#     z_where_i ~ Normal(0, I)      (position, scale)
#     z_what_i ~ Normal(0, I)       (appearance)
#     object_i = decode(z_what_i)    (neural network decoder)
#     canvas += render(object_i, z_where_i)  (spatial transformer)
# x ~ Normal(canvas, sigma^2)       (observation noise)

# Joint distribution:
# p(x, n, z) = p(n) * [product_{i=1..n} p(z_where_i) * p(z_what_i)] * p(x | canvas(z))
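
As a worked instance of the joint distribution, the sketch below evaluates log p(x, n, z) term by term for a flattened image. It assumes a geometric count prior p(n) = (1 - rho) * rho^n (one of the two options named above) and independent Gaussian pixel noise; `log_normal`, `log_joint`, `rho`, and `sigma` are hypothetical names and values chosen for illustration, and the canvas is passed in already rendered.

```python
import math

def log_normal(x, mean=0.0, std=1.0):
    """Log density of a univariate Normal(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def log_joint(x, canvas, z_where, z_what, rho=0.5, sigma=0.3):
    """log p(x, n, z) with n = len(z_where), assuming a geometric prior on n."""
    n = len(z_where)
    lp = math.log(1 - rho) + n * math.log(rho)      # log p(n) (assumed geometric)
    for zw, zh in zip(z_where, z_what):             # product over the n objects
        lp += sum(log_normal(v) for v in zw)        # log p(z_where_i)
        lp += sum(log_normal(v) for v in zh)        # log p(z_what_i)
    # log p(x | canvas): independent Gaussian noise on every pixel
    lp += sum(log_normal(xi, ci, sigma) for xi, ci in zip(x, canvas))
    return lp

x = [0.0, 0.0]
canvas = [0.0, 0.0]
lp_empty = log_joint(x, canvas, [], [])
lp_one = log_joint(x, canvas, [[0.0, 0.0, 0.0]], [[0.0, 0.0]])
```

Adding one object contributes exactly one extra factor of rho and one prior term per latent dimension, which is how the model trades explanatory power against the number of objects.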

Spatial transformer:

# z_where = (s, tx, ty)  where s=scale, tx,ty=translation
# Transform matrix:
# A = [[s, 0, tx],
#      [0, s, ty]]

# Render: place a small object image onto a larger canvas
# render(object, z_where) = spatial_transform(object, A)

# Attend: extract a patch from the image (inverse transform)
# A_inverse = [[1/s, 0, -tx/s],
#              [0, 1/s, -ty/s]]
# attend(image, z_where) = spatial_transform(image, A_inverse)
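
The relationship between the render and attend transforms can be checked numerically on a single coordinate. The sketch below is plain Python with hypothetical helper names (`affine`, `invert_z_where`, `apply`); it only manipulates coordinates and does not resample pixels. Which of the two matrices is used for rendering and which for attending depends on the sampling-grid convention of the spatial transformer implementation, so treat the direction here as an assumption.

```python
def affine(z_where):
    """Build the 2x3 matrix A = [[s, 0, tx], [0, s, ty]] from z_where = (s, tx, ty)."""
    s, tx, ty = z_where
    return [[s, 0.0, tx], [0.0, s, ty]]

def invert_z_where(z_where):
    """Parameters of the inverse map: scale 1/s and translation (-tx/s, -ty/s)."""
    s, tx, ty = z_where
    return (1.0 / s, -tx / s, -ty / s)

def apply(A, point):
    """Apply a 2x3 affine matrix to a 2-D point (x, y)."""
    x, y = point
    return (A[0][0] * x + A[0][1] * y + A[0][2],
            A[1][0] * x + A[1][1] * y + A[1][2])

z_where = (0.5, 0.2, -0.1)
p = (0.4, -0.6)
q = apply(affine(z_where), p)                        # forward transform
p_back = apply(affine(invert_z_where(z_where)), q)   # inverse recovers p
```

Because the transform has only a uniform scale and a translation, its inverse stays in the same three-parameter family, so both directions can be driven by the same `z_where`.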

Inference network (guide):

# Recurrent inference:
# h_0 = initial hidden state
# For i = 1, 2, ...:
#     # Decide whether another object exists:
#     z_pres_i ~ Bernoulli(sigmoid(MLP(h_{i-1})))
#     if z_pres_i == 0: stop
#
#     # Infer object location:
#     z_where_i ~ Normal(mu_where(h_{i-1}), sigma_where(h_{i-1}))
#
#     # Attend to object:
#     glimpse_i = attend(image, z_where_i)
#
#     # Infer object identity:
#     z_what_i ~ Normal(mu_what(glimpse_i), sigma_what(glimpse_i))
#
#     # Update RNN state:
#     h_i = RNN(h_{i-1}, z_what_i, z_where_i, glimpse_i)
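
The control flow of the recurrence, and how log q(z | x) accumulates along it, can be sketched with toy stand-ins for the networks. Everything numeric here is an assumption made for illustration: the scalar hidden state, the one-dimensional latents, and the closed-form expressions standing in for the MLP, the encoders, the attention step, and the RNN are hypothetical and are not taken from Pyro's implementation.

```python
import math
import random

def log_normal(x, mean, std):
    """Log density of a univariate Normal(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def guide_trace(image, rng, max_steps=3):
    """Run a toy recurrent guide; return sampled latents and log q(z | x)."""
    h = 0.0                      # h_0: initial hidden state (a scalar here)
    latents, log_q = [], 0.0
    for _ in range(max_steps):
        # Stand-in for sigmoid(MLP(h)): probability another object exists.
        p = 1.0 / (1.0 + math.exp(-(1.0 - h)))
        z_pres = 1 if rng.random() < p else 0
        log_q += math.log(p if z_pres else 1.0 - p)   # stopping step counts too
        if z_pres == 0:
            break
        # Stand-in for mu_where(h), sigma_where(h).
        mu_where, sd_where = math.tanh(h), 0.5
        z_where = rng.gauss(mu_where, sd_where)
        log_q += log_normal(z_where, mu_where, sd_where)
        # Stand-in for attend(image, z_where): pick one pixel as the glimpse.
        glimpse = image[int(abs(z_where) * len(image)) % len(image)]
        # Stand-in for mu_what(glimpse), sigma_what(glimpse).
        mu_what, sd_what = glimpse, 0.5
        z_what = rng.gauss(mu_what, sd_what)
        log_q += log_normal(z_what, mu_what, sd_what)
        # Stand-in for the RNN state update.
        h = math.tanh(h + z_what + z_where + glimpse)
        latents.append((z_where, z_what))
    return latents, log_q

latents, log_q = guide_trace([0.1, 0.5, 0.9], random.Random(1))
```

Note that the step at which the guide stops still contributes a Bernoulli term to log q, while `z_where` and `z_what` contribute only for steps where an object is present.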

ELBO for variable-length models:

# ELBO = E_q[log p(x | z) + log p(z) - log q(z | x)]
# where z = {z_pres, z_where_{1:n}, z_what_{1:n}} and n is the number of steps with z_pres_i = 1

# Challenge: z_pres is discrete (Bernoulli) -> not reparameterizable
# Solution: use NVIL (neural variational inference and learning)
#   with learned baseline to reduce variance of score function gradient

# The discrete z_pres variables make this a challenging inference problem
# that showcases the flexibility of probabilistic programming
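
The effect of a baseline on the score function gradient can be seen on a single Bernoulli variable. The sketch below enumerates both outcomes exactly rather than sampling, and it uses a constant baseline; NVIL learns a data-dependent baseline, so this illustrates only the variance-reduction idea. The function `score_function_stats`, the signal `f`, and the baseline value are hypothetical.

```python
import math

def score_function_stats(theta, f, baseline=0.0):
    """Exact mean and variance of (f(z) - b) * d/dtheta log q(z), z ~ Bernoulli."""
    p = 1.0 / (1.0 + math.exp(-theta))       # q(z = 1) = sigmoid(theta)
    outcomes = [(1, p), (0, 1.0 - p)]
    # For a Bernoulli with logit theta, d/dtheta log q(z) = z - p.
    terms = [((f(z) - baseline) * (z - p), prob) for z, prob in outcomes]
    mean = sum(g * prob for g, prob in terms)
    var = sum((g - mean) ** 2 * prob for g, prob in terms)
    return mean, var

def f(z):
    """Hypothetical learning signal with a large constant offset."""
    return 10.0 + z

theta = 0.0
p = 1.0 / (1.0 + math.exp(-theta))
true_grad = p * (1.0 - p) * (f(1) - f(0))    # d/dtheta of E_q[f(z)]

mean_plain, var_plain = score_function_stats(theta, f)
mean_base, var_base = score_function_stats(theta, f, baseline=10.5)
```

Both estimators have the same expectation, because the score has zero mean under q and subtracting a constant therefore adds no bias; only the variance changes.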
