
Principle:Pyro ppl Pyro Data Provenance

From Leeroopedia


Knowledge Sources
Domains: Data Provenance, Probabilistic Programming, Automatic Differentiation
Last Updated: 2026-02-09 09:00 GMT

Overview

Data provenance tracking records the origin and flow of tensor values through computations, enabling the probabilistic programming system to determine which sample sites influence which downstream computations.

Description

In a probabilistic program, the dependency structure between sample sites is crucial for inference. For example, the TraceGraph ELBO (TraceGraph_ELBO in Pyro) uses dependency information to decide which cost terms enter the gradient estimator for each non-reparameterizable sample site (Rao-Blackwellization). Determining these dependencies requires knowing which tensors "flow into" which computations.

Provenance tracking solves this by attaching provenance metadata to tensors that records which sample sites contributed to their values. As tensors flow through computations (additions, multiplications, function calls), the provenance metadata is propagated: the output of any operation inherits the provenance of all its inputs.

The ProvenanceTensor wraps a standard PyTorch tensor with a set of provenance tags (typically sample site names). When two ProvenanceTensors are combined in an operation, the result's provenance is the union of the inputs' provenances.

This mechanism enables:

  • Dependency graph construction: After running the model, the provenance of each site's distribution parameters reveals which upstream sites influence it.
  • Rao-Blackwellization: The dependency graph identifies which ELBO terms are downstream of each non-reparameterizable site.
  • Debugging: Provenance helps identify unexpected dependencies between model components.
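  • Conditional independence verification: If the distribution parameters of two sites have disjoint provenance, the sites share no direct upstream sample sites. This is a useful check on an assumed independence structure, although a full conditional independence claim requires reasoning over the whole dependency graph.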

The tracking is lightweight: only set operations (union) are performed, and the metadata does not affect the numerical computation.

Usage

Use data provenance tracking when:

  • Building dependency graphs for variance reduction (TraceGraph ELBO).
  • Debugging probabilistic programs to understand which sites affect which.
  • Verifying conditional independence assumptions in the model.
  • Implementing inference algorithms that exploit graphical structure.
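The debugging use case can be sketched as a comparison between the dependencies a modeller intended and the ones a tracking run reported. The helper `diff_dependencies`, the site names, and both dictionaries below are hypothetical and are not part of Pyro; each dictionary maps a site to the set of upstream sites found in the provenance of its distribution parameters.

```python
def diff_dependencies(observed, expected):
    """Return (unexpected, missing) edges as sets of (upstream, downstream) pairs."""
    obs = {(i, j) for j, ups in observed.items() for i in ups}
    exp = {(i, j) for j, ups in expected.items() for i in ups}
    return obs - exp, exp - obs


# Hypothetical example: the modeller intended "noise" to be independent of "weight".
expected = {"weight": set(), "noise": set(), "y": {"weight", "noise"}}
observed = {"weight": set(), "noise": {"weight"}, "y": {"weight", "noise"}}

unexpected, missing = diff_dependencies(observed, expected)
print("unexpected edges:", unexpected)
print("missing edges:", missing)
```

An unexpected edge such as weight -> noise points at a place where a value derived from one site leaked into the parameters of another.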

Theoretical Basis

Provenance propagation rules:

# Each tensor t has a provenance set: prov(t) = set of sample site names

# For sample sites:
# z = pyro.sample("z", dist)
# prov(z) = {"z"}

# For operations:
# y = f(t_1, t_2, ..., t_k)
# prov(y) = prov(t_1) union prov(t_2) union ... union prov(t_k)

# For constants (data, parameters):
# prov(c) = {}  (empty set)
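The three rules above can be illustrated with a minimal pure-Python sketch that tracks provenance on plain floats rather than tensors. The `Tracked` class and the `sample` helper are hypothetical stand-ins chosen for illustration; they are not Pyro's API and support only addition and multiplication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracked:
    """A plain float paired with the set of sample-site names it depends on."""

    value: float
    prov: frozenset = frozenset()

    def _combine(self, other, op):
        # Constants (plain numbers) carry the empty provenance set.
        if not isinstance(other, Tracked):
            other = Tracked(float(other))
        # Output provenance is the union of the input provenances.
        return Tracked(op(self.value, other.value), self.prov | other.prov)

    def __add__(self, other):
        return self._combine(other, lambda a, b: a + b)

    def __mul__(self, other):
        return self._combine(other, lambda a, b: a * b)

    # Both operations are commutative, so the reflected versions can reuse them.
    __radd__ = __add__
    __rmul__ = __mul__


def sample(name, value):
    """Hypothetical stand-in for a sample statement: tags the value with its site name."""
    return Tracked(value, frozenset({name}))


a = sample("a", 0.5)
b = sample("b", 2.0)
loc = 3.0 * a + b      # depends on both sites
scale = a * a + 1.0    # depends on "a" only

print(sorted(loc.prov), sorted(scale.prov))
```

Here `loc` ends up with provenance {"a", "b"} and `scale` with {"a"}, while the numerical values are the same as they would be without tracking.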

Dependency graph extraction:

# After running model with provenance tracking:
# For each sample site s:
#   s.distribution_params have provenance sets

# Site j depends on site i if:
#   i in prov(dist_params(j))

# Dependency graph G:
# Nodes: sample sites
# Edge i -> j exists iff prov(dist_params(j)) contains i

# This graph is a DAG (directed acyclic graph) for valid programs
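As a sketch of the extraction step, the following turns per-site provenance sets into an edge set and checks acyclicity with Kahn's algorithm. The function names and the three-site `param_prov` dictionary are assumptions made for illustration; in practice the sets would come from a provenance-tracking run.

```python
def dependency_edges(param_prov):
    """Build the set of (upstream, downstream) edges.

    `param_prov[j]` is assumed to be the provenance set of site j's
    distribution parameters.
    """
    return {(i, j) for j, parents in param_prov.items() for i in parents}


def topological_order(sites, edges):
    """Kahn's algorithm; raises ValueError if the graph has a cycle."""
    indegree = {s: 0 for s in sites}
    children = {s: [] for s in sites}
    for i, j in edges:
        indegree[j] += 1
        children[i].append(j)
    ready = sorted(s for s in sites if indegree[s] == 0)
    order = []
    while ready:
        s = ready.pop(0)
        order.append(s)
        for c in sorted(children[s]):
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if len(order) != len(sites):
        raise ValueError("dependency graph contains a cycle")
    return order


# Hypothetical provenance sets for a model with edges a -> b, a -> c, b -> c.
param_prov = {
    "a": frozenset(),
    "b": frozenset({"a"}),
    "c": frozenset({"a", "b"}),
}
edges = dependency_edges(param_prov)
order = topological_order(list(param_prov), edges)
print(sorted(edges), order)
```

A successful topological sort confirms the DAG property; a `ValueError` would indicate that the recorded provenance is inconsistent with a valid execution order.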

Application to Rao-Blackwellization:

# For non-reparameterizable site z_i:
# Score function gradient:
# grad_phi_i ELBO = E[cost_i * grad_phi_i log q(z_i)]

# Full cost: cost_i = sum over ALL sites: log p_j - log q_j
# Rao-Blackwellized cost: cost_i = sum over DOWNSTREAM sites only

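# downstream(i) = {i} union {j : j is reachable from i in G}
# (direct children satisfy i in prov(dist_params(j)); because prov(z) = {"z"}
#  resets at each sample site, indirect descendants need the transitive closure)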

# Variance reduction:
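# terms not in downstream(i) are independent of z_i
# E[independent_term * grad log q(z_i)] = E[independent_term] * E[grad log q(z_i)] = 0
#   because the score function has zero mean: E[grad log q(z_i)] = 0
# Removing these zero-expectation terms keeps the estimator unbiased and reduces variance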
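The variance argument can be checked numerically on a toy problem. The sketch below assumes q(z) = Normal(mu, 1), one cost term that depends on z (here z squared) and one unrelated term drawn independently of z; these choices are illustrative only and do not come from Pyro. For this setup the exact gradient of E[z^2] with respect to mu is 2 * mu, so both estimators should average to roughly that value while differing in variance.

```python
import random
import statistics

random.seed(0)
mu = 1.0   # variational parameter of q(z) = Normal(mu, 1)
n = 20000  # number of Monte Carlo samples

full, reduced = [], []
for _ in range(n):
    z = random.gauss(mu, 1.0)
    score = z - mu                      # d/dmu log Normal(z; mu, 1)
    downstream = z * z                  # toy cost term that depends on z
    unrelated = random.gauss(3.0, 1.0)  # toy cost term independent of z
    full.append((downstream + unrelated) * score)
    reduced.append(downstream * score)

mean_full = statistics.fmean(full)
mean_reduced = statistics.fmean(reduced)
var_full = statistics.variance(full)
var_reduced = statistics.variance(reduced)

print(f"mean (all terms):        {mean_full:.3f}")
print(f"mean (downstream only):  {mean_reduced:.3f}")
print(f"variance (all terms):       {var_full:.1f}")
print(f"variance (downstream only): {var_reduced:.1f}")
```

Dropping the unrelated term leaves the mean essentially unchanged and lowers the per-sample variance, which is the effect the dependency graph is used to obtain.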

Implementation strategy:

# ProvenanceTensor wraps a tensor with metadata (schematic):
from torch import Tensor

class ProvenanceTensor:
    data: Tensor          # the actual numerical data
    provenance: frozenset # set of site names

# Override __torch_function__ to propagate provenance:
# When PyTorch calls any operation on ProvenanceTensors:
#   1. Extract .data from all ProvenanceTensor inputs
#   2. Run the operation on plain tensors
#   3. Wrap result with union of all input provenances
#   4. Return ProvenanceTensor(result, combined_provenance)
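The four steps can be sketched without PyTorch as a generic dispatcher. This is a simplified analogue under stated assumptions: `Wrapped` and `call_with_provenance` are hypothetical names, the wrapper holds any Python value instead of a tensor, and the dispatch is an explicit function call rather than the `__torch_function__` hook.

```python
class Wrapped:
    """Hypothetical stand-in for a provenance-carrying value (not Pyro's class)."""

    def __init__(self, data, provenance=frozenset()):
        self.data = data
        self.provenance = frozenset(provenance)


def _unwrap(arg):
    return arg.data if isinstance(arg, Wrapped) else arg


def call_with_provenance(func, *args, **kwargs):
    """Run `func` on unwrapped inputs and rewrap the result."""
    inputs = list(args) + list(kwargs.values())
    wrapped = [a for a in inputs if isinstance(a, Wrapped)]
    # Step 1: extract the underlying data from wrapped inputs.
    plain_args = [_unwrap(a) for a in args]
    plain_kwargs = {k: _unwrap(v) for k, v in kwargs.items()}
    # Step 2: run the operation on plain values.
    result = func(*plain_args, **plain_kwargs)
    # Step 3: union the provenance of every wrapped input.
    combined = frozenset().union(*(w.provenance for w in wrapped))
    # Step 4: return the result wrapped with the combined provenance.
    return Wrapped(result, combined)


x = Wrapped(2.0, {"x"})
y = Wrapped(5.0, {"y"})

cubed = call_with_provenance(pow, x, 3)        # plain 3 contributes no provenance
largest = call_with_provenance(max, x, y, 1.0)

print(cubed.data, sorted(cubed.provenance))
print(largest.data, sorted(largest.provenance))
```

Note that the union is conservative: `largest` records both "x" and "y" even though only one input determines the value, because the propagation looks at which inputs were passed, not at how the operation used them.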
