Principle:Pyro ppl Pyro JIT Compilation
| Knowledge Sources | |
|---|---|
| Domains | Compiler Optimization, Deep Learning, Performance |
| Last Updated | 2026-02-09 09:00 GMT |
Overview
JIT (Just-In-Time) compilation converts PyTorch computations, by tracing or scripting, into an intermediate representation that can be fused, optimized, and executed more efficiently than eager-mode Python code.
Description
PyTorch operates by default in eager mode: each operation is executed immediately as Python encounters it. While this makes debugging easy, it incurs Python interpreter overhead for every operation and prevents cross-operation optimizations.
JIT compilation addresses this by compiling PyTorch computations into TorchScript, an optimized intermediate representation. The compilation can happen via:
- Tracing: Execute the function once with example inputs, recording all tensor operations. The resulting trace is a static computation graph that can be optimized.
- Scripting: Parse the Python source code directly into TorchScript, preserving control flow (if/else, loops).
For probabilistic programming, JIT compilation is particularly valuable because inference runs the model thousands or millions of times with the same structure but different parameter values, so per-call Python interpretation overhead adds up to a significant share of total runtime.
However, probabilistic programs pose unique challenges for JIT:
- Dynamic control flow: The execution path may depend on sampled values.
- Effect handlers: Pyro's messenger system involves complex Python control flow that is difficult to trace.
- Late-known structure: The model's tensor shapes and structure may not be known until the first execution.
Pyro's LazyJIT (the lazy tracing wrapper that Pyro exposes as `pyro.ops.jit.trace`) provides a solution: it delays compilation until the first execution, when concrete input shapes are available and the control-flow path actually taken can be recorded. After the first call, subsequent calls reuse the compiled version. This combines the ease of eager-mode model specification with the performance of compiled execution.
Usage
Use JIT compilation when:
- Running inference with many iterations (SVI, MCMC) where per-iteration overhead matters.
- The model has a fixed computation graph (no data-dependent control flow).
- You need to deploy a trained model for fast inference in production.
- Profiling shows that Python overhead is a significant fraction of total runtime.
- Running on GPU, where fusing operations reduces the number of kernel launches and their overhead.
Theoretical Basis
Tracing vs. scripting:
# Tracing: records operations on example inputs
# + Handles any Python code that produces tensor operations
# - Cannot handle data-dependent control flow
# - The trace is fixed: if/else branches are baked in
# Scripting: parses Python source
# + Preserves control flow (if/else, for loops)
# - Limited to a subset of Python (TorchScript-compatible)
# - Cannot handle dynamic Python features (eval, exec, etc.)
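The "branches are baked in" limitation of tracing can be illustrated without PyTorch. The sketch below is a toy tracer, not TorchScript itself: `Traced`, `trace`, and `model` are hypothetical names introduced for illustration, and plain floats stand in for tensors. It records operations while running the function once on an example input, then replays that recording on new inputs.

```python
class Traced:
    """Stand-in for a tensor value that logs each operation applied to it."""

    def __init__(self, value, tape):
        self.value = value
        self.tape = tape  # shared list of (op_name, constant) records

    def __mul__(self, c):
        self.tape.append(("mul", c))
        return Traced(self.value * c, self.tape)

    def __sub__(self, c):
        self.tape.append(("sub", c))
        return Traced(self.value - c, self.tape)

    def __gt__(self, c):
        # The comparison returns a plain Python bool, so the tape never
        # records the branch, only the operations on the path taken.
        return self.value > c


def trace(fn, example):
    """Run fn once on the example input and return a replay of the recorded ops."""
    tape = []
    fn(Traced(example, tape))

    def replay(x):
        for op, c in tape:
            x = x * c if op == "mul" else x - c
        return x

    return replay


def model(x):
    # Data-dependent control flow: the branch depends on the input value.
    if x > 0:
        return x * 2
    return x - 1


traced_model = trace(model, 3.0)    # the example input takes the "x > 0" branch

eager_result = model(-5.0)          # -6.0: eager Python takes the other branch
traced_result = traced_model(-5.0)  # -10.0: the recorded "mul by 2" is replayed
```

The traced version silently diverges from the eager one as soon as an input would have taken the other branch. Scripting avoids this by compiling the `if` itself rather than one path through it.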
Optimization passes applied by JIT:
# 1. Operator fusion: combine multiple elementwise ops into one kernel
# Example: y = relu(x * w + b) -> one fused elementwise kernel
# Reduces memory bandwidth and kernel launch overhead
# 2. Constant folding: precompute expressions with known values
# Example: scale = 1.0 / sqrt(64) -> scale = 0.125
# 3. Dead code elimination: remove unused computations
# 4. Memory planning: reuse buffers for intermediate tensors
# Reduces memory allocation/deallocation overhead
# 5. Algebraic simplification: simplify redundant operations
# Example: x * 1.0 -> x, x + 0.0 -> x
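Three of the passes above (constant folding, dead code elimination, and algebraic simplification) can be sketched on a toy expression tree. This is an illustration under stated assumptions, not how TorchScript implements its passes: the tuple-based IR and the names `simplify`, `free_vars`, and `eliminate_dead_code` are hypothetical, and each variable is assumed to be assigned exactly once. Operator fusion and memory planning are not modeled.

```python
import math

# Folding rules for nodes whose operands are all constants.
FOLD = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b,
    "sqrt": lambda a: math.sqrt(a),
}
# Neutral elements used for algebraic simplification.
IDENTITY = {"mul": 1.0, "add": 0.0}


def simplify(node):
    """Fold constants and drop identity operations, working bottom-up."""
    kind = node[0]
    if kind in ("const", "var"):
        return node
    args = [simplify(a) for a in node[1:]]
    if all(a[0] == "const" for a in args):
        # Constant folding: every operand is known, so evaluate now.
        return ("const", FOLD[kind](*(a[1] for a in args)))
    if kind in IDENTITY:
        # Algebraic simplification: x * 1.0 -> x, x + 0.0 -> x
        neutral = ("const", IDENTITY[kind])
        if args[1] == neutral:
            return args[0]
        if args[0] == neutral:
            return args[1]
    return (kind, *args)


def free_vars(node):
    """Names of all variables read by an expression."""
    if node[0] == "var":
        return {node[1]}
    if node[0] == "const":
        return set()
    return set().union(*(free_vars(a) for a in node[1:]))


def eliminate_dead_code(program, outputs):
    """Keep only the assignments that the requested outputs depend on."""
    live, kept = set(outputs), []
    for name, expr in reversed(program):
        if name in live:
            kept.append((name, expr))
            live |= free_vars(expr)
    return kept[::-1]


x = ("var", "x")
program = [
    # scale = 1.0 / sqrt(64)
    ("scale", ("div", ("const", 1.0), ("sqrt", ("const", 64.0)))),
    # unused = x + 5.0 (never read)
    ("unused", ("add", x, ("const", 5.0))),
    # y = (x * 1.0 + 0.0) * scale
    ("y", ("mul", ("add", ("mul", x, ("const", 1.0)), ("const", 0.0)),
           ("var", "scale"))),
]

simplified = [(name, simplify(expr)) for name, expr in program]
optimized = eliminate_dead_code(simplified, outputs=["y"])
# optimized == [("scale", ("const", 0.125)),
#               ("y", ("mul", ("var", "x"), ("var", "scale")))]
```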
Lazy JIT pattern for probabilistic programs:
# Problem: model shape is unknown until first call
# Solution: delay compilation
# LazyJIT wrapper:
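import torch
class LazyJIT:
    def __init__(self, model):
        self.model = model
        self.compiled = None  # filled in on the first call

    def __call__(self, *args):
        if self.compiled is None:
            # First call: trace the model with the actual inputs
            self.compiled = torch.jit.trace(self.model, args)
        return self.compiled(*args)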
# Benefits:
# - No need to specify shapes upfront
# - First call pays compilation cost
# - All subsequent calls use optimized version
# - Compatible with Pyro's effect handler system
# (handlers are "baked in" during tracing)
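The "compile once, reuse afterwards" behavior listed above can be checked with a stand-in compiler. In this minimal sketch, `LazyCompiled`, `fake_compile`, and `log_density` are hypothetical names; `fake_compile` only records that it was invoked and returns the function unchanged, whereas a real tracer would return an optimized callable.

```python
class LazyCompiled:
    """Compile `fn` on the first call, then reuse the compiled version."""

    def __init__(self, fn, compile_fn):
        self.fn = fn
        self.compile_fn = compile_fn  # hypothetical stand-in for a tracer
        self.compiled = None

    def __call__(self, *args):
        if self.compiled is None:
            # Concrete inputs exist here, so nothing has to be declared upfront.
            self.compiled = self.compile_fn(self.fn, args)
        return self.compiled(*args)


compile_calls = []


def fake_compile(fn, example_args):
    """Stand-in compiler: log the example inputs and return fn unchanged."""
    compile_calls.append(example_args)
    return fn


def log_density(x, scale):
    # Unnormalized Gaussian log-density, used only as a small workload.
    return -0.5 * (x / scale) ** 2


step = LazyCompiled(log_density, fake_compile)
results = [step(x, 2.0) for x in (1.0, 2.0, 4.0)]

# Three calls were made, but compilation ran once, with the first call's inputs.
print(len(compile_calls), compile_calls[0], results)
```

One consequence of this design is that whatever the first call looked like (shapes, branch taken) determines the compiled version that every later call reuses.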
Performance model:
# Total time per inference step:
# T_eager = T_python + T_compute
# T_jit = T_compute_optimized (T_python assumed negligible after compilation; T_compute reduced by fusion)
# Speedup = T_eager / T_jit
# Typically 1.5x - 5x for small models (Python-overhead dominated)
# Smaller gains for large models (compute-dominated)
# When JIT helps most:
# - Many small operations (high Python overhead ratio)
# - GPU execution (kernel launch overhead amortized)
# - Repeated execution (compilation cost amortized)
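The performance model above can be turned into a small calculation. All timings below are hypothetical values chosen for illustration, not measurements, and the one-off compilation cost `t_compile` is an added assumption used to show how repeated execution amortizes it.

```python
import math


def jit_speedup(t_python, t_compute, t_compute_optimized):
    """Speedup = T_eager / T_jit under the simple model above."""
    t_eager = t_python + t_compute
    t_jit = t_compute_optimized
    return t_eager / t_jit


def break_even_steps(t_compile, t_eager, t_jit):
    """Number of steps after which the one-off compile cost is repaid."""
    return math.ceil(t_compile / (t_eager - t_jit))


# Hypothetical per-step timings in microseconds.
small = jit_speedup(t_python=60.0, t_compute=40.0, t_compute_optimized=32.0)
large = jit_speedup(t_python=60.0, t_compute=4000.0, t_compute_optimized=3200.0)

# Hypothetical 2-second compile for the small model (eager 100 us, JIT 32 us).
steps = break_even_steps(t_compile=2_000_000.0, t_eager=100.0, t_jit=32.0)

print(f"small model speedup: {small:.2f}x")
print(f"large model speedup: {large:.2f}x")
print(f"break-even after {steps} steps")
```

With these assumed numbers the Python-overhead-dominated model gains far more than the compute-dominated one, and the compile cost only pays off over tens of thousands of steps, which is why the gain shows up mainly in long SVI or MCMC runs.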