Principle:MaterializeInc Materialize Workload Replay

Knowledge Sources	MaterializeInc_Materialize
Domains	Testing, Benchmarking, Quality Assurance
Last Updated	2026-02-08 00:00 GMT

Overview

Workload Replay is the practice of capturing real production workload patterns and replaying them against test environments to validate correctness, performance, and resource consumption across Materialize versions.

Motivation

Synthetic benchmarks and unit tests cannot fully represent the complexity and diversity of production workloads. Real customer environments involve heterogeneous data sources (Kafka, PostgreSQL, MySQL, SQL Server, webhooks), complex query patterns with varying concurrency, and data distributions that follow long-tail patterns rather than uniform random distributions. Workload Replay bridges this gap by enabling reproducible, production-representative testing.

Key Tenets

Production Fidelity

Captured workloads preserve the structural characteristics of production environments: the exact schema definitions, source types, connection configurations, cluster sizes, and query patterns. Synthetic data generation uses long-tail distributions (long_tail_int, long_tail_text, long_tail_choice) that mimic real-world data skew rather than uniform random sampling.
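The difference between long-tail and uniform sampling can be illustrated with a minimal sketch. The functions below are hypothetical stand-ins written for this page, not the framework's long_tail_int or long_tail_choice; the Pareto shape parameter and the Zipf-like weights are assumptions chosen only to show the skew.

```python
import random


def sketch_long_tail_int(rng: random.Random, low: int, high: int, alpha: float = 1.5) -> int:
    """Draw an integer in [low, high] that is heavily skewed toward `low`."""
    # paretovariate returns floats >= 1.0 with a heavy right tail,
    # so most offsets are 0 and a few are very large.
    offset = int(rng.paretovariate(alpha)) - 1
    return min(low + offset, high)


def sketch_long_tail_choice(rng: random.Random, choices: list, alpha: float = 1.5):
    """Pick from `choices`, favouring earlier entries with Zipf-like weights."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, len(choices) + 1)]
    return rng.choices(choices, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed keeps the generated data reproducible
ints = [sketch_long_tail_int(rng, 0, 1000) for _ in range(10_000)]
picks = [sketch_long_tail_choice(rng, ["a", "b", "c", "d"]) for _ in range(10_000)]
```

With these assumed parameters, most integers land at the low end of the range and the first choice dominates, which is the kind of skew a uniform sampler would not produce.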

Configurable Scaling

Workload parameters are independently scalable through factor multipliers: factor_initial_data controls the volume of seed data, factor_ingestions controls the continuous ingestion rate, and factor_queries controls query concurrency. This allows testing at various scales from smoke tests to full-scale production simulations.
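As a rough illustration of independent scaling, the sketch below applies the three factors to a set of baseline figures. Only the factor names come from the framework; the ScaleFactors class, the scale_workload helper, and the baseline numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleFactors:
    factor_initial_data: float = 1.0  # multiplies the volume of seed data
    factor_ingestions: float = 1.0    # multiplies the continuous ingestion rate
    factor_queries: float = 1.0       # multiplies query concurrency


def scale_workload(base_rows: int, base_ingest_rate: float, base_query_workers: int,
                   factors: ScaleFactors) -> dict:
    """Apply each factor to its own dimension; the dimensions never interact."""
    return {
        "initial_rows": max(1, round(base_rows * factors.factor_initial_data)),
        "ingest_rate": base_ingest_rate * factors.factor_ingestions,
        "query_workers": max(1, round(base_query_workers * factors.factor_queries)),
    }


# Hypothetical smoke-test profile: little seed data, light ingestion, half the queries.
smoke = ScaleFactors(factor_initial_data=0.01, factor_ingestions=0.1, factor_queries=0.5)
plan = scale_workload(base_rows=1_000_000, base_ingest_rate=500.0,
                      base_query_workers=8, factors=smoke)
```

Because each factor touches one dimension only, a run can stress ingestion without also inflating the seed data or the query load.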

Version Comparison

The framework supports side-by-side benchmarking of different Materialize versions. The benchmark() function in executor.py runs the same workload against two versions, collecting Docker resource statistics (CPU, memory) and query performance metrics, then generates comparison plots using matplotlib.
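The comparison step can be sketched without Docker or matplotlib. The code below is not the benchmark() function from executor.py; it is a simplified illustration that assumes each run has already been reduced to lists of samples per metric, and the metric names and values are made up.

```python
from statistics import mean


def relative_change(old_samples: list, new_samples: list) -> float:
    """Fractional change of the mean, new relative to old (assumes a nonzero baseline)."""
    old_avg = mean(old_samples)
    new_avg = mean(new_samples)
    return (new_avg - old_avg) / old_avg


def compare_runs(old_run: dict, new_run: dict) -> dict:
    """Compare only the metrics that both runs collected."""
    shared = sorted(old_run.keys() & new_run.keys())
    return {metric: relative_change(old_run[metric], new_run[metric]) for metric in shared}


# Made-up samples for two versions running the same workload.
old_run = {"cpu_percent": [40.0, 50.0, 60.0], "memory_mib": [1000.0, 1000.0], "query_ms": [10.0, 12.0]}
new_run = {"cpu_percent": [44.0, 55.0, 66.0], "memory_mib": [900.0, 900.0], "query_ms": [10.0, 12.0]}
report = compare_runs(old_run, new_run)
```

A positive value means the newer version used more of that resource or took longer; whether a given change counts as a regression is a threshold decision left to the reader.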

Anonymization

Production workload captures can be anonymized using mz-workload-anonymize to strip sensitive identifiers and literals while preserving the structural and statistical properties needed for accurate replay. This enables sharing workloads across teams without exposing customer data.
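The idea of removing identifiers and literals while keeping structure can be sketched as follows. This does not reflect how mz-workload-anonymize is implemented or invoked; the IdentifierMap class, the regular expression, and the placeholder text are all assumptions made for illustration.

```python
import re


class IdentifierMap:
    """Assign each original name a stable, meaningless replacement."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self._names: dict = {}

    def rename(self, original: str) -> str:
        # The same input always maps to the same output, so references
        # between objects still line up after anonymization.
        if original not in self._names:
            self._names[original] = f"{self.prefix}_{len(self._names) + 1}"
        return self._names[original]


# Matches a single-quoted SQL string literal, including doubled '' escapes.
_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")


def redact_literals(sql: str) -> str:
    """Replace every string literal with a fixed placeholder (hypothetical marker)."""
    return _STRING_LITERAL.sub("'<redacted>'", sql)


tables = IdentifierMap("table")
first = tables.rename("customer_orders")
second = tables.rename("customer_orders")
redacted = redact_literals("SELECT * FROM t WHERE name = 'O''Brien' AND city = 'Oslo'")
```

The query keeps its shape (two equality predicates joined by AND), which is what replay needs, while the literal values are gone.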

Constraints

Workload YAML files must conform to the mz_workload_version: "1.0.0" format.
Cluster replica sizes must match those defined in the cluster_replica_sizes configuration dictionary.
The Column class supports a fixed set of SQL types; an unsupported type raises ValueError.
Redacted values in captured queries are replaced with NULL during replay.
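A pre-flight check for the first three constraints might look like the sketch below. It takes an already-parsed workload as a dictionary; the key names ("clusters", "replica_size", "columns", "type"), the example size names, and the example type list are assumptions, since the actual YAML layout and the contents of cluster_replica_sizes are not described on this page.

```python
SUPPORTED_WORKLOAD_VERSION = "1.0.0"

# Hypothetical stand-ins: the real size dictionary and type list live in the framework.
EXAMPLE_CLUSTER_REPLICA_SIZES = {"size-small": {}, "size-large": {}}
EXAMPLE_SUPPORTED_TYPES = {"integer", "bigint", "text", "boolean", "timestamp"}


def check_workload(workload: dict) -> None:
    """Raise ValueError on the first violated constraint."""
    version = workload.get("mz_workload_version")
    if version != SUPPORTED_WORKLOAD_VERSION:
        raise ValueError(f"unsupported mz_workload_version: {version!r}")

    # Assumed layout: a mapping of cluster name to its settings.
    for name, cluster in workload.get("clusters", {}).items():
        size = cluster.get("replica_size")
        if size not in EXAMPLE_CLUSTER_REPLICA_SIZES:
            raise ValueError(f"cluster {name!r}: unknown replica size {size!r}")

    # Assumed layout: a flat list of column descriptions.
    for column in workload.get("columns", []):
        if column["type"] not in EXAMPLE_SUPPORTED_TYPES:
            raise ValueError(f"column {column['name']!r}: unsupported type {column['type']!r}")


example = {
    "mz_workload_version": "1.0.0",
    "clusters": {"quickstart": {"replica_size": "size-small"}},
    "columns": [{"name": "id", "type": "bigint"}],
}
check_workload(example)  # passes silently when every constraint holds
```

Failing fast on these checks keeps a malformed capture from surfacing later as a confusing replay error.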
