Principle:MaterializeInc Materialize Workload Replay
| Knowledge Sources | |
|---|---|
| Domains | Testing, Benchmarking, Quality Assurance |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Workload Replay is the practice of capturing real production workload patterns and replaying them against test environments to validate correctness, performance, and resource consumption across Materialize versions.
Motivation
Synthetic benchmarks and unit tests cannot fully represent the complexity and diversity of production workloads. Real customer environments involve heterogeneous data sources (Kafka, PostgreSQL, MySQL, SQL Server, webhooks), complex query patterns with varying concurrency, and data distributions that follow long-tail patterns rather than uniform random distributions. Workload Replay bridges this gap by enabling reproducible, production-representative testing.
Key Tenets
Production Fidelity
Captured workloads preserve the structural characteristics of production environments: the exact schema definitions, source types, connection configurations, cluster sizes, and query patterns. Synthetic data generation uses long-tail distributions (long_tail_int, long_tail_text, long_tail_choice) that mimic real-world data skew rather than uniform random sampling.
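To illustrate why the shape of the distribution matters, the sketch below contrasts a heavy-tailed sampler with a uniform one. The function sketch_long_tail_int is a hypothetical stand-in written for this example. It is not the framework's long_tail_int, whose signature and exact distribution are not described here, and the use of a Pareto draw is an assumption.

```python
import random
from collections import Counter


def sketch_long_tail_int(rng: random.Random, lo: int, hi: int, alpha: float = 1.5) -> int:
    """Hypothetical stand-in (not the framework's long_tail_int).

    Draws an integer in [lo, hi] that is heavily skewed toward lo.
    """
    # paretovariate() returns a float >= 1.0 with a heavy right tail.
    offset = int(rng.paretovariate(alpha)) - 1
    # Clamp rare, very large draws to the upper bound.
    return min(lo + offset, hi)


rng = random.Random(42)  # fixed seed so runs are repeatable
skewed = Counter(sketch_long_tail_int(rng, 0, 99) for _ in range(10_000))
uniform = Counter(rng.randint(0, 99) for _ in range(10_000))

# A few values dominate the skewed sample; the uniform sample stays flat.
print("skewed top 3: ", skewed.most_common(3))
print("uniform top 3:", uniform.most_common(3))
```

In the skewed sample a handful of values account for most rows, which is the kind of hot-key behavior that uniform sampling never exercises.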
Configurable Scaling
Workload parameters are independently scalable through factor multipliers: factor_initial_data controls the volume of seed data, factor_ingestions controls the continuous ingestion rate, and factor_queries controls query concurrency. This allows testing at various scales from smoke tests to full-scale production simulations.
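A minimal sketch of independent scaling follows. Only the three factor names come from the text above. The WorkloadSize fields, the apply_factors helper, and the sample numbers are hypothetical and exist only to show that each factor affects one dimension without touching the others.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleFactors:
    """The three multipliers named above; 1.0 means 'as captured'."""
    factor_initial_data: float = 1.0
    factor_ingestions: float = 1.0
    factor_queries: float = 1.0


@dataclass(frozen=True)
class WorkloadSize:
    """Hypothetical summary of a captured workload's size."""
    initial_rows: int
    ingest_rows_per_sec: float
    concurrent_queries: int


def apply_factors(captured: WorkloadSize, f: ScaleFactors) -> WorkloadSize:
    """Scale each dimension independently of the others."""
    return WorkloadSize(
        initial_rows=round(captured.initial_rows * f.factor_initial_data),
        ingest_rows_per_sec=captured.ingest_rows_per_sec * f.factor_ingestions,
        concurrent_queries=round(captured.concurrent_queries * f.factor_queries),
    )


# Made-up numbers, for illustration only.
captured = WorkloadSize(initial_rows=1_000_000, ingest_rows_per_sec=500.0, concurrent_queries=40)
smoke = apply_factors(captured, ScaleFactors(0.01, 0.1, 0.25))
print(smoke)
```

Leaving every factor at 1.0 reproduces the captured size, while small factors give a quick smoke-test configuration.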
Version Comparison
The framework supports side-by-side benchmarking of different Materialize versions. The benchmark() function in executor.py runs the same workload against two versions, collecting Docker resource statistics (CPU, memory) and query performance metrics, then generates comparison plots using matplotlib.
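The comparison step can be sketched as follows. This is not the benchmark() function itself, whose signature and output format are not shown here. The sketch assumes the samples for both versions have already been collected, uses hypothetical metric names and made-up numbers, and leaves out plotting.

```python
from statistics import median


def compare_versions(baseline: dict[str, list[float]],
                     candidate: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Compare per-metric medians for two runs of the same workload."""
    report = {}
    # Only metrics collected for both versions can be compared.
    for metric in sorted(baseline.keys() & candidate.keys()):
        old = median(baseline[metric])
        new = median(candidate[metric])
        report[metric] = {
            "baseline": old,
            "candidate": new,
            # Positive means the candidate's median is higher.
            "relative_change": (new - old) / old if old else float("nan"),
        }
    return report


# Made-up samples for illustration only; metric names are hypothetical.
old_run = {"query_latency_ms": [100.0, 110.0, 120.0], "memory_mib": [2000.0, 2100.0, 2050.0]}
new_run = {"query_latency_ms": [88.0, 99.0, 121.0], "memory_mib": [2200.0, 2255.0, 2300.0]}

for metric, row in compare_versions(old_run, new_run).items():
    print(f"{metric}: {row['baseline']} -> {row['candidate']} ({row['relative_change']:+.1%})")
```

Medians are used here because a few slow queries can distort a mean; the choice of summary statistic is a design decision of this sketch, not a statement about the framework.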
Anonymization
Production workload captures can be anonymized using mz-workload-anonymize to strip sensitive identifiers and literals while preserving the structural and statistical properties needed for accurate replay. This enables sharing workloads across teams without exposing customer data.
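The idea of removing sensitive values while keeping structure can be sketched as below. This does not reproduce mz-workload-anonymize, whose options and algorithm are not described here. The anonymize_sql helper, the placeholder text, and the ident_N naming scheme are all hypothetical. The point is only that repeated references to one identifier map to the same pseudonym, so the shape of the query survives.

```python
import re

# Matches single-quoted SQL string literals, including '' escapes.
STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")


def anonymize_sql(sql: str, sensitive_names: list[str], mapping: dict[str, str]) -> str:
    """Hypothetical sketch: replace literals and listed identifiers."""
    # 1. Drop literal contents but keep a literal in the same position.
    sql = STRING_LITERAL.sub("'<literal>'", sql)
    if not sensitive_names:
        return sql

    # 2. Replace each sensitive identifier with a stable pseudonym so that
    #    repeated references still point at the same object.
    names = sorted(sensitive_names, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, names)) + r")\b")

    def pseudonym(match: re.Match) -> str:
        name = match.group(0)
        if name not in mapping:
            mapping[name] = f"ident_{len(mapping) + 1}"
        return mapping[name]

    return pattern.sub(pseudonym, sql)


mapping: dict[str, str] = {}
query = "SELECT email FROM acme_customers WHERE email = 'bob@example.com'"
print(anonymize_sql(query, ["acme_customers", "email"], mapping))
```

Passing the same mapping dictionary across calls keeps pseudonyms consistent over a whole workload, which is what allows joins and repeated queries to stay recognizable after anonymization.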
Constraints
- Workload YAML files must conform to the mz_workload_version: "1.0.0" format
- Cluster replica sizes must match those defined in the cluster_replica_sizes configuration dictionary
- The Column class supports a fixed set of SQL types; unsupported types will raise ValueError
- Redacted values in captured queries are replaced with NULL during replay
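The first three constraints can be sketched as pre-flight checks. Everything below is illustrative: the workload layout (a "clusters" list with "name" and "size" keys), the contents of cluster_replica_sizes, the supported type set, and the SketchColumn class are hypothetical stand-ins, not the framework's actual definitions.

```python
SUPPORTED_WORKLOAD_VERSION = "1.0.0"

# Hypothetical stand-ins; the real values are defined by the framework.
cluster_replica_sizes = {"size-a": {}, "size-b": {}}
SUPPORTED_SQL_TYPES = {"integer", "text", "boolean"}


class SketchColumn:
    """Hypothetical miniature of a column definition."""

    def __init__(self, name: str, sql_type: str) -> None:
        # Reject types outside the fixed supported set.
        if sql_type.lower() not in SUPPORTED_SQL_TYPES:
            raise ValueError(f"unsupported SQL type: {sql_type!r}")
        self.name = name
        self.sql_type = sql_type.lower()


def check_workload(workload: dict) -> list[str]:
    """Return a list of human-readable problems; empty means OK."""
    problems = []
    version = workload.get("mz_workload_version")
    if version != SUPPORTED_WORKLOAD_VERSION:
        problems.append(f"unsupported mz_workload_version: {version!r}")
    # The 'clusters' layout is an assumption made for this sketch.
    for cluster in workload.get("clusters", []):
        size = cluster.get("size")
        if size not in cluster_replica_sizes:
            problems.append(f"cluster {cluster.get('name')!r}: unknown size {size!r}")
    return problems


workload = {
    "mz_workload_version": "1.0.0",
    "clusters": [{"name": "quickstart", "size": "size-a"}],
}
print(check_workload(workload))
```

Collecting problems into a list instead of failing on the first one lets a single run report every issue in a workload file.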