Implementation: Interpretml Interpret Synthetic Dataset Generation
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Interpretability |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
make_synthetic is a utility function that generates synthetic datasets with known ground-truth feature effects (including main effects, pairwise interactions, and three-way interactions) for testing and benchmarking interpretable machine learning models.
Description
This module provides a comprehensive synthetic data generation framework designed specifically for testing EBM (Explainable Boosting Machine) and other interpretable models:
- make_synthetic: The primary function that generates a dataset with 10 features (8 continuous, 2 categorical) and a response variable. The response is constructed from known additive terms:
- Main effects: cosine, sine, parabola, linear integer, square wave, sawtooth wave, and exponential transformations on individual features, plus categorical feature effects
- Pairwise interactions: XOR-like interaction, multiplication of continuous features, multiplication of continuous with categorical
- Three-way interaction: Product of continuous, integer, and categorical features
- Unused feature: Feature 7 is deliberately unused to test whether models correctly identify irrelevant features
- For classification tasks, the continuous response is converted to class labels via a logistic transformation with optional multiclass support
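The classification conversion described above can be illustrated with a NumPy-only sketch. This is a simplified approximation of the idea (a logistic link turning the continuous response into class probabilities), not the library's exact implementation; the `response` array here is a hypothetical stand-in for the additive response make_synthetic builds internally.

```python
import numpy as np

# Hypothetical stand-in for the continuous additive response
# (make_synthetic also adds Gaussian noise before this step).
rng = np.random.default_rng(0)
response = rng.normal(size=1000)

# Binary case: squash the response through a logistic (sigmoid)
# to get P(class_1), then sample labels from those probabilities.
prob_class_1 = 1.0 / (1.0 + np.exp(-response))
labels = (rng.uniform(size=len(response)) < prob_class_1).astype(int)
```

Samples with a strongly positive response are thus almost always labeled 1, while responses near zero produce roughly balanced labels.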
The module also includes helper functions:
- _make_synthetic_features: Generates the feature matrix with various distributions (uniform, normal, exponential, Poisson, correlated features)
- _normalize_categoricals: Converts categorical features to numeric for response generation
- _check_synthetic_dataset: Diagnostic function for inspecting generated data
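The idea behind `_normalize_categoricals` can be sketched in plain NumPy: map each distinct category string to an integer code so the response generation can treat the column numerically. The actual helper may differ in details (ordering, missing-value handling); the column values below are hypothetical.

```python
import numpy as np

# Hypothetical categorical column of string codes.
col = np.array(["cat_002", "cat_000", "cat_001", "cat_000"])

# np.unique returns the sorted distinct values plus, with
# return_inverse=True, an integer code for every element.
uniques, numeric = np.unique(col, return_inverse=True)
# numeric is now array([2, 0, 1, 0])
```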
Usage
Use make_synthetic when you need a dataset with known ground-truth effects for validating that an interpretable model correctly recovers feature shapes and interactions, and correctly ignores unused features. The function supports multiple output formats, including numpy arrays, pandas DataFrames, and scipy sparse matrices.
Code Reference
Source Location
- Repository: Interpretml_Interpret
- File:
python/interpret-core/interpret/utils/_synthetic.py
Signature
def make_synthetic(
    classes=("class_0", "class_1"),
    n_samples=1000,
    missing=False,
    seed=None,
    output_type="object",
    noise_scale=0.25,
    base_shift=0.0,
    higher_class_probs=None,
    impute_missing=0.0,
    disable=None,
    categories=(9, 46),
    categorical_floor=(0.2, 0.01),
    categorical_digits=3,
    clip_low=-2.0,
    clip_high=2.0,
):
Import
from interpret.utils._synthetic import make_synthetic
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| classes | tuple/list, int, or None | No | Class names for classification; None or 0 for regression (default ("class_0", "class_1")) |
| n_samples | int | No | Number of samples to generate (default 1000) |
| missing | bool or float | No | Fraction of values to make missing: False for none, True for 10%, or a float in [0.0, 1.0] |
| seed | int | No | Random seed for reproducibility |
| output_type | str | No | Output format: "object", "float", "str", "pandas", "csc_matrix", or "csc_array" |
| noise_scale | float | No | Standard deviation of Gaussian noise added to the response (default 0.25) |
| base_shift | float | No | Mean of the base Gaussian noise (default 0.0) |
| higher_class_probs | array-like | No | Probabilities for multiclass selection above the 0th class |
| impute_missing | float | No | Value used to impute missing values in the response generation (default 0.0) |
| disable | list of str | No | List of effect names to disable (e.g. "cos", "sin", "mains", "pairs", "triples") |
| categories | tuple of int | No | Number of categories for low and high cardinality categorical features (default (9, 46)) |
| categorical_floor | tuple of float | No | Minimum probability floor for categorical distributions (default (0.2, 0.01)) |
| categorical_digits | int | No | Number of digits for categorical value encoding (default 3) |
| clip_low | float | No | Lower bound for feature clipping (default -2.0) |
| clip_high | float | No | Upper bound for feature clipping (default 2.0) |
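The `missing` parameter's semantics can be illustrated with a NumPy-only sketch that sets a random fraction of entries to NaN. This is an illustration of the documented behavior (False for none, True for 10%, or an explicit fraction), not the library's actual masking code; `inject_missing` is a hypothetical helper.

```python
import numpy as np

def inject_missing(X, missing, seed=0):
    """Set a random fraction of entries in X to NaN (illustrative only)."""
    if missing is False:
        return X
    # True is documented as shorthand for a 10% missing fraction.
    frac = 0.1 if missing is True else float(missing)
    rng = np.random.default_rng(seed)
    X = X.astype(float, copy=True)
    mask = rng.uniform(size=X.shape) < frac
    X[mask] = np.nan
    return X

X = np.ones((1000, 10))
X_missing = inject_missing(X, missing=True)  # roughly 10% NaN
```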
Outputs
| Name | Type | Description |
|---|---|---|
| X_orig | numpy array, pandas DataFrame, or scipy sparse matrix | Feature matrix in the requested output format |
| y | numpy array | Response vector (continuous for regression, class labels for classification) |
| names | list of str | Feature names |
| types | list of str | Feature types ("continuous" or "nominal") |
Usage Examples
Binary Classification Dataset
from interpret.utils._synthetic import make_synthetic
X, y, names, types = make_synthetic(
    classes=("negative", "positive"),
    n_samples=500,
    seed=42,
    output_type="pandas",
)
print(names)   # Feature names
print(types)   # Feature types
print(y[:10])  # First 10 labels
Regression Dataset with Missing Values
from interpret.utils._synthetic import make_synthetic
X, y, names, types = make_synthetic(
    classes=None,    # regression mode
    n_samples=1000,
    missing=0.1,     # 10% missing values
    seed=123,
    output_type="float",
)
Disabling Specific Effects
from interpret.utils._synthetic import make_synthetic
# Only main effects, no interactions
X, y, names, types = make_synthetic(
    disable=["pairs", "triples"],
    seed=42,
)