Implementation: Interpretml Interpret Synthetic Dataset Generation
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Interpretability |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
make_synthetic is a utility function that generates synthetic datasets with known ground-truth feature effects (including main effects, pairwise interactions, and three-way interactions) for testing and benchmarking interpretable machine learning models.
Description
This module provides a comprehensive synthetic data generation framework designed specifically for testing EBM (Explainable Boosting Machine) and other interpretable models:
- make_synthetic: The primary function that generates a dataset with 10 features (8 continuous, 2 categorical) and a response variable. The response is constructed from known additive terms:
- Main effects: cosine, sine, parabola, linear integer, square wave, sawtooth wave, and exponential transformations on individual features, plus categorical feature effects
- Pairwise interactions: XOR-like interaction, multiplication of continuous features, multiplication of continuous with categorical
- Three-way interaction: Product of continuous, integer, and categorical features
- Unused feature: Feature 7 is deliberately unused to test whether models correctly identify irrelevant features
- For classification tasks, the continuous response is converted to class labels via a logistic transformation with optional multiclass support
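The classification conversion described above can be illustrated with a NumPy-only sketch. This is a simplified approximation of the idea (a logistic link turning the continuous response into class probabilities), not the library's exact implementation; the `response` array here is a hypothetical stand-in for the additive response make_synthetic builds internally.

```python
import numpy as np

# Hypothetical stand-in for the continuous additive response
# (make_synthetic also adds Gaussian noise before this step).
rng = np.random.default_rng(0)
response = rng.normal(size=1000)

# Binary case: squash the response through a logistic (sigmoid)
# to get P(class_1), then sample labels from those probabilities.
prob_class_1 = 1.0 / (1.0 + np.exp(-response))
labels = (rng.uniform(size=len(response)) < prob_class_1).astype(int)
```

Samples with a strongly positive response are thus almost always labeled 1, while responses near zero produce roughly balanced labels.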
The module also includes helper functions:
- _make_synthetic_features: Generates the feature matrix with various distributions (uniform, normal, exponential, Poisson, correlated features)
- _normalize_categoricals: Converts categorical features to numeric for response generation
- _check_synthetic_dataset: Diagnostic function for inspecting generated data
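The idea behind `_normalize_categoricals` can be sketched in plain NumPy: map each distinct category string to an integer code so the response generation can treat the column numerically. The actual helper may differ in details (ordering, missing-value handling); the column values below are hypothetical.

```python
import numpy as np

# Hypothetical categorical column of string codes.
col = np.array(["cat_002", "cat_000", "cat_001", "cat_000"])

# np.unique returns the sorted distinct values plus, with
# return_inverse=True, an integer code for every element.
uniques, numeric = np.unique(col, return_inverse=True)
# numeric is now array([2, 0, 1, 0])
```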
Usage
Use make_synthetic when you need a dataset with known ground-truth effects for validating that an interpretable model correctly recovers feature shapes and interactions, and correctly ignores unused features. The function supports multiple output formats, including numpy arrays, pandas DataFrames, and scipy sparse matrices.
Code Reference
Source Location
- Repository: Interpretml_Interpret
- File:
python/interpret-core/interpret/utils/_synthetic.py
Signature
def make_synthetic(
    classes=("class_0", "class_1"),
    n_samples=1000,
    missing=False,
    seed=None,
    output_type="object",
    noise_scale=0.25,
    base_shift=0.0,
    higher_class_probs=None,
    impute_missing=0.0,
    disable=None,
    categories=(9, 46),
    categorical_floor=(0.2, 0.01),
    categorical_digits=3,
    clip_low=-2.0,
    clip_high=2.0,
):
Import
from interpret.utils._synthetic import make_synthetic
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| classes | tuple/list, int, or None | No | Class names for classification; None or 0 for regression (default ("class_0", "class_1")) |
| n_samples | int | No | Number of samples to generate (default 1000) |
| missing | bool or float | No | Fraction of values to make missing: False for none, True for 10%, or a float in [0.0, 1.0] |
| seed | int | No | Random seed for reproducibility |
| output_type | str | No | Output format: "object", "float", "str", "pandas", "csc_matrix", or "csc_array" |
| noise_scale | float | No | Standard deviation of Gaussian noise added to the response (default 0.25) |
| base_shift | float | No | Mean of the base Gaussian noise (default 0.0) |
| higher_class_probs | array-like | No | Probabilities for multiclass selection above the 0th class |
| impute_missing | float | No | Value used to impute missing values in the response generation (default 0.0) |
| disable | list of str | No | List of effect names to disable (e.g. "cos", "sin", "mains", "pairs", "triples") |
| categories | tuple of int | No | Number of categories for low and high cardinality categorical features (default (9, 46)) |
| categorical_floor | tuple of float | No | Minimum probability floor for categorical distributions (default (0.2, 0.01)) |
| categorical_digits | int | No | Number of digits for categorical value encoding (default 3) |
| clip_low | float | No | Lower bound for feature clipping (default -2.0) |
| clip_high | float | No | Upper bound for feature clipping (default 2.0) |
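The `missing` parameter's semantics can be illustrated with a NumPy-only sketch that sets a random fraction of entries to NaN. This is an illustration of the documented behavior (False for none, True for 10%, or an explicit fraction), not the library's actual masking code; `inject_missing` is a hypothetical helper.

```python
import numpy as np

def inject_missing(X, missing, seed=0):
    """Set a random fraction of entries in X to NaN (illustrative only)."""
    if missing is False:
        return X
    # True is documented as shorthand for a 10% missing fraction.
    frac = 0.1 if missing is True else float(missing)
    rng = np.random.default_rng(seed)
    X = X.astype(float, copy=True)
    mask = rng.uniform(size=X.shape) < frac
    X[mask] = np.nan
    return X

X = np.ones((1000, 10))
X_missing = inject_missing(X, missing=True)  # roughly 10% NaN
```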
Outputs
| Name | Type | Description |
|---|---|---|
| X_orig | numpy array, pandas DataFrame, or scipy sparse matrix | Feature matrix in the requested output format |
| y | numpy array | Response vector (continuous for regression, class labels for classification) |
| names | list of str | Feature names |
| types | list of str | Feature types ("continuous" or "nominal") |
Usage Examples
Binary Classification Dataset
from interpret.utils._synthetic import make_synthetic
X, y, names, types = make_synthetic(
    classes=("negative", "positive"),
    n_samples=500,
    seed=42,
    output_type="pandas",
)
print(names)   # Feature names
print(types)   # Feature types
print(y[:10])  # First 10 labels
Regression Dataset with Missing Values
from interpret.utils._synthetic import make_synthetic
X, y, names, types = make_synthetic(
    classes=None,    # regression mode
    n_samples=1000,
    missing=0.1,     # 10% missing values
    seed=123,
    output_type="float",
)
Disabling Specific Effects
from interpret.utils._synthetic import make_synthetic
# Only main effects, no interactions
X, y, names, types = make_synthetic(
    disable=["pairs", "triples"],
    seed=42,
)