
Implementation:Interpretml Interpret Synthetic Dataset Generation

From Leeroopedia


Knowledge Sources
Domains: Machine_Learning, Interpretability
Last Updated: 2026-02-07 12:00 GMT

Overview

make_synthetic is a utility function that generates synthetic datasets with known ground-truth feature effects (including main effects, pairwise interactions, and three-way interactions) for testing and benchmarking interpretable machine learning models.

Description

This module provides a comprehensive synthetic data generation framework designed specifically for testing EBM (Explainable Boosting Machine) and other interpretable models:

  • make_synthetic: The primary function that generates a dataset with 10 features (8 continuous, 2 categorical) and a response variable. The response is constructed from known additive terms:
    • Main effects: cosine, sine, parabola, linear integer, square wave, sawtooth wave, and exponential transformations on individual features, plus categorical feature effects
    • Pairwise interactions: XOR-like interaction, multiplication of continuous features, multiplication of continuous with categorical
    • Three-way interaction: Product of continuous, integer, and categorical features
    • Unused feature: Feature 7 is deliberately unused to test whether models correctly identify irrelevant features
    • For classification tasks, the continuous response is converted to class labels via a logistic transformation with optional multiclass support
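The construction above can be sketched in plain NumPy. This is an illustrative simplification, not the library's actual code: the choice of three features, the specific transforms, and the 0.25 noise scale are assumptions modeled on the effects and defaults described in this article.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Three continuous features plus one deliberately unused feature
x0, x1, x2, x_unused = rng.uniform(-2.0, 2.0, size=(4, n))

# Additive ground truth: main effects, a pairwise interaction, Gaussian noise
g = (
    np.cos(np.pi * x0)            # cosine main effect
    + np.sin(np.pi * x1)          # sine main effect
    + x0 * x1                     # pairwise interaction
    + 0.25 * rng.normal(size=n)   # noise_scale-style Gaussian noise
)

# Logistic transformation of the continuous response into binary labels
p = 1.0 / (1.0 + np.exp(-g))
y = (rng.uniform(size=n) < p).astype(int)
```

Note that x_unused and x2 contribute nothing to g, mirroring how feature 7 is held out so a model's treatment of irrelevant features can be checked.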

The module also includes helper functions:

  • _make_synthetic_features: Generates the feature matrix with various distributions (uniform, normal, exponential, Poisson, correlated features)
  • _normalize_categoricals: Converts categorical features to numeric for response generation
  • _check_synthetic_dataset: Diagnostic function for inspecting generated data

Usage

Use make_synthetic when you need a dataset with known ground-truth effects, for example to validate that an interpretable model recovers the correct feature shapes and interactions and assigns no importance to the unused feature. The function supports multiple output formats, including NumPy arrays, pandas DataFrames, and SciPy sparse matrices.

Code Reference

Signature

def make_synthetic(
    classes=("class_0", "class_1"),
    n_samples=1000,
    missing=False,
    seed=None,
    output_type="object",
    noise_scale=0.25,
    base_shift=0.0,
    higher_class_probs=None,
    impute_missing=0.0,
    disable=None,
    categories=(9, 46),
    categorical_floor=(0.2, 0.01),
    categorical_digits=3,
    clip_low=-2.0,
    clip_high=2.0,
):

Import

from interpret.utils._synthetic import make_synthetic

I/O Contract

Inputs

Name Type Required Description
classes tuple/list or int or None No Class names for classification; None or 0 for regression (default ("class_0", "class_1"))
n_samples int No Number of samples to generate (default 1000)
missing bool or float No Whether/how much data to make missing; False for no missing, True for 10%, or a float 0.0-1.0
seed int No Random seed for reproducibility
output_type str No Output format: "object", "float", "str", "pandas", "csc_matrix", or "csc_array"
noise_scale float No Standard deviation of Gaussian noise added to the response (default 0.25)
base_shift float No Mean of the base Gaussian noise (default 0.0)
higher_class_probs array-like No Probabilities for multiclass selection above the 0th class
impute_missing float No Value used to impute missing values in the response generation (default 0.0)
disable list of str No List of effect names to disable (e.g. "cos", "sin", "mains", "pairs", "triples")
categories tuple of int No Number of categories for low and high cardinality categorical features (default (9, 46))
categorical_floor tuple of float No Minimum probability floor for categorical distributions (default (0.2, 0.01))
categorical_digits int No Number of digits for categorical value encoding (default 3)
clip_low float No Lower bound for feature clipping (default -2.0)
clip_high float No Upper bound for feature clipping (default 2.0)
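To illustrate the missing parameter, the masking step can be sketched as follows. This is a conceptual sketch only; the library's internal handling of missing cells (and whether it uses None or NaN) may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10)).astype(object)

missing_rate = 0.1  # e.g. missing=0.1, i.e. roughly 10% of cells
mask = rng.uniform(size=X.shape) < missing_rate
X[mask] = None  # mark the selected cells as missing
```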

Outputs

Name Type Description
X_orig numpy array, pandas DataFrame, or scipy sparse matrix Feature matrix in the requested output format
y numpy array Response vector (continuous for regression, class labels for classification)
names list of str Feature names
types list of str Feature types ("continuous" or "nominal")

Usage Examples

Binary Classification Dataset

from interpret.utils._synthetic import make_synthetic

X, y, names, types = make_synthetic(
    classes=("negative", "positive"),
    n_samples=500,
    seed=42,
    output_type="pandas"
)
print(names)   # Feature names
print(types)   # Feature types
print(y[:10])  # First 10 labels

Regression Dataset with Missing Values

from interpret.utils._synthetic import make_synthetic

X, y, names, types = make_synthetic(
    classes=None,         # regression mode
    n_samples=1000,
    missing=0.1,          # 10% missing
    seed=123,
    output_type="float"
)

Disabling Specific Effects

from interpret.utils._synthetic import make_synthetic

# Only main effects, no interactions
X, y, names, types = make_synthetic(
    disable=["pairs", "triples"],
    seed=42
)
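Conceptually, disable switches off named terms in the additive ground truth before the response is assembled. A toy version of that toggling (the term names and transforms here are hypothetical stand-ins, not the library's internals):

```python
import numpy as np

def toy_response(x0, x1, disable=()):
    # Named additive terms; disabled names are simply left out of the sum
    terms = {
        "cos": np.cos(np.pi * x0),
        "sin": np.sin(np.pi * x1),
        "pairs": x0 * x1,  # pairwise interaction term
    }
    return sum(v for k, v in terms.items() if k not in disable)

x0 = np.linspace(-2.0, 2.0, 5)
x1 = np.linspace(-2.0, 2.0, 5)
full = toy_response(x0, x1)
mains_only = toy_response(x0, x1, disable=("pairs",))
```

With "pairs" disabled, the difference between the two responses is exactly the interaction term, which is the property that makes disable useful for isolating effects in benchmarks.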
