Heuristic:Scikit learn contrib Imbalanced learn Sparse Matrix Handling
| Knowledge Sources | |
|---|---|
| Domains | Imbalanced_Classification, Performance |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
SMOTEN densifies sparse input internally and emits a `DataConversionWarning`, then returns the result in the original CSR/CSC format; SMOTE works on sparse input directly and preserves the sparse format (CSR/CSC) in its output.
Description
Sparse matrix handling varies across SMOTE variants. Standard SMOTE and its derivatives (BorderlineSMOTE, SVMSMOTE, KMeansSMOTE) support sparse input and preserve the original sparse format in the output. However, SMOTEN (for purely categorical data) must convert sparse matrices to dense arrays internally for ordinal encoding and the Value Difference Metric computation, issuing a `DataConversionWarning`. After processing, SMOTEN converts back to the original sparse format (CSR or CSC) if the input was sparse.
Usage
Apply this heuristic when working with sparse feature matrices (common in text classification and one-hot encoded categorical data). If using SMOTEN on sparse data, expect a memory and runtime penalty from the dense conversion. For mixed feature types, consider SMOTE-NC, which is designed around a categorical/numerical split.
The Insight (Rule of Thumb)
- Action: Use standard SMOTE for sparse numerical data (efficient, preserves format). Avoid SMOTEN with large sparse matrices. Use SMOTE-NC for mixed categorical/numerical features.
- Value: SMOTEN dense conversion can significantly increase memory usage for high-dimensional sparse data.
- Trade-off: SMOTEN measures distances between categorical samples with the Value Difference Metric instead of treating category codes as numbers, but it pays for this with memory efficiency on sparse input.
- Format preservation: SMOTEN preserves the original sparse format (CSR or CSC) in the output, so downstream code expecting sparse input will still work.
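To get a feel for the memory cost named in the list above before calling SMOTEN, the dense footprint can be computed from the matrix shape alone, without allocating anything. This is a rough sketch: the matrix dimensions are hypothetical, and `csr_nbytes` and `dense_nbytes` are helper names introduced here for illustration.

```python
import numpy as np
from scipy import sparse


def csr_nbytes(X):
    """Bytes held by the three arrays that back a CSR matrix."""
    return X.data.nbytes + X.indices.nbytes + X.indptr.nbytes


def dense_nbytes(X):
    """Bytes a dense array of the same shape and dtype would occupy."""
    n_rows, n_cols = X.shape
    return n_rows * n_cols * X.dtype.itemsize


# Hypothetical one-hot column block: 10,000 rows, 5,000 categories,
# exactly one non-zero per row.
n_rows, n_cols = 10_000, 5_000
rng = np.random.RandomState(0)
rows = np.arange(n_rows)
cols = rng.randint(0, n_cols, size=n_rows)
data = np.ones(n_rows, dtype=np.float64)
X = sparse.csr_matrix((data, (rows, cols)), shape=(n_rows, n_cols))

sparse_mb = csr_nbytes(X) / 1e6
dense_mb = dense_nbytes(X) / 1e6  # derived from the shape; nothing is allocated
print(f"CSR storage:      {sparse_mb:.2f} MB")
print(f"Dense equivalent: {dense_mb:.0f} MB")
```

A check like this only estimates the input conversion; the resampled output and any intermediate arrays add to the total.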
Reasoning
SMOTEN requires ordinal encoding and the Value Difference Metric for categorical distance computation. Both operations need dense array access, which is why the input is densified. Standard SMOTE operates on numerical features and uses Euclidean distance in the nearest-neighbor search, which scikit-learn's `NearestNeighbors` accepts on sparse input, so no conversion is needed.
Additionally, SMOTE-NC requires both categorical and numerical features. Passing only numerical features (no categorical columns) to SMOTE-NC raises a `ValueError` ("SMOTE-NC is not designed to work only with numerical features. It requires some categorical features."), and passing only categorical features is also rejected. For purely categorical datasets, SMOTEN is therefore the SMOTE variant to use, despite the sparse conversion cost.
Code Evidence
SMOTEN sparse conversion warning from `imblearn/over_sampling/_smote/base.py:923-935`:
def _fit_resample(self, X, y):
    if sparse.issparse(X):
        X_sparse_format = X.format
        X = X.toarray()
        warnings.warn(
            (
                "Passing a sparse matrix to SMOTEN is not really efficient since it"
                " is converted to a dense array internally."
            ),
            DataConversionWarning,
        )
    else:
        X_sparse_format = None
Sparse format restoration from `imblearn/over_sampling/_smote/base.py:974-979`:
if X_sparse_format == "csr":
    return sparse.csr_matrix(X_resampled), y_resampled
elif X_sparse_format == "csc":
    return sparse.csc_matrix(X_resampled), y_resampled
else:
    return X_resampled, y_resampled
SMOTE-NC mixed feature requirement from `imblearn/over_sampling/_smote/base.py:585-592`:
elif self.categorical_features_.size == 0:
    raise ValueError(
        "SMOTE-NC is not designed to work only with numerical "
        "features. It requires some categorical features."
    )