Principle: Gretel.ai gretel-synthetics Column Clustering
| Knowledge Sources | |
|---|---|
| Domains | Synthetic_Data, Tabular_Data |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
Column clustering is the technique of partitioning the columns of a tabular dataset into smaller, correlated groups so that each group can be modeled independently by a generative model.
Description
When generating synthetic tabular data with character-level or token-level language models, feeding an entire wide DataFrame (potentially hundreds of columns) into a single model is impractical. Column clustering addresses this by computing a correlation matrix across all columns, converting it into a distance matrix, and applying hierarchical agglomerative clustering (via scipy) to produce a dendrogram. The dendrogram is then traversed top-down: any sub-cluster whose size exceeds a configurable maxsize threshold is split further, while clusters that fit within the threshold are accepted as a batch.
Two additional heuristics refine the clusters:
- Average record length threshold — even if a cluster has fewer than maxsize columns, its serialised record length may be too long for the model's context window. When the average record length of a candidate cluster exceeds this threshold the cluster is split further.
- Complex field isolation — columns that are highly unique (>85% unique values), long (average length >= 16 characters), and alphanumeric (e.g., UUIDs, hashes) are removed from the correlation analysis and placed into their own single-column batches.
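The complex-field heuristic above can be sketched as a predicate over a pandas column. The thresholds (85% uniqueness, average length >= 16) come from the text; the function name and the exact alphanumeric test (stripping UUID-style dashes before checking) are illustrative assumptions.

```python
import pandas as pd

def is_complex_field(series: pd.Series) -> bool:
    """Return True if a column looks like an opaque identifier (UUID, hash)."""
    values = series.dropna().astype(str)
    if values.empty:
        return False
    unique_ratio = values.nunique() / len(values)   # >85% unique values
    avg_len = values.str.len().mean()               # average length >= 16
    # Alphanumeric check: strip common separators such as the dashes in UUIDs.
    alnum = values.str.replace("-", "", regex=False).str.isalnum().all()
    return bool(unique_ratio > 0.85 and avg_len >= 16 and alnum)
```

A hash-like column (long, unique, alphanumeric) passes the test; a short low-cardinality categorical column does not.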
After clustering, a merge pass walks the ordered leaf clusters and greedily merges adjacent clusters as long as the merged group does not violate the size or record-length constraints.
If no explicit cluster assignments are provided, the DataFrameBatch constructor can fall back to a simpler strategy: splitting the column list into equal-sized chunks of batch_size columns using numpy.array_split.
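The fallback strategy amounts to a one-liner. This sketch assumes a ceiling division so that no chunk exceeds batch_size; the variable names are illustrative.

```python
import math
import numpy as np

columns = [f"col_{i}" for i in range(10)]
batch_size = 4
# numpy.array_split tolerates uneven division (unlike numpy.split);
# ceiling division keeps every chunk at or below batch_size columns.
chunks = np.array_split(columns, math.ceil(len(columns) / batch_size))
batches = [list(chunk) for chunk in chunks]
# Ten columns with batch_size=4 yield three batches of sizes 4, 3, 3.
```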
Usage
Use column clustering when:
- The source DataFrame has more columns than a single language model can handle effectively.
- Columns exhibit natural correlation groups that should be kept together to preserve inter-column relationships.
- Certain columns contain complex identifiers that would confuse a shared model and benefit from isolation.
Theoretical Basis
The clustering pipeline rests on the following steps:
1. Correlation computation
A mixed-type correlation matrix is computed using Cramér's V for categorical pairs, the correlation ratio for categorical-numeric pairs, and Pearson correlation for numeric pairs. The result is an n × n symmetric matrix C where C[i][j] is in [0, 1].
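A compact sketch of such a mixed-type correlation matrix follows. The pairing rules (Cramér's V, correlation ratio, Pearson) come from the text; the helper names and the use of scipy.stats.chi2_contingency are illustrative choices, not the library's actual implementation.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Association between two categorical columns, in [0, 1]."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    denom = n * (min(r, k) - 1)
    return float(np.sqrt(chi2 / denom)) if denom else 0.0

def correlation_ratio(cat: pd.Series, num: pd.Series) -> float:
    """Eta: how much of the numeric variance the categories explain, in [0, 1]."""
    overall = num.mean()
    between = sum(g.size * (g.mean() - overall) ** 2 for _, g in num.groupby(cat))
    total = ((num - overall) ** 2).sum()
    return float(np.sqrt(between / total)) if total else 0.0

def mixed_corr_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Symmetric matrix with 1.0 on the diagonal, picking a measure per pair."""
    cols = df.columns
    out = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    numeric = {c for c in cols if pd.api.types.is_numeric_dtype(df[c])}
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in numeric and b in numeric:
                v = abs(df[a].corr(df[b]))                # Pearson, numeric pair
            elif a not in numeric and b not in numeric:
                v = cramers_v(df[a], df[b])               # categorical pair
            else:
                cat, num = (a, b) if a not in numeric else (b, a)
                v = correlation_ratio(df[cat], df[num])   # mixed pair
            out.loc[a, b] = out.loc[b, a] = v
    return out
```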
2. Distance matrix
The distance matrix is derived as:
X = np.array(1 - abs(corr_matrix))
This converts correlation magnitudes into distances suitable for hierarchical clustering.
3. Hierarchical clustering and optimal leaf ordering
L = sch.linkage(X, method=method) # e.g., method="single"
Lopt = sch.optimal_leaf_ordering(L, X) # minimize adjacent-leaf distances
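Steps 2 and 3 can be run end to end with scipy. One caveat worth making explicit: sch.linkage expects a *condensed* distance vector, so the square matrix is converted with squareform first (passing the square matrix directly would be silently treated as raw observations). The toy correlation matrix below is an illustrative assumption.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform

# Toy 3-column correlation matrix: columns 0 and 1 are strongly correlated.
corr_matrix = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
X = 1 - np.abs(corr_matrix)      # distance = 1 - |correlation|
np.fill_diagonal(X, 0.0)         # guard against floating-point residue
condensed = squareform(X, checks=False)

L = sch.linkage(condensed, method="single")
Lopt = sch.optimal_leaf_ordering(L, condensed)
order = sch.leaves_list(Lopt)    # column indices with similar columns adjacent
```

Because columns 0 and 1 merge first (distance 0.1), they end up adjacent in the optimally ordered leaf list.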
4. Top-down traversal
Starting from the root of the dendrogram, each node is checked. If its child cluster size exceeds maxsize or its average record length exceeds the threshold, the child is recursively split. Otherwise the child's leaf set is accepted as a cluster.
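The traversal can be sketched against scipy's tree view of the linkage matrix. For brevity this version checks only the cluster-size constraint; the average-record-length check from the text would slot into the same condition. The function name and recursion shape are illustrative.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

def traverse_top_down(linkage_matrix: np.ndarray, maxsize: int) -> list[list[int]]:
    """Split dendrogram nodes until every accepted cluster fits in maxsize."""
    clusters: list[list[int]] = []

    def visit(node: sch.ClusterNode) -> None:
        if node.get_count() <= maxsize or node.is_leaf():
            clusters.append(node.pre_order())   # accept this subtree's leaves
        else:
            visit(node.get_left())              # too big: split further
            visit(node.get_right())

    visit(sch.to_tree(linkage_matrix))
    return clusters
```

On a dendrogram over four items forming two tight pairs, maxsize=2 splits the root once and accepts each pair as a cluster.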
5. Greedy merge
Adjacent leaf clusters are merged as long as the merged group's column count does not exceed maxsize and its average record length stays below the threshold.
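A sketch of this greedy merge pass: clusters arrive in optimal leaf order, so only neighbours are merge candidates. The fits() callback stands in for the combined size and average-record-length check described above; its name is an illustrative assumption.

```python
def greedy_merge(clusters, fits):
    """Merge neighbouring clusters while the combined group still fits."""
    merged = [list(clusters[0])]
    for cluster in clusters[1:]:
        candidate = merged[-1] + list(cluster)
        if fits(candidate):
            merged[-1] = candidate      # absorb into the previous group
        else:
            merged.append(list(cluster))
    return merged
```

With a pure size constraint of at most three columns, [[0], [1], [2, 3], [4]] merges into [[0, 1], [2, 3, 4]]: the first two singletons combine, the four-column candidate is rejected, and the trailing singleton joins the pair behind it.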
Pseudocode:
function cluster(df, maxsize, arl_threshold, method):
    isolate complex columns into single-column batches
    C = compute_correlation_matrix(df)
    X = to_distance_matrix(C)
    L = hierarchical_linkage(X, method)
    Lopt = optimal_leaf_ordering(L, X)
    raw_clusters = traverse_top_down(Lopt, maxsize, arl_threshold)
    merged = greedy_merge(raw_clusters, maxsize, arl_threshold)
    return merged + single_column_batches