Implementation: Gretelai gretel-synthetics DataFrameBatch Init
| Knowledge Sources | |
|---|---|
| Domains | Synthetic_Data, Tabular_Data |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
Concrete tool, provided by the gretel-synthetics library, for initializing a batch synthesis run and partitioning a DataFrame into column-clustered batches.
Description
The DataFrameBatch.__init__ method is the entry point for the entire batch synthesis workflow. In write mode it accepts a source DataFrame and a configuration template, partitions the columns into batches (either via explicit batch_headers or by automatic equal-size splitting), and creates a directory structure under the checkpoint directory with one sub-directory per batch. Each sub-directory receives its own Batch dataclass instance containing the column headers, a dedicated config, and file paths for training data and model checkpoints.
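When batch_headers is not supplied, the automatic equal-size split amounts to chunking the column list into consecutive groups. A minimal sketch in plain pandas (the helper name split_headers is ours for illustration, not part of the library's API):

```python
import pandas as pd

def split_headers(df: pd.DataFrame, batch_size: int = 15) -> list:
    """Partition DataFrame columns into consecutive groups of at most
    ``batch_size`` columns, mirroring the uniform-split behavior."""
    headers = list(df.columns)
    return [headers[i:i + batch_size] for i in range(0, len(headers), batch_size)]

# 37 columns with batch_size=15 yields groups of 15, 15, and 7 columns
df = pd.DataFrame(columns=[f"col_{i}" for i in range(37)])
batches = split_headers(df, batch_size=15)
print([len(b) for b in batches])  # → [15, 15, 7]
```

Note that this split is purely positional: adjacent columns end up in the same batch regardless of how related they are, which is the motivation for the correlation-based alternative described below.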
In read mode it reconstructs the batch dictionary from a previously written checkpoint directory, loading headers, configs, and any saved validators from disk, and optionally validates the loaded models by running a single-line generation test.
The companion header_clusters.cluster() function provides correlation-based column clustering as an alternative to uniform splitting. It computes a mixed-type correlation matrix, performs hierarchical agglomerative clustering with optimal leaf ordering, and returns a list of column-name lists that can be passed directly as the batch_headers parameter.
Usage
Use DataFrameBatch.__init__ at the start of any batch synthesis pipeline to set up directory structure and partition columns. Use header_clusters.cluster() when you want correlation-aware column grouping rather than naive equal-size splits.
Code Reference
Source Location
- Repository: gretel-synthetics
- File: src/gretel_synthetics/batch.py (lines 952-1127) and src/gretel_synthetics/utils/header_clusters.py (lines 230-342)
Signature
# DataFrameBatch constructor
class DataFrameBatch:
    def __init__(
        self,
        *,
        df: pd.DataFrame = None,
        batch_size: int = BATCH_SIZE,
        batch_headers: List[List[str]] = None,
        config: Union[dict, BaseConfig] = None,
        tokenizer: BaseTokenizerTrainer = None,
        mode: str = WRITE,
        checkpoint_dir: str = None,
        validate_model: bool = True,
    ):
# header_clusters.cluster function
def cluster(
    df: pd.DataFrame,
    header_prefix: List[str] = None,
    maxsize: int = 20,
    average_record_length_threshold: float = 0,
    method: str = "single",
    numeric_cat: List[str] = None,
    plot: bool = False,
    isolate_complex_field: bool = True,
) -> List[List[str]]:
Import
from gretel_synthetics.batch import DataFrameBatch
from gretel_synthetics.utils.header_clusters import cluster
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| df | pd.DataFrame | Yes (write mode) | The source DataFrame to partition into batches. |
| batch_size | int | No (default 15) | Max number of columns per batch when auto-splitting. |
| batch_headers | List[List[str]] | No | Explicit column groupings; overrides batch_size splitting. |
| config | Union[dict, BaseConfig] | Yes | Template config containing at minimum checkpoint_dir and field_delimiter. |
| tokenizer | BaseTokenizerTrainer | No | Optional custom tokenizer trainer class. |
| mode | str | No (default "write") | Either "write" (create new batches) or "read" (load existing). |
| checkpoint_dir | str | No | Required only in read mode when config is not a dict. |
| validate_model | bool | No (default True) | In read mode, run a generation test to validate loaded models. |
header_clusters.cluster() inputs:
| Name | Type | Required | Description |
|---|---|---|---|
| df | pd.DataFrame | Yes | DataFrame whose columns are to be clustered. |
| header_prefix | List[str] | No | Columns to strip before clustering and prepend to first cluster. |
| maxsize | int | No (default 20) | Maximum number of columns allowed in a single cluster. |
| average_record_length_threshold | float | No (default 0) | Record-length cap per cluster; 0 disables. |
| method | str | No (default "single") | Scipy linkage method for hierarchical clustering. |
| numeric_cat | List[str] | No | Additional columns to treat as categorical. |
| plot | bool | No (default False) | If True, display a dendrogram. |
| isolate_complex_field | bool | No (default True) | Isolate high-uniqueness alphanumeric columns into their own batches. |
Outputs
| Name | Type | Description |
|---|---|---|
| DataFrameBatch instance | DataFrameBatch | Fully initialized object with batches dict, master_header_list, and on-disk directory structure. |
| cluster() return | List[List[str]] | List of column-name lists, each list representing one batch. |
Usage Examples
Basic Example: Uniform Splitting
from gretel_synthetics.batch import DataFrameBatch
config = {
    "checkpoint_dir": "/tmp/my_model",
    "field_delimiter": ",",
    "overwrite": True,
}
batcher = DataFrameBatch(df=my_dataframe, batch_size=10, config=config)
# batcher.batches now maps batch indices to Batch objects
print(f"Number of batches: {len(batcher.batches)}")
Advanced Example: Correlation-Based Clustering
from gretel_synthetics.batch import DataFrameBatch
from gretel_synthetics.utils.header_clusters import cluster
# Compute correlation-aware column clusters
clusters = cluster(my_dataframe, maxsize=15, method="single")
config = {
    "checkpoint_dir": "/tmp/my_model",
    "field_delimiter": ",",
    "overwrite": True,
}
batcher = DataFrameBatch(
    df=my_dataframe,
    batch_headers=clusters,
    config=config,
)