Implementation: Gretelai gretel-synthetics DataFrameBatch Init
| Knowledge Sources | |
|---|---|
| Domains | Synthetic_Data, Tabular_Data |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
Concrete tool, provided by the gretel-synthetics library, for initializing a batch synthesis run and partitioning a DataFrame into column-clustered batches.
Description
The DataFrameBatch.__init__ method is the entry point for the entire batch synthesis workflow. In write mode it accepts a source DataFrame and a configuration template, partitions the columns into batches (either via explicit batch_headers or by automatic equal-size splitting), and creates a directory structure under the checkpoint directory with one sub-directory per batch. Each sub-directory receives its own Batch dataclass instance containing the column headers, a dedicated config, and file paths for training data and model checkpoints.
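When batch_headers is not supplied, the automatic equal-size split amounts to chunking the column list into consecutive groups. A minimal sketch in plain pandas (the helper name split_headers is ours for illustration, not part of the library's API):

```python
import pandas as pd

def split_headers(df: pd.DataFrame, batch_size: int = 15) -> list:
    """Partition DataFrame columns into consecutive groups of at most
    ``batch_size`` columns, mirroring the uniform-split behavior."""
    headers = list(df.columns)
    return [headers[i:i + batch_size] for i in range(0, len(headers), batch_size)]

# 37 columns with batch_size=15 yields groups of 15, 15, and 7 columns
df = pd.DataFrame(columns=[f"col_{i}" for i in range(37)])
batches = split_headers(df, batch_size=15)
print([len(b) for b in batches])  # → [15, 15, 7]
```

Note that this split is purely positional: adjacent columns end up in the same batch regardless of how related they are, which is the motivation for the correlation-based alternative described below.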
In read mode it reconstructs the batch dictionary from a previously written checkpoint directory, loading headers, configs, and any saved validators from disk, and optionally validates the loaded models by running a single-line generation test.
The companion header_clusters.cluster() function provides correlation-based column clustering as an alternative to uniform splitting. It computes a mixed-type correlation matrix, performs hierarchical agglomerative clustering with optimal leaf ordering, and returns a list of column-name lists that can be passed directly as the batch_headers parameter.
Usage
Use DataFrameBatch.__init__ at the start of any batch synthesis pipeline to set up directory structure and partition columns. Use header_clusters.cluster() when you want correlation-aware column grouping rather than naive equal-size splits.
Code Reference
Source Location
- Repository: gretel-synthetics
- File: src/gretel_synthetics/batch.py (lines 952-1127) and src/gretel_synthetics/utils/header_clusters.py (lines 230-342)
Signature
# DataFrameBatch constructor
class DataFrameBatch:
    def __init__(
        self,
        *,
        df: pd.DataFrame = None,
        batch_size: int = BATCH_SIZE,
        batch_headers: List[List[str]] = None,
        config: Union[dict, BaseConfig] = None,
        tokenizer: BaseTokenizerTrainer = None,
        mode: str = WRITE,
        checkpoint_dir: str = None,
        validate_model: bool = True,
    ):
# header_clusters.cluster function
def cluster(
    df: pd.DataFrame,
    header_prefix: List[str] = None,
    maxsize: int = 20,
    average_record_length_threshold: float = 0,
    method: str = "single",
    numeric_cat: List[str] = None,
    plot: bool = False,
    isolate_complex_field: bool = True,
) -> List[List[str]]:
Import
from gretel_synthetics.batch import DataFrameBatch
from gretel_synthetics.utils.header_clusters import cluster
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| df | pd.DataFrame | Yes (write mode) | The source DataFrame to partition into batches. |
| batch_size | int | No (default 15) | Max number of columns per batch when auto-splitting. |
| batch_headers | List[List[str]] | No | Explicit column groupings; overrides batch_size splitting. |
| config | Union[dict, BaseConfig] | Yes | Template config containing at minimum checkpoint_dir and field_delimiter. |
| tokenizer | BaseTokenizerTrainer | No | Optional custom tokenizer trainer class. |
| mode | str | No (default "write") | Either "write" (create new batches) or "read" (load existing). |
| checkpoint_dir | str | No | Required only in read mode when config is not a dict. |
| validate_model | bool | No (default True) | In read mode, run a generation test to validate loaded models. |
header_clusters.cluster() inputs:
| Name | Type | Required | Description |
|---|---|---|---|
| df | pd.DataFrame | Yes | DataFrame whose columns are to be clustered. |
| header_prefix | List[str] | No | Columns to strip before clustering and prepend to first cluster. |
| maxsize | int | No (default 20) | Maximum number of columns allowed in a single cluster. |
| average_record_length_threshold | float | No (default 0) | Record-length cap per cluster; 0 disables. |
| method | str | No (default "single") | Scipy linkage method for hierarchical clustering. |
| numeric_cat | List[str] | No | Additional columns to treat as categorical. |
| plot | bool | No (default False) | If True, display a dendrogram. |
| isolate_complex_field | bool | No (default True) | Isolate high-uniqueness alphanumeric columns into their own batches. |
Outputs
| Name | Type | Description |
|---|---|---|
| DataFrameBatch instance | DataFrameBatch | Fully initialized object with batches dict, master_header_list, and on-disk directory structure. |
| cluster() return | List[List[str]] | List of column-name lists, each list representing one batch. |
Usage Examples
Basic Example: Uniform Splitting
from gretel_synthetics.batch import DataFrameBatch
config = {
    "checkpoint_dir": "/tmp/my_model",
    "field_delimiter": ",",
    "overwrite": True,
}
batcher = DataFrameBatch(df=my_dataframe, batch_size=10, config=config)
# batcher.batches now maps batch indices to Batch objects
print(f"Number of batches: {len(batcher.batches)}")
Advanced Example: Correlation-Based Clustering
from gretel_synthetics.batch import DataFrameBatch
from gretel_synthetics.utils.header_clusters import cluster
# Compute correlation-aware column clusters
clusters = cluster(my_dataframe, maxsize=15, method="single")
config = {
    "checkpoint_dir": "/tmp/my_model",
    "field_delimiter": ",",
    "overwrite": True,
}
batcher = DataFrameBatch(
    df=my_dataframe,
    batch_headers=clusters,
    config=config,
)