Implementation:Gretelai Gretel synthetics DataFrameBatch Init

From Leeroopedia
Knowledge Sources
Domains Synthetic_Data, Tabular_Data
Last Updated 2026-02-14 19:00 GMT

Overview

A concrete tool from the gretel-synthetics library for initializing and partitioning a DataFrame into column-clustered batches.

Description

The DataFrameBatch.__init__ method is the entry point for the entire batch synthesis workflow. In write mode it accepts a source DataFrame and a configuration template, partitions the columns into batches (either via explicit batch_headers or by automatic equal-size splitting), and creates a directory structure under the checkpoint directory with one sub-directory per batch. Each sub-directory receives its own Batch dataclass instance containing the column headers, a dedicated config, and file paths for training data and model checkpoints.
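The automatic equal-size splitting can be pictured with a short sketch. This is not the library's code, just an illustration of how a column list is chunked into consecutive groups of at most batch_size:

```python
# Illustrative sketch (not gretel-synthetics source) of the equal-size
# column partitioning used in write mode when batch_headers is not given.
from typing import List


def split_headers(headers: List[str], batch_size: int) -> List[List[str]]:
    """Chunk column names into consecutive groups of at most batch_size."""
    return [headers[i:i + batch_size] for i in range(0, len(headers), batch_size)]


columns = ["age", "income", "city", "state", "zip"]
print(split_headers(columns, 2))
# [['age', 'income'], ['city', 'state'], ['zip']]
```

Each resulting group corresponds to one batch sub-directory and one dedicated model.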

In read mode it reconstructs the batch dictionary from a previously written checkpoint directory, loading headers, configs, and any saved validators from disk, and optionally validates the loaded models by running a single-line generation test.

The companion header_clusters.cluster() function provides correlation-based column clustering as an alternative to uniform splitting. It computes a mixed-type correlation matrix, performs hierarchical agglomerative clustering with optimal leaf ordering, and returns a list of column-name lists that can be passed directly as the batch_headers parameter.
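To make the clustering idea concrete: the real implementation builds a mixed-type correlation matrix and runs scipy hierarchical agglomerative clustering with optimal leaf ordering. The following is a deliberately simplified greedy stand-in, shown only to illustrate the input/output shape (a matrix of column data in, a list of column-name lists out); the function name and threshold are this sketch's own, not the library's:

```python
# Simplified, hypothetical illustration of correlation-aware column grouping.
# header_clusters.cluster() uses scipy hierarchical clustering; this greedy
# pass over absolute Pearson correlations only demonstrates the concept.
import numpy as np


def greedy_corr_clusters(data: np.ndarray, names: list, maxsize: int = 3,
                         threshold: float = 0.5) -> list:
    corr = np.abs(np.corrcoef(data, rowvar=False))
    unassigned = list(range(len(names)))
    clusters = []
    while unassigned:
        seed = unassigned.pop(0)
        group = [seed]
        # Pull in the columns most correlated with the seed, up to maxsize.
        for j in sorted(unassigned, key=lambda j: -corr[seed, j]):
            if len(group) >= maxsize:
                break
            if corr[seed, j] >= threshold:
                group.append(j)
                unassigned.remove(j)
        clusters.append([names[i] for i in group])
    return clusters


rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)  # strongly correlated with a
c = rng.normal(size=200)                 # independent of both
clusters = greedy_corr_clusters(np.column_stack([a, b, c]), ["a", "b", "c"])
print(clusters)
# [['a', 'b'], ['c']]
```

The returned list-of-lists is exactly the shape that batch_headers expects.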

Usage

Use DataFrameBatch.__init__ at the start of any batch synthesis pipeline to set up directory structure and partition columns. Use header_clusters.cluster() when you want correlation-aware column grouping rather than naive equal-size splits.

Code Reference

Source Location

  • Repository: gretel-synthetics
  • File: src/gretel_synthetics/batch.py (lines 952-1127) and src/gretel_synthetics/utils/header_clusters.py (lines 230-342)

Signature

# DataFrameBatch constructor
class DataFrameBatch:
    def __init__(
        self,
        *,
        df: pd.DataFrame = None,
        batch_size: int = BATCH_SIZE,
        batch_headers: List[List[str]] = None,
        config: Union[dict, BaseConfig] = None,
        tokenizer: BaseTokenizerTrainer = None,
        mode: str = WRITE,
        checkpoint_dir: str = None,
        validate_model: bool = True,
    ):

# header_clusters.cluster function
def cluster(
    df: pd.DataFrame,
    header_prefix: List[str] = None,
    maxsize: int = 20,
    average_record_length_threshold: float = 0,
    method: str = "single",
    numeric_cat: List[str] = None,
    plot: bool = False,
    isolate_complex_field: bool = True,
) -> List[List[str]]:

Import

from gretel_synthetics.batch import DataFrameBatch
from gretel_synthetics.utils.header_clusters import cluster

I/O Contract

Inputs

Name | Type | Required | Description
df | pd.DataFrame | Yes (write mode) | The source DataFrame to partition into batches.
batch_size | int | No (default 15) | Max number of columns per batch when auto-splitting.
batch_headers | List[List[str]] | No | Explicit column groupings; overrides batch_size splitting.
config | Union[dict, BaseConfig] | Yes | Template config containing at minimum checkpoint_dir and field_delimiter.
tokenizer | BaseTokenizerTrainer | No | Optional custom tokenizer trainer class.
mode | str | No (default "write") | Either "write" (create new batches) or "read" (load existing).
checkpoint_dir | str | No | Required only in read mode when config is not a dict.
validate_model | bool | No (default True) | In read mode, run a generation test to validate loaded models.

header_clusters.cluster() inputs:

Name | Type | Required | Description
df | pd.DataFrame | Yes | DataFrame whose columns are to be clustered.
header_prefix | List[str] | No | Columns to strip before clustering and prepend to the first cluster.
maxsize | int | No (default 20) | Maximum number of columns allowed in a single cluster.
average_record_length_threshold | float | No (default 0) | Record-length cap per cluster; 0 disables.
method | str | No (default "single") | Scipy linkage method for hierarchical clustering.
numeric_cat | List[str] | No | Additional columns to treat as categorical.
plot | bool | No (default False) | If True, display a dendrogram.
isolate_complex_field | bool | No (default True) | Isolate high-uniqueness alphanumeric columns into their own batches.
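One plausible reading of average_record_length_threshold, sketched below, is that it caps the combined average string length contributed by a cluster's columns, splitting a cluster when the next column would push it over the cap. This is an illustration of why the parameter exists, not the library's exact algorithm; the function name and sample lengths are hypothetical:

```python
# Hedged sketch of average_record_length_threshold semantics: keep the
# summed average string length of each cluster's columns under a cap,
# splitting when a column would exceed it. A threshold of 0 disables
# the check, matching the documented default.
def cap_by_record_length(cluster, avg_lengths, threshold):
    if threshold <= 0:
        return [cluster]  # 0 disables the cap
    out, current, total = [], [], 0.0
    for col in cluster:
        if current and total + avg_lengths[col] > threshold:
            out.append(current)
            current, total = [], 0.0
        current.append(col)
        total += avg_lengths[col]
    if current:
        out.append(current)
    return out


lengths = {"notes": 40.0, "city": 8.0, "state": 2.0, "bio": 35.0}
print(cap_by_record_length(["notes", "city", "state", "bio"], lengths, 50.0))
# [['notes', 'city', 'state'], ['bio']]
```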

Outputs

Name | Type | Description
DataFrameBatch instance | DataFrameBatch | Fully initialized object with batches dict, master_header_list, and on-disk directory structure.
cluster() return | List[List[str]] | List of column-name lists, each list representing one batch.
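A hypothetical mini-model can make these output shapes concrete. The batches dict maps integer indices to Batch objects carrying, among other things, that batch's column headers; field names beyond headers below are illustrative, not the library's exact attributes:

```python
# Hypothetical mini-model of the output structures described above; the
# real Batch dataclass lives in gretel_synthetics.batch and has more fields.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Batch:
    headers: List[str]       # columns owned by this batch
    checkpoint_dir: str      # illustrative per-batch directory path


batches: Dict[int, Batch] = {
    0: Batch(["age", "income"], "/tmp/my_model/batch_0"),
    1: Batch(["city", "state"], "/tmp/my_model/batch_1"),
}
# master_header_list is the concatenation of all batch headers, in order.
master_header_list = [h for b in batches.values() for h in b.headers]
print(master_header_list)
# ['age', 'income', 'city', 'state']
```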

Usage Examples

Basic Example: Uniform Splitting

from gretel_synthetics.batch import DataFrameBatch

config = {
    "checkpoint_dir": "/tmp/my_model",
    "field_delimiter": ",",
    "overwrite": True,
}

batcher = DataFrameBatch(df=my_dataframe, batch_size=10, config=config)
# batcher.batches now maps batch indices to Batch objects
print(f"Number of batches: {len(batcher.batches)}")

Advanced Example: Correlation-Based Clustering

from gretel_synthetics.batch import DataFrameBatch
from gretel_synthetics.utils.header_clusters import cluster

# Compute correlation-aware column clusters
clusters = cluster(my_dataframe, maxsize=15, method="single")

config = {
    "checkpoint_dir": "/tmp/my_model",
    "field_delimiter": ",",
    "overwrite": True,
}

batcher = DataFrameBatch(
    df=my_dataframe,
    batch_headers=clusters,
    config=config,
)

Related Pages

Implements Principle

Requires Environment
