Implementation:AUTOMATIC1111 Stable diffusion webui PersonalizedBase for textual inversion

Knowledge Sources	stable-diffusion-webui
Domains	Textual Inversion, Dataset, Training Data, Stable Diffusion
Last Updated	2026-02-08 00:00 GMT

Overview

PersonalizedBase is the PyTorch Dataset that the AUTOMATIC1111 stable-diffusion-webui repository provides for textual inversion training. It pre-encodes images to latent space, fills caption templates with a placeholder token, and supports tag shuffling and tag dropout.

Description

PersonalizedBase is a PyTorch Dataset subclass that handles the full pipeline from raw image files to training-ready DatasetEntry objects. During initialization, it:

Reads a template file containing prompt patterns with [name] and [filewords] placeholders
Iterates over all images in data_root, loading and resizing them to the specified width x height (unless varsize=True)
Optionally extracts alpha channels for per-pixel loss weighting
Pre-encodes each image through the VAE encoder to obtain latent representations, using the chosen latent_sampling_method ("once", "deterministic", or "random")
Reads per-image caption text from companion .txt files or derives it from filenames
Groups images by resolution for variable-size batching via GroupedBatchSampler

At access time (__getitem__), it applies tag shuffling, tag dropout, and random latent resampling as configured.
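The caption step can be illustrated with a small standalone sketch. It only mirrors the behaviour described above (template selection, [name] and [filewords] substitution, tag dropout, tag shuffling); the function name create_caption and its rng parameter are hypothetical and are not taken from the repository.

```python
import random


def create_caption(template_lines, filewords, placeholder_token,
                   shuffle_tags=False, tag_drop_out=0.0, rng=random):
    """Fill one randomly chosen template with the token and the (augmented) tags."""
    template = rng.choice(template_lines)
    tags = filewords.split(",")
    if tag_drop_out != 0:
        # Keep each tag with probability (1 - tag_drop_out).
        tags = [t for t in tags if rng.random() > tag_drop_out]
    if shuffle_tags:
        rng.shuffle(tags)
    caption = template.replace("[filewords]", ",".join(tags))
    return caption.replace("[name]", placeholder_token)


templates = ["a photo of [name], [filewords]"]
caption = create_caption(templates, "red hat,blue sky,smiling", "my-concept")
print(caption)  # a photo of my-concept, red hat,blue sky,smiling
```

With shuffle_tags or a non-zero tag_drop_out the caption changes between calls, which is why such augmentation has to happen at access time rather than once during construction.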

Usage

Use this dataset class when:

You are setting up the data pipeline for textual inversion embedding training
You need pre-encoded latents to reduce VRAM usage during training
You want to apply caption augmentation (tag shuffling, tag dropout) during training
You are working with variable-resolution images that require aspect-ratio bucketing

Code Reference

Source Location

Repository: stable-diffusion-webui
File: modules/textual_inversion/dataset.py
Lines: L32-173

Signature

class PersonalizedBase(Dataset):
    def __init__(
        self,
        data_root,
        width,
        height,
        repeats,
        flip_p=0.5,
        placeholder_token="*",
        model=None,
        cond_model=None,
        device=None,
        template_file=None,
        include_cond=False,
        batch_size=1,
        gradient_step=1,
        shuffle_tags=False,
        tag_drop_out=0,
        latent_sampling_method='once',
        varsize=False,
        use_weight=False
    ):

Import

from modules.textual_inversion.dataset import PersonalizedBase

I/O Contract

Inputs

Name	Type	Required	Description
data_root	str	Yes	Path to directory containing training images (and optional companion `.txt` caption files)
width	int	Yes	Target image width in pixels for resizing (ignored if `varsize=True`)
height	int	Yes	Target image height in pixels for resizing (ignored if `varsize=True`)
repeats	int	Yes	Number of times to repeat the dataset per epoch (stored on the dataset and used by the training loop to calculate epoch length)
flip_p	float	No	Probability of random horizontal flip augmentation; defaults to `0.5`
placeholder_token	str	No	The token name to substitute for `[name]` in templates; defaults to `"*"`
model	object	No	The Stable Diffusion model, used for VAE encoding via `encode_first_stage`
cond_model	object	No	The CLIP conditioning model, used for pre-computing text conditions when `include_cond=True`
device	torch.device	No	Target device for tensor operations during encoding
template_file	str	No	Path to the prompt template text file containing one template per line
include_cond	bool	No	If True, pre-computes CLIP text embeddings during dataset construction; defaults to `False`
batch_size	int	No	Batch size for training; clamped to dataset length; defaults to `1`
gradient_step	int	No	Gradient accumulation steps; clamped based on dataset and batch size; defaults to `1`
shuffle_tags	bool	No	If True, randomly shuffles comma-separated tags in captions at each access; defaults to `False`
tag_drop_out	float	No	Probability of dropping each individual tag from captions; `0` means no dropout; defaults to `0`
latent_sampling_method	str	No	One of `"once"`, `"deterministic"`, or `"random"`; controls how latents are sampled from the VAE posterior; defaults to `"once"`
varsize	bool	No	If True, preserves original image aspect ratios and groups by resolution; defaults to `False`
use_weight	bool	No	If True, extracts alpha channels as per-pixel loss weights; defaults to `False`
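The clamping of batch_size and gradient_step can be sketched as follows. The table only states that both values are clamped; the exact rule used here (batch size capped at the dataset length, accumulation steps capped at the number of full batches) is an assumption, and clamp_batch_settings is a hypothetical helper rather than a function in the repository.

```python
def clamp_batch_settings(dataset_length, batch_size=1, gradient_step=1):
    """Return (batch_size, gradient_step) limited by the number of usable images."""
    if dataset_length <= 0:
        raise ValueError("dataset must contain at least one image")
    # Assumed rule: a batch cannot hold more images than the dataset has.
    batch_size = min(batch_size, dataset_length)
    # Assumed rule: at most as many accumulation steps as there are full batches.
    gradient_step = min(gradient_step, dataset_length // batch_size)
    return batch_size, gradient_step


print(clamp_batch_settings(5, batch_size=8, gradient_step=4))   # (5, 1)
print(clamp_batch_settings(10, batch_size=4, gradient_step=4))  # (4, 2)
```

Because the values may be reduced, downstream code should read them back from the dataset object (as the Basic Usage example does with ds.batch_size) instead of reusing the requested values.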

Outputs

Name	Type	Description
entry	DatasetEntry	A `DatasetEntry` with fields: `filename`, `filename_text`, `latent_sample` (or `latent_dist` if random), `cond_text`, `weight`
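The difference between carrying latent_sample and carrying latent_dist can be shown with a toy model of the three sampling methods. This is a sketch under stated assumptions: ToyPosterior and ToyEntry are hypothetical stand-ins written in plain Python, and treating "deterministic" as a posterior with zero standard deviation is an interpretation, not something this page confirms.

```python
import random


class ToyPosterior:
    """Stand-in for a diagonal Gaussian VAE posterior (hypothetical class)."""

    def __init__(self, mean, std):
        self.mean = list(mean)
        self.std = list(std)

    def sample(self, rng):
        return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(self.mean, self.std)]


class ToyEntry:
    """Holds either a fixed sample or the distribution, depending on the method."""

    def __init__(self, posterior, method, rng):
        self.method = method
        self.rng = rng
        if method == "deterministic":
            # Assumption: zero spread, so the sample collapses to the mean.
            posterior.std = [0.0] * len(posterior.std)
        if method == "random":
            self.latent_dist = posterior      # keep the distribution
            self.latent_sample = None
        else:
            self.latent_dist = None
            self.latent_sample = posterior.sample(rng)  # drawn once, then reused

    def get(self):
        if self.method == "random":
            # Resample on every access.
            self.latent_sample = self.latent_dist.sample(self.rng)
        return self.latent_sample


rng = random.Random(0)
entry = ToyEntry(ToyPosterior([1.0, 2.0], [0.5, 0.5]), "deterministic", rng)
print(entry.get())  # [1.0, 2.0]
```

In this toy model, "once" and "deterministic" return the same values on every access, while "random" trades extra work per access for fresh noise each time.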

Usage Examples

Basic Usage

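# Assumes this runs inside the webui process with a checkpoint already loaded.
from modules import devices, shared
from modules.textual_inversion.dataset import PersonalizedBase, PersonalizedDataLoader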

ds = PersonalizedBase(
    data_root="/path/to/training/images",
    width=512,
    height=512,
    repeats=100,
    placeholder_token="my-concept",
    model=shared.sd_model,
    cond_model=shared.sd_model.cond_stage_model,
    device=devices.device,
    template_file="/path/to/template.txt",
    batch_size=2,
    gradient_step=1,
    shuffle_tags=True,
    tag_drop_out=0.1,
    latent_sampling_method="once"
)

dl = PersonalizedDataLoader(ds, latent_sampling_method="once", batch_size=ds.batch_size)

for batch in dl:
    print(batch.cond_text)       # list of caption strings
    print(batch.latent_sample)   # stacked latent tensors
    break

Variable-Size Bucketing

from modules import devices, shared
from modules.textual_inversion.dataset import PersonalizedBase

ds = PersonalizedBase(
    data_root="/path/to/variable_size_images",
    width=512,
    height=512,
    repeats=100,
    placeholder_token="my-style",
    model=shared.sd_model,
    cond_model=shared.sd_model.cond_stage_model,
    device=devices.device,
    template_file="/path/to/template.txt",
    varsize=True,  # preserve original aspect ratios
    batch_size=4
)
# Images are grouped into buckets by resolution; GroupedBatchSampler
# ensures each batch contains same-sized images
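The bucketing idea can be sketched without the webui. The code below groups dataset indices by image size and then forms batches from one bucket at a time. It is a simplified stand-in, not the algorithm used by GroupedBatchSampler; the helper names group_by_size and same_size_batches are hypothetical, and dropping incomplete batches is an assumption made to keep the example short.

```python
import random
from collections import defaultdict


def group_by_size(sizes):
    """Map each (width, height) to the dataset indices that have that size."""
    groups = defaultdict(list)
    for index, size in enumerate(sizes):
        groups[size].append(index)
    return list(groups.values())


def same_size_batches(groups, batch_size, rng=random):
    """Return batches that each draw from a single bucket; leftovers are dropped."""
    batches = []
    for group in groups:
        shuffled = rng.sample(group, len(group))
        for start in range(0, len(shuffled) - batch_size + 1, batch_size):
            batches.append(shuffled[start:start + batch_size])
    rng.shuffle(batches)  # mix buckets across the epoch
    return batches


sizes = [(512, 512), (512, 768), (512, 512), (512, 768), (512, 512)]
groups = group_by_size(sizes)
print(groups)  # [[0, 2, 4], [1, 3]]
for batch in same_size_batches(groups, batch_size=2):
    print(batch, {sizes[i] for i in batch})
```

Keeping each batch within one bucket is what lets the latents be stacked into a single tensor, since tensors of different spatial sizes cannot be stacked.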

Related Pages

Implements Principle

Principle:AUTOMATIC1111_Stable_diffusion_webui_Training_dataset_preparation
