Implementation:AUTOMATIC1111 Stable diffusion webui PersonalizedBase for textual inversion
| Knowledge Sources | |
|---|---|
| Domains | Textual Inversion, Dataset, Training Data, Stable Diffusion |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool for constructing a PyTorch Dataset that pre-encodes images to latent space, applies caption templates with placeholder tokens, and supports tag shuffling and dropout for textual inversion training, provided by the AUTOMATIC1111 stable-diffusion-webui repository.
Description
PersonalizedBase is a PyTorch Dataset subclass that handles the full pipeline from raw image files to training-ready DatasetEntry objects. During initialization, it:
- Reads a template file containing prompt patterns with [name] and [filewords] placeholders
- Iterates over all images in data_root, loading and resizing them to the specified width x height (unless varsize=True)
- Optionally extracts alpha channels for per-pixel loss weighting
- Pre-encodes each image through the VAE encoder to obtain latent representations, using the chosen latent_sampling_method ("once", "deterministic", or "random")
- Reads per-image caption text from companion .txt files or derives it from filenames
- Groups images by resolution for variable-size batching via GroupedBatchSampler
At access time (__getitem__), it applies tag shuffling, tag dropout, and random latent resampling as configured.
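The caption handling described above can be sketched in plain Python. This is a minimal illustration, not the repository's code: the function names load_caption and render_caption are hypothetical, and falling back to the bare filename stem is a simplifying assumption (the description only says captions are derived from filenames).

```python
import random
from pathlib import Path


def load_caption(image_path):
    """Return caption text from a companion .txt file, else the filename stem.

    Hypothetical helper; the stem fallback is an assumption for illustration.
    """
    path = Path(image_path)
    sidecar = path.with_suffix(".txt")
    if sidecar.exists():
        return sidecar.read_text(encoding="utf8").strip()
    return path.stem


def render_caption(template, caption, placeholder_token,
                   shuffle_tags=False, tag_drop_out=0.0, rng=random):
    """Fill one template line, optionally dropping and shuffling tags."""
    tags = [t.strip() for t in caption.split(",")]
    if tag_drop_out:
        # Keep each tag independently with probability 1 - tag_drop_out.
        tags = [t for t in tags if rng.random() > tag_drop_out]
    if shuffle_tags:
        rng.shuffle(tags)
    text = template.replace("[filewords]", ", ".join(tags))
    return text.replace("[name]", placeholder_token)


print(render_caption("a photo of [name], [filewords]",
                     "red hair, smiling", "my-concept"))
# prints: a photo of my-concept, red hair, smiling
```

Passing a seeded random.Random instance as rng makes the shuffling and dropout reproducible, which is convenient when debugging a template.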
Usage
Use this dataset class when:
- You are setting up the data pipeline for textual inversion embedding training
- You need pre-encoded latents to reduce VRAM usage during training
- You want to apply caption augmentation (tag shuffling, dropout) during training
- You are working with variable-resolution images that require aspect-ratio bucketing
Code Reference
Source Location
- Repository: stable-diffusion-webui
- File: modules/textual_inversion/dataset.py
- Lines: L32-173
Signature
class PersonalizedBase(Dataset):
    def __init__(
        self,
        data_root,
        width,
        height,
        repeats,
        flip_p=0.5,
        placeholder_token="*",
        model=None,
        cond_model=None,
        device=None,
        template_file=None,
        include_cond=False,
        batch_size=1,
        gradient_step=1,
        shuffle_tags=False,
        tag_drop_out=0,
        latent_sampling_method='once',
        varsize=False,
        use_weight=False
    ):
Import
from modules.textual_inversion.dataset import PersonalizedBase
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_root | str | Yes | Path to directory containing training images (and optional companion .txt caption files) |
| width | int | Yes | Target image width in pixels for resizing (ignored if varsize=True) |
| height | int | Yes | Target image height in pixels for resizing (ignored if varsize=True) |
| repeats | int | Yes | Number of times to repeat the dataset per epoch (used by the training loop for epoch-length calculation rather than by the dataset itself) |
| flip_p | float | No | Probability of random horizontal flip augmentation; defaults to 0.5 |
| placeholder_token | str | No | The token name to substitute for [name] in templates; defaults to "*" |
| model | object | No | The Stable Diffusion model, used for VAE encoding via encode_first_stage |
| cond_model | object | No | The CLIP conditioning model, used for pre-computing text conditions when include_cond=True |
| device | torch.device | No | Target device for tensor operations during encoding |
| template_file | str | No | Path to the prompt template text file containing one template per line |
| include_cond | bool | No | If True, pre-computes CLIP text embeddings during dataset construction; defaults to False |
| batch_size | int | No | Batch size for training; clamped to dataset length; defaults to 1 |
| gradient_step | int | No | Gradient accumulation steps; clamped based on dataset and batch size; defaults to 1 |
| shuffle_tags | bool | No | If True, randomly shuffles comma-separated tags in captions at each access; defaults to False |
| tag_drop_out | float | No | Probability of dropping each individual tag from captions; 0 means no dropout; defaults to 0 |
| latent_sampling_method | str | No | One of "once", "deterministic", or "random"; controls how latents are sampled from the VAE posterior; defaults to "once" |
| varsize | bool | No | If True, preserves original image aspect ratios and groups by resolution; defaults to False |
| use_weight | bool | No | If True, extracts alpha channels as per-pixel loss weights; defaults to False |
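The clamping of batch_size and gradient_step can be pictured as follows. The exact rule shown is an assumption for illustration (the table only states that the values are clamped based on dataset length and batch size), and clamp_training_sizes is a hypothetical helper, not a function in the repository.

```python
def clamp_training_sizes(dataset_length, batch_size=1, gradient_step=1):
    """Clamp batch size to the dataset and accumulation steps to the batch count."""
    if dataset_length <= 0:
        raise ValueError("dataset must contain at least one image")
    # A batch cannot hold more images than the dataset provides.
    batch_size = min(batch_size, dataset_length)
    # Assumed rule: no more accumulation steps than full batches per pass.
    gradient_step = min(gradient_step, dataset_length // batch_size)
    return batch_size, gradient_step


print(clamp_training_sizes(dataset_length=5, batch_size=8, gradient_step=4))   # (5, 1)
print(clamp_training_sizes(dataset_length=10, batch_size=2, gradient_step=8))  # (2, 5)
```

With a very small dataset, a requested batch size or accumulation count may therefore be reduced silently, so it is worth checking the effective values on the constructed dataset before training.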
Outputs
| Name | Type | Description |
|---|---|---|
| entry | DatasetEntry | A DatasetEntry with fields: filename, filename_text, latent_sample (or latent_dist if random), cond_text, weight |
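Why an entry carries either latent_sample or latent_dist can be illustrated with a toy diagonal Gaussian posterior. This is a sketch under assumptions, not the repository's code: ToyPosterior and ToyEntry are hypothetical stand-ins for the VAE posterior and for DatasetEntry, plain Python lists replace tensors, and "deterministic" is assumed to mean taking the posterior mean.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyPosterior:
    """Hypothetical stand-in for a diagonal Gaussian VAE posterior."""
    mean: List[float]
    std: List[float]

    def sample(self, rng):
        return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(self.mean, self.std)]


@dataclass
class ToyEntry:
    """Hypothetical stand-in for DatasetEntry: holds a sample or a distribution."""
    latent_sample: Optional[List[float]] = None
    latent_dist: Optional[ToyPosterior] = None


def build_entry(posterior, method, rng):
    if method == "deterministic":
        # Assumption: zero variance, so the stored sample is the posterior mean.
        return ToyEntry(latent_sample=list(posterior.mean))
    if method == "once":
        # Draw a single sample at construction time and reuse it afterwards.
        return ToyEntry(latent_sample=posterior.sample(rng))
    if method == "random":
        # Keep the distribution so a fresh sample can be drawn at each access.
        return ToyEntry(latent_dist=posterior)
    raise ValueError(f"unknown latent_sampling_method: {method!r}")


def get_latent(entry, rng):
    """Return the stored sample, or resample when only a distribution is kept."""
    if entry.latent_dist is not None:
        return entry.latent_dist.sample(rng)
    return entry.latent_sample
```

In this sketch only the "random" method pays a sampling cost at access time; the other two fix the latent once during dataset construction.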
Usage Examples
Basic Usage
# Run inside the webui process so that a model is already loaded.
from modules import devices, shared
from modules.textual_inversion.dataset import PersonalizedBase, PersonalizedDataLoader

ds = PersonalizedBase(
    data_root="/path/to/training/images",
    width=512,
    height=512,
    repeats=100,
    placeholder_token="my-concept",
    model=shared.sd_model,
    cond_model=shared.sd_model.cond_stage_model,
    device=devices.device,
    template_file="/path/to/template.txt",
    batch_size=2,
    gradient_step=1,
    shuffle_tags=True,
    tag_drop_out=0.1,
    latent_sampling_method="once"
)

# Use the dataset's own batch_size, which may have been clamped.
dl = PersonalizedDataLoader(ds, latent_sampling_method="once", batch_size=ds.batch_size)

for batch in dl:
    print(batch.cond_text)      # list of caption strings
    print(batch.latent_sample)  # stacked latent tensors
    break
Variable-Size Bucketing
ds = PersonalizedBase(
    data_root="/path/to/variable_size_images",
    width=512,
    height=512,
    repeats=100,
    placeholder_token="my-style",
    model=shared.sd_model,
    cond_model=shared.sd_model.cond_stage_model,
    device=devices.device,
    template_file="/path/to/template.txt",
    varsize=True,  # preserve original aspect ratios
    batch_size=4
)

# Images are grouped into buckets by resolution; GroupedBatchSampler
# ensures each batch contains same-sized images
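
The bucketing idea can be sketched without the webui. The following is a simplified illustration, not how GroupedBatchSampler actually allocates batches: group_by_resolution and same_size_batches are hypothetical helpers, and dropping each bucket's incomplete tail is an assumption made to keep the example short.

```python
import random
from collections import defaultdict


def group_by_resolution(sizes):
    """Return lists of dataset indices, one list per distinct (width, height)."""
    groups = defaultdict(list)
    for index, size in enumerate(sizes):
        groups[size].append(index)
    return list(groups.values())


def same_size_batches(groups, batch_size, rng=random):
    """Return batches whose indices all come from a single resolution bucket."""
    batches = []
    for group in groups:
        shuffled = group[:]
        rng.shuffle(shuffled)
        # Assumption: drop the incomplete tail so every batch is full.
        for start in range(0, len(shuffled) - batch_size + 1, batch_size):
            batches.append(shuffled[start:start + batch_size])
    # Mix buckets so training does not see one resolution at a time.
    rng.shuffle(batches)
    return batches


sizes = [(512, 512), (512, 768), (512, 512), (512, 768), (512, 512)]
groups = group_by_resolution(sizes)
print(groups)  # [[0, 2, 4], [1, 3]]
print(len(same_size_batches(groups, batch_size=2)))  # 2
```

Because latents of different spatial sizes cannot be stacked into one tensor, keeping each batch inside a single bucket is what makes variable-size training workable.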