Workflow:Snorkel team Snorkel Data Augmentation

Knowledge Sources	Snorkel Snorkel Documentation
Domains	Data_Augmentation, Training_Data, Machine_Learning
Last Updated	2026-02-14 20:00 GMT

Overview

End-to-end process for programmatically augmenting training datasets by defining transformation functions and applying them through configurable policies to generate additional training examples.

Description

This workflow enables systematic data augmentation using Snorkel's operator pattern. Users define transformation functions (TFs) that specify atomic transformations to data points (e.g., synonym replacement, random deletion, image rotation). A policy determines which sequences of TFs to apply to each data point and how many augmented copies to generate. The TF applier orchestrates the process, applying the policy-generated TF sequences to each data point and collecting the transformed outputs into an augmented dataset. TFs build on the same Mapper infrastructure that labeling functions use for their preprocessors, so they inherit its preprocessing and caching capabilities.

Usage

Execute this workflow when you need to increase the size or diversity of your training dataset. This is appropriate when you have limited labeled data and can define meaningful transformations that preserve the label (e.g., text paraphrasing, image augmentations). The output is an augmented DataFrame or list of data points that can be combined with the original data for training.

Execution Steps

Step 1: Define Transformation Functions

Create transformation functions that specify atomic data transformations. Each TF takes a data point and returns either a transformed copy or None (if the transformation is not applicable). TFs inherit from the Mapper class and can use the decorator syntax for simple transformations or be subclassed for complex logic.

Key considerations:

Each TF should represent a single, atomic transformation
Return None if the transformation cannot be applied to a given data point
TFs should preserve the data point's label (label-preserving transformations)
Use the transformation_function decorator for simple function-based TFs
TF names must be unique within the applier
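
The TF contract described above can be sketched in plain Python: take a data point, return a transformed copy, or return None when the transformation does not apply. This sketch deliberately avoids importing Snorkel so that it runs anywhere; in Snorkel the same function body would typically sit under the transformation_function decorator. The text and label fields and the SYNONYMS table are hypothetical, chosen only for illustration.

```python
import copy
from types import SimpleNamespace
from typing import Optional

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"good": "great", "bad": "poor"}


def replace_first_synonym(x: SimpleNamespace) -> Optional[SimpleNamespace]:
    """Replace the first word that has a known synonym; return None if none does."""
    words = x.text.split()
    for i, word in enumerate(words):
        if word in SYNONYMS:
            x_new = copy.deepcopy(x)  # work on a copy, never on the caller's data point
            words[i] = SYNONYMS[word]
            x_new.text = " ".join(words)
            return x_new  # the label field is carried over unchanged
    return None  # transformation not applicable to this data point


original = SimpleNamespace(text="a good movie", label=1)
augmented = replace_first_synonym(original)
skipped = replace_first_synonym(SimpleNamespace(text="a movie", label=0))
```

The function does one thing only (a single synonym swap), leaves the original untouched, and signals "not applicable" with None rather than returning an unchanged copy, which would otherwise show up as a duplicate in the augmented set.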

Step 2: Configure Augmentation Policy

Create a policy that determines how TFs are sequenced and how many augmented examples to generate per original data point. The base Policy class supports configuring the number of transformed copies (n_per_original) and whether to keep the original data point (keep_original). Sampling policies (RandomPolicy, MeanFieldPolicy) provide stochastic TF sequencing strategies.

Key considerations:

n_per_original controls how many augmented copies to create per data point
keep_original (default True) includes the untransformed data point in the output
RandomPolicy samples TF sequences uniformly at random
MeanFieldPolicy samples each TF in a sequence independently from a per-TF probability distribution
The sequence_length parameter controls how many TFs are chained per augmented example
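
To make the policy parameters concrete, here is a small plain-Python model of what a sampling policy produces for one data point: a list of TF-index sequences. It is a sketch of the behavior described above, not Snorkel's source. The function name generate_sequences is hypothetical, and representing the kept original as an empty sequence is an assumption of this sketch. Passing no weights mimics uniform sampling in the style of RandomPolicy; passing p mimics per-TF probabilities in the style of MeanFieldPolicy.

```python
import random
from typing import List, Optional, Sequence


def generate_sequences(
    n_tfs: int,
    sequence_length: int = 1,
    n_per_original: int = 1,
    keep_original: bool = True,
    p: Optional[Sequence[float]] = None,
    rng: Optional[random.Random] = None,
) -> List[List[int]]:
    """Return the TF-index sequences to apply to a single data point."""
    rng = rng or random.Random()
    # Assumption of this sketch: an empty sequence means "keep the original".
    sequences: List[List[int]] = [[]] if keep_original else []
    for _ in range(n_per_original):
        # weights=None samples uniformly; otherwise each position in the
        # sequence is drawn independently from the distribution p.
        sequences.append(rng.choices(range(n_tfs), weights=p, k=sequence_length))
    return sequences


# Uniform sampling over 3 TFs, 2 augmented copies plus the original.
uniform = generate_sequences(
    n_tfs=3, sequence_length=2, n_per_original=2, rng=random.Random(0)
)

# All probability mass on TF 1, so every sampled position is index 1.
weighted = generate_sequences(
    n_tfs=3, sequence_length=2, n_per_original=4, keep_original=False,
    p=[0.0, 1.0, 0.0],
)
```

With keep_original enabled, each data point yields n_per_original + 1 sequences; with it disabled, exactly n_per_original.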

Step 3: Apply Transformations

Execute the TFs according to the policy using PandasTFApplier (for DataFrames) or TFApplier (for lists of data points). For each data point, the applier asks the policy for TF sequences, applies each sequence in order, and collects all successfully transformed data points into the augmented dataset.

Key considerations:

PandasTFApplier.apply returns a DataFrame containing all augmented data points
Use apply_generator for memory-efficient batch processing in training loops
A TF that returns None is skipped and the rest of the sequence still runs; the augmented copy is dropped only if every TF in the sequence returns None
The original data points are included when keep_original is True in the policy
Use the progress_bar argument of apply to toggle a progress bar when monitoring long-running augmentation
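
The applier loop can be modeled in a few lines of plain Python. This is a sketch under stated assumptions, not Snorkel's implementation: data points are plain dicts, the TF sequences are fixed rather than re-sampled per data point, an empty sequence stands for the original, and the helper names (apply_sequences, apply_in_batches, shout, drop_first_word) are hypothetical. It shows how None results are handled and how a generator can yield augmented data one batch of input points at a time.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional, Sequence

DataPoint = Dict[str, Any]
TF = Callable[[DataPoint], Optional[DataPoint]]


def apply_sequences(
    x: DataPoint, tfs: Sequence[TF], sequences: Iterable[Sequence[int]]
) -> List[DataPoint]:
    """Apply each TF-index sequence to x; drop copies where no TF applied."""
    out: List[DataPoint] = []
    for seq in sequences:
        x_t = x
        applied = len(seq) == 0  # empty sequence: keep the original as-is
        for idx in seq:
            result = tfs[idx](x_t)
            if result is not None:  # a None result skips this TF only
                x_t, applied = result, True
        if applied:
            out.append(x_t)
    return out


def apply_in_batches(
    data: Iterable[DataPoint],
    tfs: Sequence[TF],
    sequences: Sequence[Sequence[int]],
    batch_size: int,
) -> Iterator[List[DataPoint]]:
    """Yield augmented data for batch_size input points at a time."""
    batch: List[DataPoint] = []
    for i, x in enumerate(data, start=1):
        batch.extend(apply_sequences(x, tfs, sequences))
        if i % batch_size == 0:
            yield batch
            batch = []
    if batch:
        yield batch


def shout(x: DataPoint) -> Optional[DataPoint]:
    return {**x, "text": x["text"].upper()}


def drop_first_word(x: DataPoint) -> Optional[DataPoint]:
    words = x["text"].split()
    return {**x, "text": " ".join(words[1:])} if len(words) > 1 else None


data = [{"text": "good movie", "label": 1}, {"text": "bad", "label": 0}]
tfs = [shout, drop_first_word]
sequences = [[], [1], [1, 0]]  # original, one TF, two chained TFs
batches = list(apply_in_batches(data, tfs, sequences, batch_size=1))
```

For the one-word data point, the single-TF sequence produces nothing because drop_first_word returns None, while the chained sequence still yields a copy because shout applies after the skipped TF.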

Step 4: Combine and Validate Augmented Data

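Merge the augmented data with the original training data and validate that the augmented examples are well-formed. If the policy was configured with keep_original set to True, the applier output already contains the original data points, so merging it with the original data again would duplicate them. Ensure labels are properly propagated to augmented examples and that the augmented dataset maintains the expected schema and data quality.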

Key considerations:

Augmented data inherits labels from the original data points, provided the label is stored as a field of the data point and no TF modifies it
Check that TFs did not introduce data quality issues (missing fields, invalid values)
The augmented dataset may have significantly more rows than the original: up to n_per_original + 1 rows per original data point when keep_original is True
Consider balancing the ratio of original to augmented examples if needed
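
The checks above can be sketched for a list of dict data points using only the standard library. The schema (REQUIRED_FIELDS), the label set (VALID_LABELS), and the helper names validate and combine are hypothetical; a DataFrame-based pipeline would express the same checks with column operations instead.

```python
from collections import Counter
from typing import Any, Dict, List

Row = Dict[str, Any]

REQUIRED_FIELDS = ("text", "label")  # hypothetical schema
VALID_LABELS = {0, 1}                # hypothetical label set


def validate(rows: List[Row]) -> List[str]:
    """Return human-readable problems; an empty list means rows look well-formed."""
    problems: List[str] = []
    for i, row in enumerate(rows):
        missing = [f for f in REQUIRED_FIELDS if row.get(f) in (None, "")]
        if missing:
            problems.append(f"row {i}: missing {missing}")
        elif row["label"] not in VALID_LABELS:
            problems.append(f"row {i}: invalid label {row['label']!r}")
    return problems


def combine(original: List[Row], augmented: List[Row]) -> List[Row]:
    """Concatenate, dropping rows that exactly duplicate an earlier row."""
    seen, combined = set(), []
    for row in original + augmented:
        key = tuple(row.get(f) for f in REQUIRED_FIELDS)
        if key not in seen:
            seen.add(key)
            combined.append(row)
    return combined


original = [{"text": "good movie", "label": 1}, {"text": "bad plot", "label": 0}]
augmented = [
    {"text": "great movie", "label": 1},
    {"text": "good movie", "label": 1},  # exact duplicate of an original
    {"text": "", "label": 0},            # a TF emptied the text field
]

problems = validate(augmented)
clean = [row for row in augmented if not validate([row])]
combined = combine(original, clean)

# Ratio of augmented to original rows, and the resulting label balance.
ratio = (len(combined) - len(original)) / len(original)
label_counts = Counter(row["label"] for row in combined)
```

Deduplicating on the full row also guards against double-counting originals when the applier output already includes them.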
