Principle:Snorkel team Snorkel Augmented Data Combination

Knowledge Sources	A Survey on Data Augmentation for Text Classification
Domains	Data_Augmentation, Data_Pipeline
Last Updated	2026-02-14 20:00 GMT

Overview

A validation and combination step that merges augmented data with original data and verifies data integrity before downstream training.

Description

Augmented Data Combination is the final step in the augmentation pipeline where original and augmented DataFrames are merged into a single training dataset. This is a user-defined pattern rather than a library API.

Key considerations:

Deduplication: Removing duplicate augmented examples
Balance checking: Ensuring augmented data does not over-represent certain classes
Integrity validation: Verifying that augmented examples maintain correct labels and valid features
Ratio control: Limiting the ratio of augmented to original data
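The integrity-validation consideration can be illustrated with a small pandas check. This is a minimal sketch, not a Snorkel API: the function name `find_integrity_issues` and the column names `text` and `label` are assumptions made for illustration. It can only flag structural problems (unknown labels, empty features). Whether a transformed example still deserves its label has to be judged separately.

```python
import pandas as pd


def find_integrity_issues(df_original, df_augmented, text_col="text", label_col="label"):
    """Return a list of human-readable problems found in the augmented rows.

    Hypothetical helper: column names are assumptions, not part of Snorkel.
    """
    issues = []
    # The augmented frame should have the same schema as the original.
    if list(df_augmented.columns) != list(df_original.columns):
        issues.append("column mismatch")
        return issues
    # Every augmented label should already occur in the original data.
    known_labels = set(df_original[label_col].tolist())
    unknown = set(df_augmented[label_col].tolist()) - known_labels
    if unknown:
        issues.append(f"unknown labels: {sorted(unknown)}")
    # Features should be present and non-blank.
    text = df_augmented[text_col]
    n_empty = int((text.isna() | (text.astype(str).str.strip() == "")).sum())
    if n_empty:
        issues.append(f"{n_empty} empty or missing texts")
    return issues
```

An empty list means none of these checks fired; it does not prove that the augmented labels are semantically correct.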

Usage

Use this principle after applying transformation functions. Combine original and augmented data, perform validation checks, and prepare the final training dataset.

Theoretical Basis

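The combined dataset: $\mathcal{D}_{\text{final}} = \mathcal{D}_{\text{original}} \cup \mathcal{D}_{\text{augmented}}$

Quality constraints:

$|\mathcal{D}_{\text{augmented}}| \leq r \cdot |\mathcal{D}_{\text{original}}|$ (ratio bound, where $r$ is the maximum allowed ratio of augmented to original examples)
The label distribution of $\mathcal{D}_{\text{final}}$ should approximately match that of $\mathcal{D}_{\text{original}}$ (balance preservation)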
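Both constraints can be checked numerically. The sketch below assumes pandas DataFrames with a label column named `label`; the helper names `satisfies_ratio_bound` and `max_label_shift` are hypothetical. Balance is measured here as the largest absolute change in any class proportion, which is one reasonable choice among several.

```python
import pandas as pd


def satisfies_ratio_bound(df_original, df_augmented, r):
    """Check |D_augmented| <= r * |D_original|."""
    return len(df_augmented) <= r * len(df_original)


def max_label_shift(df_original, df_final, label_col="label"):
    """Largest absolute change in class proportion between original and final data."""
    p_orig = df_original[label_col].value_counts(normalize=True)
    p_final = df_final[label_col].value_counts(normalize=True)
    # Classes missing on one side count as proportion 0.
    diff = p_final.sub(p_orig, fill_value=0.0).abs()
    return float(diff.max())
```

For example, with original labels `[0, 0, 1, 1]` and augmented labels `[0, 0]`, class 0 rises from 1/2 to 2/3 of the data, so the shift is 1/6. The acceptable threshold depends on the task and is left to the user.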

Practical Guide

Since this is a user-defined pattern, here is the recommended approach:

Apply transformation functions (TFs) to get the augmented DataFrame
Concatenate it with the original DataFrame using pd.concat
Shuffle the combined dataset
Optionally deduplicate
Verify the label distribution against the original data
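The steps above can be sketched in plain pandas, starting from an augmented DataFrame that has already been produced. The function name `combine_with_augmented`, the helper column `_is_aug`, and the `label` column name are assumptions for illustration. In this sketch deduplication runs before the shuffle so that `keep="first"` always favors original rows; that ordering is a design choice of the example, not a requirement of the pattern.

```python
import pandas as pd


def combine_with_augmented(df_original, df_augmented, label_col="label", seed=0):
    """Merge original and augmented rows into one shuffled training DataFrame."""
    cols = list(df_original.columns)
    # Tag each row with its source so that only augmented duplicates are dropped.
    combined = pd.concat(
        [df_original.assign(_is_aug=False), df_augmented[cols].assign(_is_aug=True)],
        ignore_index=True,
    )
    # Originals come first, so keep="first" flags later augmented copies.
    is_dup = combined.duplicated(subset=cols, keep="first")
    combined = combined[~(is_dup & combined["_is_aug"])]
    # Shuffle reproducibly, then discard the helper column.
    combined = combined.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    combined = combined.drop(columns="_is_aug")
    # Class proportions, for comparison with the original distribution.
    label_share = combined[label_col].value_counts(normalize=True).sort_index()
    return combined, label_share
```

The returned `label_share` can be compared with `df_original[label_col].value_counts(normalize=True)` to spot class imbalance introduced by augmentation. Ratio control is not enforced here and would need a separate check before combining.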

Related Pages

Implemented By

Implementation:Snorkel_team_Snorkel_Augmented_Data_Combination_Pattern
