Principle:Snorkel team Snorkel Augmented Data Combination
| Knowledge Sources | |
|---|---|
| Domains | Data_Augmentation, Data_Pipeline |
| Last Updated | 2026-02-14 20:00 GMT |
Overview
A validation and combination step that merges augmented data with original data and verifies data integrity before downstream training.
Description
Augmented Data Combination is the final step in the augmentation pipeline where original and augmented DataFrames are merged into a single training dataset. This is a user-defined pattern rather than a library API.
Key considerations:
- Deduplication: Removing duplicate augmented examples
- Balance checking: Ensuring augmented data does not over-represent certain classes
- Integrity validation: Verifying that augmented examples maintain correct labels and valid features
- Ratio control: Limiting the ratio of augmented to original data
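Two of these considerations, integrity validation and ratio control, can be sketched with plain pandas. The helper names `validate_augmented` and `cap_augmented`, the `label_col` and `valid_labels` parameters, and the choice to downsample at random are all assumptions made for illustration; none of them is part of the Snorkel API.

```python
import pandas as pd


def validate_augmented(df_aug: pd.DataFrame, label_col: str, valid_labels: set) -> pd.DataFrame:
    """Keep only augmented rows that have no missing values and a known label.

    Hypothetical helper: "valid" here means non-null features plus a label
    drawn from `valid_labels`. Real pipelines may need stricter checks.
    """
    has_values = df_aug.notna().all(axis=1)
    has_valid_label = df_aug[label_col].isin(valid_labels)
    return df_aug[has_values & has_valid_label]


def cap_augmented(df_aug: pd.DataFrame, n_orig: int, max_ratio: float, seed: int = 0) -> pd.DataFrame:
    """Randomly downsample so that len(df_aug) <= max_ratio * n_orig."""
    max_rows = int(max_ratio * n_orig)
    if len(df_aug) <= max_rows:
        return df_aug
    return df_aug.sample(n=max_rows, random_state=seed)
```

Random downsampling is only one way to enforce the cap. If class balance matters more than simplicity, sampling per class may be a better fit.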
Usage
Use this principle after applying transformation functions. Combine original and augmented data, perform validation checks, and prepare the final training dataset.
Theoretical Basis
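The combined dataset: $D_{\text{combined}} = D_{\text{orig}} \cup D_{\text{aug}}$
Quality constraints:
- $|D_{\text{aug}}| / |D_{\text{orig}}| \le r_{\max}$ for a user-chosen bound $r_{\max}$ (ratio bound)
- Label distribution of $D_{\text{combined}}$ should match that of $D_{\text{orig}}$ (balance preservation)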
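Both constraints can be checked numerically. The sketch below measures the ratio of augmented to original rows and the gap between the original and combined label distributions. Total variation distance is used here as one possible measure of "match"; it is an assumption and the source does not prescribe it. The function names and the `label_col` parameter are likewise hypothetical.

```python
import pandas as pd


def augmentation_ratio(df_orig: pd.DataFrame, df_aug: pd.DataFrame) -> float:
    """Number of augmented rows per original row."""
    return len(df_aug) / len(df_orig)


def label_distribution_gap(df_orig: pd.DataFrame, df_combined: pd.DataFrame, label_col: str) -> float:
    """Total variation distance between the two label distributions.

    0.0 means identical distributions; 1.0 means no overlap at all.
    """
    p = df_orig[label_col].value_counts(normalize=True)
    q = df_combined[label_col].value_counts(normalize=True)
    # A label that is missing on one side is treated as frequency 0.
    diff = p.sub(q, fill_value=0.0)
    return 0.5 * float(diff.abs().sum())
```

A pipeline might compare both numbers against thresholds chosen for the task, for example rejecting the combined dataset when the ratio exceeds the bound or the gap exceeds a small tolerance.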
Practical Guide
Since this is a user-defined pattern, here is the recommended approach:
- Apply the transformation functions (TFs) to obtain the augmented DataFrame
- Concatenate it with the original DataFrame using pd.concat
- Shuffle the combined dataset
- Optionally deduplicate
- Verify label distribution
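The steps above, starting from an already augmented DataFrame, might look like the following minimal sketch. The function name `combine_datasets` and its parameters are hypothetical. Deduplication is done before shuffling here so that `drop_duplicates` (which keeps the first occurrence) prefers the original row over an identical augmented copy; that ordering is a design choice, not a requirement.

```python
import pandas as pd


def combine_datasets(
    df_orig: pd.DataFrame,
    df_aug: pd.DataFrame,
    label_col: str,
    dedupe: bool = True,
    seed: int = 0,
):
    """Merge original and augmented rows into one shuffled training set.

    Returns the combined DataFrame and its normalized label distribution.
    """
    # Originals go first so they survive deduplication.
    combined = pd.concat([df_orig, df_aug], ignore_index=True)

    if dedupe:
        # Removes exact duplicate rows, including any within the original data.
        combined = combined.drop_duplicates()

    # Shuffle all rows reproducibly and rebuild a clean 0..n-1 index.
    combined = combined.sample(frac=1.0, random_state=seed).reset_index(drop=True)

    # Label distribution for the final verification step.
    distribution = combined[label_col].value_counts(normalize=True)
    return combined, distribution
```

A small usage example with made-up rows:

```python
df_orig = pd.DataFrame({"text": ["a", "b"], "label": [0, 1]})
df_aug = pd.DataFrame({"text": ["a", "c"], "label": [0, 1]})
combined, distribution = combine_datasets(df_orig, df_aug, label_col="label")
print(len(combined))   # 3, because the duplicate ("a", 0) row is dropped
print(distribution)
```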