Principle:Snorkel team Snorkel Augmented Data Combination

From Leeroopedia
Knowledge Sources
Domains Data_Augmentation, Data_Pipeline
Last Updated 2026-02-14 20:00 GMT

Overview

A validation and combination step that merges augmented data with original data and verifies data integrity before downstream training.

Description

Augmented Data Combination is the final step in the augmentation pipeline where original and augmented DataFrames are merged into a single training dataset. This is a user-defined pattern rather than a library API.

Key considerations:

  • Deduplication: Removing duplicate augmented examples
  • Balance checking: Ensuring augmented data does not over-represent certain classes
  • Integrity validation: Verifying that augmented examples maintain correct labels and valid features
  • Ratio control: Limiting the ratio of augmented to original data
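Three of these considerations (integrity validation, deduplication, ratio control) can be sketched as a single filter applied to the augmented DataFrame before it is combined. This is a minimal sketch rather than a Snorkel API: the function name `clean_augmented`, the `label` column name, and the `max_ratio` and `seed` parameters are all assumptions made for illustration, and it assumes both DataFrames share the same columns in the same order.

```python
import pandas as pd


def clean_augmented(df_orig, df_aug, label_col="label", max_ratio=1.0, seed=0):
    """Filter augmented rows before combining (hypothetical helper)."""
    # Integrity: keep rows with no missing values and a label seen in the originals.
    known_labels = set(df_orig[label_col].unique())
    valid = df_aug[label_col].isin(known_labels) & df_aug.notna().all(axis=1)
    df_aug = df_aug[valid]

    # Deduplication: drop repeats within the augmented set,
    # then drop rows that are exact copies of an original row.
    df_aug = df_aug.drop_duplicates()
    orig_rows = set(df_orig.itertuples(index=False, name=None))
    is_new = [row not in orig_rows
              for row in df_aug.itertuples(index=False, name=None)]
    df_aug = df_aug[is_new]

    # Ratio control: cap augmented rows at max_ratio * number of original rows.
    max_rows = int(max_ratio * len(df_orig))
    if len(df_aug) > max_rows:
        df_aug = df_aug.sample(n=max_rows, random_state=seed)
    return df_aug
```

Random sampling is only one way to enforce the cap; sampling per class instead would also help with the balance consideration.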

Usage

Use this principle after applying transformation functions. Combine original and augmented data, perform validation checks, and prepare the final training dataset.

Theoretical Basis

The combined dataset: $\mathcal{D}_{\text{final}} = \mathcal{D}_{\text{original}} \cup \mathcal{D}_{\text{augmented}}$

Quality constraints:

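  • $|\mathcal{D}_{\text{augmented}}| \le r \cdot |\mathcal{D}_{\text{original}}|$ for a chosen maximum ratio $r$ (ratio bound)
  • Label distribution of $\mathcal{D}_{\text{final}}$ should match that of $\mathcal{D}_{\text{original}}$ (balance preservation)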
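Both constraints can be checked numerically. The sketch below is one possible reading, not a prescribed method: it treats "should match" as "the largest per-class change in label proportion stays within a tolerance". The function name `check_constraints`, the `label` column name, and the `r` and `tol` defaults are assumptions for illustration.

```python
import pandas as pd


def check_constraints(df_orig, df_aug, label_col="label", r=1.0, tol=0.05):
    """Return (ratio_ok, balance_ok, max_shift) for the combined dataset."""
    # Ratio bound: the augmented set holds at most r times as many rows.
    ratio_ok = len(df_aug) <= r * len(df_orig)

    # Balance preservation: compare class proportions before and after combining.
    df_final = pd.concat([df_orig, df_aug], ignore_index=True)
    p_orig = df_orig[label_col].value_counts(normalize=True)
    p_final = df_final[label_col].value_counts(normalize=True)
    # fill_value=0.0 covers labels that appear in only one of the two.
    max_shift = float(p_final.sub(p_orig, fill_value=0.0).abs().max())

    return bool(ratio_ok), max_shift <= tol, max_shift
```

An appropriate tolerance depends on the task; a stricter alternative would be a statistical test on the two label distributions.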

Practical Guide

Since this is a user-defined pattern, here is the recommended approach:

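  1. Apply the transformation functions (TFs) to the original DataFrame to get the augmented DataFrame
  2. Concatenate the augmented DataFrame with the original using pd.concat
  3. Shuffle the combined dataset
  4. Optionally deduplicate
  5. Verify that the label distribution still matches the original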
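The five steps can be sketched end to end with pandas. Step 1 is replaced here by a hypothetical stand-in, `fake_augment`, which simply uppercases the text; in a real pipeline the augmented DataFrame would come from applying the transformation functions. The `text` and `label` column names, the `combine` helper, and the toy data are likewise assumptions for illustration.

```python
import pandas as pd


def fake_augment(df):
    """Hypothetical stand-in for step 1: uppercase the text, keep the label."""
    out = df.copy()
    out["text"] = out["text"].str.upper()
    return out


def combine(df_orig, df_aug, seed=0, dedup=True):
    # Step 2: concatenate original and augmented rows.
    df_final = pd.concat([df_orig, df_aug], ignore_index=True)
    # Step 3: shuffle with a fixed seed for reproducibility.
    df_final = df_final.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    # Step 4 (optional): drop exact duplicate rows.
    if dedup:
        df_final = df_final.drop_duplicates().reset_index(drop=True)
    return df_final


df_orig = pd.DataFrame(
    {"text": ["good movie", "bad movie", "OK"], "label": [1, 0, 1]}
)
df_aug = fake_augment(df_orig)   # step 1 (stand-in)
df_final = combine(df_orig, df_aug)

# Step 5: compare label proportions before and after combining.
print(df_orig["label"].value_counts(normalize=True))
print(df_final["label"].value_counts(normalize=True))
```

In this toy data the augmented copy of "OK" is identical to the original row, so deduplication removes one row and shifts the label proportions slightly, which is the kind of side effect step 5 is meant to surface.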

Related Pages

Implemented By
