Implementation:Snorkel team Snorkel Augmented Data Combination Pattern

Knowledge Sources	Snorkel
Domains	Data_Augmentation, Data_Pipeline
Last Updated	2026-02-14 20:00 GMT

Overview

User-defined pattern for combining original and augmented DataFrames after transformation function application.

Description

This is a Pattern Doc rather than an API Doc. There is no Snorkel library function for this step; users implement the combination logic using standard pandas operations. The pattern involves concatenating DataFrames, shuffling, and optionally deduplicating.

Interface Specification

def combine_augmented_data(
    original_df: pd.DataFrame,
    augmented_df: pd.DataFrame,
    shuffle: bool = True,
    deduplicate: bool = False,
    max_ratio: Optional[float] = None,
) -> pd.DataFrame:
    """
    Combine original and augmented data.

    Args:
        original_df: Original training DataFrame.
        augmented_df: Augmented DataFrame from PandasTFApplier.
        shuffle: Whether to shuffle the combined result.
        deduplicate: Whether to remove duplicate rows.
        max_ratio: Maximum ratio of augmented to original rows.
    Returns:
        Combined DataFrame ready for training.
    """
    ...
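
The body of the interface above is left open. One possible implementation is sketched below, assuming `augmented_df` contains only transformed rows (that is, the policy used `keep_original=False`). The `seed` parameter is an assumption added here for reproducibility and is not part of the interface above. The ratio cap samples augmented rows at random; this is one design choice among several, not a Snorkel-defined behavior.

```python
from typing import Optional

import pandas as pd


def combine_augmented_data(
    original_df: pd.DataFrame,
    augmented_df: pd.DataFrame,
    shuffle: bool = True,
    deduplicate: bool = False,
    max_ratio: Optional[float] = None,
    seed: int = 42,  # assumption: extra parameter, not in the documented interface
) -> pd.DataFrame:
    """Combine original and augmented rows; augmented_df must not contain the originals."""
    n_orig = len(original_df)

    # Originals come first, so after ignore_index=True their row labels are 0..n_orig-1.
    combined = pd.concat([original_df, augmented_df], ignore_index=True)

    if deduplicate:
        # keep="first" retains an original over an identical augmented row.
        # Note: this also removes duplicates among the originals themselves.
        combined = combined.drop_duplicates(keep="first")

    # Row labels survive drop_duplicates, so they still tell the two groups apart.
    is_aug = combined.index >= n_orig
    orig, aug = combined[~is_aug], combined[is_aug]

    if max_ratio is not None:
        # Cap the number of augmented rows at max_ratio * (number of original rows).
        limit = int(max_ratio * len(orig))
        if len(aug) > limit:
            aug = aug.sample(n=limit, random_state=seed)

    combined = pd.concat([orig, aug], ignore_index=True)
    if shuffle:
        combined = combined.sample(frac=1.0, random_state=seed)
    return combined.reset_index(drop=True)
```

Sampling for the cap avoids favoring augmentations of whichever originals happen to appear first, at the cost of needing a seed for repeatable results.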

Code Reference

Source Location

Repository: N/A (user-defined pattern)
File: N/A

Import

from typing import Optional  # used by the max_ratio annotation

import pandas as pd

I/O Contract

Inputs

Name	Type	Required	Description
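original_df	pd.DataFrame	Yes	Original training data
augmented_df	pd.DataFrame	Yes	Augmented data from PandasTFApplier
shuffle	bool	No	Whether to shuffle the combined result (default: True)
deduplicate	bool	No	Whether to remove duplicate rows (default: False)
max_ratio	Optional[float]	No	Maximum ratio of augmented to original rows (default: None, no limit)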

Outputs

Name	Type	Description
combined_df	pd.DataFrame	Merged DataFrame ready for downstream training

Usage Examples

Basic Combination

import pandas as pd

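# After augmentation: `applier` is a PandasTFApplier configured earlier,
# `df_train` is the original training DataFrame
df_augmented = applier.apply(df_train)

# If the policy uses keep_original=True, df_augmented already contains the
# original rows, so no concat is needed; only shuffle and reset the index
df_combined = df_augmented.sample(frac=1.0, random_state=42).reset_index(drop=True)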

print(f"Original: {len(df_train)}, Combined: {len(df_combined)}")
print(f"Label distribution:\n{df_combined['label'].value_counts(normalize=True)}")
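
The printed distribution is easier to interpret next to the original one. The helper below is a minimal sketch of that comparison; the name `label_shift` is hypothetical, and the `label` column name is taken from the example above.

```python
import pandas as pd


def label_shift(
    original_df: pd.DataFrame,
    combined_df: pd.DataFrame,
    label_col: str = "label",
) -> pd.DataFrame:
    """Per-class label share before and after combination, plus the difference."""
    before = original_df[label_col].value_counts(normalize=True)
    after = combined_df[label_col].value_counts(normalize=True)
    # Building from a dict of Series aligns on the union of class labels;
    # a class missing on one side gets a share of 0.
    report = pd.DataFrame({"original": before, "combined": after}).fillna(0.0)
    report["delta"] = report["combined"] - report["original"]
    return report
```

A large `delta` for a class suggests the transformation functions fire unevenly across labels, which may call for a ratio limit or per-class sampling.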

With Deduplication and Ratio Control

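# If keep_original=False in the policy, df_augmented holds only transformed rows,
# so concatenate manually (originals first)
df_combined = pd.concat([df_train, df_augmented], ignore_index=True)

# Remove rows with duplicate text; the default keep="first" retains the originals
df_combined = df_combined.drop_duplicates(subset=["text"])

# Limit augmented ratio. head() keeps the originals because they come first,
# then the earliest augmented rows up to the cap
max_augmented = len(df_train) * 2  # 2x ratio
if len(df_combined) > max_augmented + len(df_train):
    df_combined = df_combined.head(max_augmented + len(df_train))

# Shuffle (fixed seed for reproducibility, as in the basic example)
df_combined = df_combined.sample(frac=1.0, random_state=42).reset_index(drop=True)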

Related Pages

Implements Principle

Principle:Snorkel_team_Snorkel_Augmented_Data_Combination
