Implementation:Snorkel team Snorkel Augmented Data Combination Pattern
| Knowledge Sources | |
|---|---|
| Domains | Data_Augmentation, Data_Pipeline |
| Last Updated | 2026-02-14 20:00 GMT |
Overview
User-defined pattern for combining original and augmented DataFrames after transformation function application.
Description
This is a Pattern Doc rather than an API Doc. There is no Snorkel library function for this step; users implement the combination logic with standard pandas operations. The pattern consists of concatenating the original and augmented DataFrames, optionally deduplicating and capping the ratio of augmented to original rows, and shuffling the result.
Interface Specification
def combine_augmented_data(
    original_df: pd.DataFrame,
    augmented_df: pd.DataFrame,
    shuffle: bool = True,
    deduplicate: bool = False,
    max_ratio: Optional[float] = None,
) -> pd.DataFrame:
    """
    Combine original and augmented data.

    Args:
        original_df: Original training DataFrame.
        augmented_df: Augmented DataFrame from PandasTFApplier.
        shuffle: Whether to shuffle the combined result.
        deduplicate: Whether to remove duplicate rows.
        max_ratio: Maximum ratio of augmented to original rows.

    Returns:
        Combined DataFrame ready for training.
    """
    ...
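The interface above leaves the body open. One possible way to fill it in with plain pandas is sketched below. This is not Snorkel code. It assumes that `augmented_df` holds only augmented rows (that is, `keep_original=False` in the policy), that "duplicate" means identical in every column, and that an over-limit augmented set is reduced by random sampling. The `seed` parameter is a hypothetical addition that is not part of the interface above.

```python
from typing import Optional

import pandas as pd


def combine_augmented_data(
    original_df: pd.DataFrame,
    augmented_df: pd.DataFrame,
    shuffle: bool = True,
    deduplicate: bool = False,
    max_ratio: Optional[float] = None,
    seed: int = 42,  # assumption: extra parameter, not in the interface above
) -> pd.DataFrame:
    """Illustrative sketch: combine original and augmented-only rows."""
    orig, aug = original_df, augmented_df

    if deduplicate:
        # Flag rows that repeat an earlier row in every column. Originals are
        # stacked first, so an augmented copy of an original is the one dropped.
        n_orig = len(orig)
        dup = pd.concat([orig, aug], ignore_index=True).duplicated(keep="first").to_numpy()
        orig = orig[~dup[:n_orig]]
        aug = aug[~dup[n_orig:]]

    if max_ratio is not None:
        # Cap the augmented rows at max_ratio times the original row count.
        limit = int(max_ratio * len(orig))
        if len(aug) > limit:
            aug = aug.sample(n=limit, random_state=seed)

    combined = pd.concat([orig, aug], ignore_index=True)
    if shuffle:
        combined = combined.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    return combined


# Toy data for illustration only.
df_train = pd.DataFrame({"text": ["a b", "c d"], "label": [0, 1]})
df_aug = pd.DataFrame(
    {"text": ["a b", "b a", "d c", "c c d"], "label": [0, 0, 1, 1]}
)
result = combine_augmented_data(df_train, df_aug, deduplicate=True, max_ratio=1.0)
print(result)
```

Sampling the augmented rows, rather than truncating the tail, avoids favoring whichever augmented rows happen to come first. If the applier already returned the originals, pass only the augmented rows or enable `deduplicate` so the originals are not counted twice.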
Code Reference
Source Location
- Repository: N/A (user-defined pattern)
- File: N/A
Import
from typing import Optional

import pandas as pd
I/O Contract
Inputs
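| Name | Type | Required | Description |
|---|---|---|---|
| original_df | pd.DataFrame | Yes | Original training data |
| augmented_df | pd.DataFrame | Yes | Augmented data from PandasTFApplier |
| shuffle | bool | No | Whether to shuffle the combined result (default: True) |
| deduplicate | bool | No | Whether to remove duplicate rows (default: False) |
| max_ratio | Optional[float] | No | Maximum ratio of augmented to original rows (default: None) |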
Outputs
| Name | Type | Description |
|---|---|---|
| combined_df | pd.DataFrame | Merged DataFrame ready for downstream training |
Usage Examples
Basic Combination
import pandas as pd
# After augmentation (applier is a previously constructed PandasTFApplier)
df_augmented = applier.apply(df_train)
# Combine (augmented already includes originals if keep_original=True)
df_combined = df_augmented.sample(frac=1.0, random_state=42).reset_index(drop=True)
print(f"Original: {len(df_train)}, Combined: {len(df_combined)}")
print(f"Label distribution:\n{df_combined['label'].value_counts(normalize=True)}")
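Whether to concatenate depends on whether the applier output already contains the originals. When that is not known from the policy configuration, it can be checked directly. The sketch below is one way to do so with pandas only. The helper names `includes_originals` and `combine` are hypothetical, and the check compares rows on all shared columns, so a transformation function that returns a row unchanged would also count as an original.

```python
import pandas as pd


def includes_originals(original_df: pd.DataFrame, augmented_df: pd.DataFrame) -> bool:
    """Return True if every distinct original row also appears in augmented_df."""
    originals = original_df.drop_duplicates()
    # A left merge on all shared columns marks each original row as
    # "both" (found in augmented_df) or "left_only" (missing from it).
    marked = originals.merge(
        augmented_df.drop_duplicates(), how="left", indicator=True
    )
    return bool((marked["_merge"] == "both").all())


def combine(original_df: pd.DataFrame, augmented_df: pd.DataFrame) -> pd.DataFrame:
    """Concatenate only when the originals are missing, then shuffle."""
    if includes_originals(original_df, augmented_df):
        combined = augmented_df
    else:
        combined = pd.concat([original_df, augmented_df], ignore_index=True)
    return combined.sample(frac=1.0, random_state=42).reset_index(drop=True)


# Toy data for illustration only.
df_train = pd.DataFrame({"text": ["a b", "c d"], "label": [0, 1]})
with_orig = pd.DataFrame(
    {"text": ["a b", "b a", "c d", "d c"], "label": [0, 0, 1, 1]}
)
without_orig = pd.DataFrame({"text": ["b a", "d c"], "label": [0, 1]})

combined_a = combine(df_train, with_orig)     # originals already present
combined_b = combine(df_train, without_orig)  # originals are prepended
```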
With Deduplication and Ratio Control
# If keep_original=False in the policy, concatenate manually (originals first)
df_combined = pd.concat([df_train, df_augmented], ignore_index=True)
# Remove rows whose text repeats an earlier row (the first occurrence is kept)
df_combined = df_combined.drop_duplicates(subset=["text"]).reset_index(drop=True)
# Limit the augmented ratio; originals come first, so truncation only drops augmented rows
max_augmented = len(df_train) * 2  # 2x ratio
if len(df_combined) > max_augmented + len(df_train):
    df_combined = df_combined.head(max_augmented + len(df_train))
# Shuffle (fixed seed for reproducibility)
df_combined = df_combined.sample(frac=1.0, random_state=42).reset_index(drop=True)
Related Pages
Implements Principle