Implementation:Snorkel team Snorkel Augmented Data Combination Pattern

From Leeroopedia
Revision as of 13:51, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Snorkel_team_Snorkel_Augmented_Data_Combination_Pattern.md)
Knowledge Sources
Domains: Data_Augmentation, Data_Pipeline
Last Updated: 2026-02-14 20:00 GMT

Overview

A user-defined pattern for combining the original and augmented DataFrames after transformation functions have been applied.

Description

This is a Pattern Doc rather than an API Doc: Snorkel provides no library function for this step, so users implement the combination logic with standard pandas operations. The pattern concatenates the original and augmented DataFrames, shuffles the result, and optionally deduplicates it and caps the ratio of augmented to original rows.

Interface Specification

from typing import Optional

import pandas as pd

def combine_augmented_data(
    original_df: pd.DataFrame,
    augmented_df: pd.DataFrame,
    shuffle: bool = True,
    deduplicate: bool = False,
    max_ratio: Optional[float] = None,
) -> pd.DataFrame:
    """
    Combine original and augmented data.

    Args:
        original_df: Original training DataFrame.
        augmented_df: Augmented DataFrame from PandasTFApplier.
        shuffle: Whether to shuffle the combined result.
        deduplicate: Whether to remove duplicate rows.
        max_ratio: Maximum ratio of augmented to original rows.
    Returns:
        Combined DataFrame ready for training.
    """
    ...
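
The interface above leaves the body unspecified. A minimal sketch of one way to fill it in with plain pandas follows. It assumes that augmented_df contains only transformed rows (a policy with keep_original=False), that "duplicate" means an exact match across all columns, and that max_ratio=None means no cap. The random_state parameter is a hypothetical addition for reproducibility and is not part of the interface above.

```python
from typing import Optional

import pandas as pd


def combine_augmented_data(
    original_df: pd.DataFrame,
    augmented_df: pd.DataFrame,
    shuffle: bool = True,
    deduplicate: bool = False,
    max_ratio: Optional[float] = None,
    random_state: Optional[int] = None,  # hypothetical extra, not in the interface
) -> pd.DataFrame:
    """Sketch: combine original and augmented rows with standard pandas calls."""
    augmented = augmented_df

    if max_ratio is not None:
        # Cap augmented rows at max_ratio * len(original_df). Sampling at random
        # avoids always keeping the same leading rows of augmented_df.
        max_augmented = int(max_ratio * len(original_df))
        if len(augmented) > max_augmented:
            augmented = augmented.sample(n=max_augmented, random_state=random_state)

    # Originals go first so drop_duplicates (keep="first") prefers them.
    combined = pd.concat([original_df, augmented], ignore_index=True)

    if deduplicate:
        # Assumption: exact-row duplicates across all columns.
        combined = combined.drop_duplicates()

    if shuffle:
        combined = combined.sample(frac=1.0, random_state=random_state)

    return combined.reset_index(drop=True)
```

Applying the cap before deduplication means the final augmented count can fall below the cap when sampled rows duplicate originals; swapping the two steps is an equally reasonable choice if an exact ratio matters more.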

Code Reference

Source Location

  • Repository: N/A (user-defined pattern)
  • File: N/A

Import

from typing import Optional

import pandas as pd

I/O Contract

Inputs

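Name          Type             Required  Description
original_df   pd.DataFrame     Yes       Original training data
augmented_df  pd.DataFrame     Yes       Augmented data from PandasTFApplier
shuffle       bool             No        Whether to shuffle the combined result (default: True)
deduplicate   bool             No        Whether to remove duplicate rows (default: False)
max_ratio     Optional[float]  No        Maximum ratio of augmented to original rows (default: None)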

Outputs

Name         Type          Description
combined_df  pd.DataFrame  Merged DataFrame ready for downstream training

Usage Examples

Basic Combination

import pandas as pd

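# After augmentation (applier is a PandasTFApplier that has already been built)
df_augmented = applier.apply(df_train)

# With keep_original=True in the policy, df_augmented already contains the
# original rows, so no concatenation is needed: shuffle and reset the index.
df_combined = df_augmented.sample(frac=1.0, random_state=42).reset_index(drop=True)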

print(f"Original: {len(df_train)}, Combined: {len(df_combined)}")
print(f"Label distribution:\n{df_combined['label'].value_counts(normalize=True)}")
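
The label distribution printed above is easier to judge next to the original one. The helper below is a small sketch of such a comparison; the function name label_distribution_shift and the default column name "label" are assumptions made for illustration, not part of Snorkel or of this pattern.

```python
import pandas as pd


def label_distribution_shift(
    original_df: pd.DataFrame,
    combined_df: pd.DataFrame,
    label_col: str = "label",
) -> pd.DataFrame:
    """Compare per-label proportions before and after augmentation."""
    before = original_df[label_col].value_counts(normalize=True)
    after = combined_df[label_col].value_counts(normalize=True)
    # Align on the union of labels; a label missing on one side becomes 0.0.
    report = pd.concat(
        [before, after], axis=1, keys=["original", "combined"]
    ).fillna(0.0)
    # Positive delta: the label is over-represented after augmentation.
    report["delta"] = report["combined"] - report["original"]
    return report.sort_index()
```

Calling label_distribution_shift(df_train, df_combined) returns one row per label; a large delta suggests the transformation functions fire unevenly across classes.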

With Deduplication and Ratio Control

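# If keep_original=False in the policy, concatenate manually.
# Originals go first so the steps below keep them in preference to augmented rows.
df_combined = pd.concat([df_train, df_augmented], ignore_index=True)

# Remove rows whose "text" value repeats (the first occurrence is kept)
df_combined = df_combined.drop_duplicates(subset=["text"])

# Limit augmented ratio: head() keeps the originals plus at most
# max_augmented augmented rows, and is a no-op when under the limit
max_augmented = len(df_train) * 2  # 2x ratio
df_combined = df_combined.head(len(df_train) + max_augmented)

# Shuffle (seeded for reproducibility) and reset the index
df_combined = df_combined.sample(frac=1.0, random_state=42).reset_index(drop=True)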

Related Pages

Implements Principle