
Implementation:Allenai Open instruct Contamination Search

From Leeroopedia


Knowledge Sources

Domains: Data_Quality, Evaluation
Last Updated: 2026-02-07 02:00 GMT

Overview

A concrete tool that searches Elasticsearch indices of training data to detect and quantify test-set contamination using exact, n-gram, and vector matching strategies.

Description

The search.py module is the analysis step of the decontamination pipeline. It queries Elasticsearch indices (built by index.py) using evaluation test sets to compute per-instance contamination scores. Three matching strategies are supported: (1) exact match using Elasticsearch match_phrase, (2) n-gram matching using spaCy tokenization with coverage scoring, and (3) semantic vector matching using transformer embeddings with KNN search. Results are output as per-instance JSONL files and a summary TSV contamination report. When no evaluation dataset is specified, it defaults to the full Tulu 3 evaluation suite (20 benchmarks). A decontamination mode can produce filtered training datasets with contaminated instances removed.
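
The n-gram coverage idea can be sketched minimally as follows. This is an illustration, not the module's implementation: search.py builds an n-gram to token-index mapping with spaCy, whereas this sketch uses plain whitespace tokenization, and `coverage_score` is a hypothetical helper name.

```python
def ngrams(tokens, n):
    """Return the list of consecutive n-gram tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def coverage_score(query_text, train_text, n=13):
    """Fraction of the query's n-grams that also appear in the training text.

    Hypothetical helper for illustration only; the real module tokenizes
    with spaCy rather than str.split().
    """
    q_ngrams = ngrams(query_text.split(), n)
    if not q_ngrams:
        return 0.0
    t_ngrams = set(ngrams(train_text.split(), n))
    hits = sum(1 for g in q_ngrams if g in t_ngrams)
    return hits / len(q_ngrams)
```

A test instance whose n-grams are fully covered by some training document scores 1.0, which is why a threshold on this score yields a binary contamination decision.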

Usage

Use this module when you need to check a training dataset for contamination against evaluation benchmarks. Run it after indexing training data with index.py. It is essential for maintaining evaluation integrity in instruction tuning research.

Code Reference

Source Location

Signature

def prepare_embedding_model(model_name: str) -> tuple:
    """Load transformer model and tokenizer for vector matching."""

def get_ngram_mapping(string: str, n: int) -> dict:
    """Create n-gram to token index mapping using spaCy."""

def exact_match(es, index_name, query_dataset, fields, search_size) -> tuple:
    """Search for exact phrase matches, return (scores, data, train_indices)."""

def ngram_match(es, index_name, query_dataset, fields, ngram_size, search_size) -> tuple:
    """Search for n-gram overlap, return (scores, data, max_train_scores)."""

def vector_match(es, index_name, query_dataset, fields, model, tokenizer,
                 max_batch_tokens, search_size) -> tuple:
    """Search for semantic similarity via KNN, return (scores, data, max_train_scores)."""

def main() -> None:
    """CLI entry point for contamination search."""

Import

# CLI script, run directly:
# python decontamination/search.py --index_name <name> --match_type exact|ngram|vector

I/O Contract

Inputs

Name | Type | Required | Description
index_name | str | Yes | Elasticsearch index to search against
match_type | str | Yes | Matching strategy: exact, ngram, or vector
query_dataset | Dataset | No | Evaluation dataset (defaults to Tulu 3 suite)
match_threshold | float | No | Score threshold for binary contamination
decontaminate | bool | No | Whether to output filtered datasets

Outputs

Name | Type | Description
JSONL files | File | Per-instance contamination scores and matching details
TSV report | File | Summary contamination rates per evaluation benchmark
Filtered parquets | File | Decontaminated training datasets (if decontaminate mode)
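
The summary report aggregates per-instance scores into a contamination rate per benchmark. A rough sketch of that aggregation is below; the column names are illustrative and may not match the module's actual TSV schema.

```python
def contamination_rate(scores, threshold):
    """Fraction of instances whose match score meets the threshold."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)

def summary_rows(per_benchmark_scores, threshold=0.7):
    """Build TSV rows: benchmark, instance count, contamination rate.

    Illustrative layout only; the real report's columns may differ.
    """
    rows = ["benchmark\tn\tcontamination_rate"]
    for name, scores in sorted(per_benchmark_scores.items()):
        rate = contamination_rate(scores, threshold)
        rows.append(f"{name}\t{len(scores)}\t{rate:.4f}")
    return "\n".join(rows)
```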

Usage Examples

Exact Match Search

# Run exact match contamination search against the full Tulu 3 eval suite
# python decontamination/search.py \
#   --index_name tulu3_training_data \
#   --match_type exact \
#   --output_dir ./contamination_results

N-gram Match with Decontamination

# Run n-gram matching and produce filtered datasets
# python decontamination/search.py \
#   --index_name tulu3_training_data \
#   --match_type ngram \
#   --ngram_size 13 \
#   --match_threshold 0.7 \
#   --decontaminate \
#   --output_dir ./decontaminated_output
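
For vector matching, each test instance is embedded with the transformer model and sent as a KNN search. A hedged sketch of the request shape follows; the vector field name "embedding" and the k/num_candidates defaults are assumptions, not the module's actual configuration.

```python
def build_knn_query(query_vector, field="embedding", k=10, num_candidates=100):
    """Build an Elasticsearch KNN search body.

    Sketch under assumed defaults; search.py derives query_vector from a
    transformer model and batches queries by max_batch_tokens.
    """
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        }
    }
```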
