Implementation: Allenai Open Instruct Contamination Search
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Evaluation |
| Last Updated | 2026-02-07 02:00 GMT |
Overview
Concrete tool for searching Elasticsearch indices of training data to detect and quantify test set contamination using exact, n-gram, and vector matching strategies.
Description
The search.py module is the analysis step of the decontamination pipeline. It queries Elasticsearch indices (built by index.py) using evaluation test sets to compute per-instance contamination scores. Three matching strategies are supported: (1) exact match using Elasticsearch match_phrase, (2) n-gram matching using spaCy tokenization with coverage scoring, and (3) semantic vector matching using transformer embeddings with KNN search. Results are output as per-instance JSONL files and a summary TSV contamination report. When no evaluation dataset is specified, it defaults to the full Tulu 3 evaluation suite (20 benchmarks). A decontamination mode can produce filtered training datasets with contaminated instances removed.
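For the vector strategy, the idea is to embed each evaluation instance with a transformer encoder and retrieve the nearest training documents via KNN. Below is a minimal sketch of that flow, assuming a mean-pooled sentence-transformer encoder and an embedding dense-vector field written by index.py; the model name, pooling choice, and field names are illustrative assumptions, not details taken from search.py.
# Hedged sketch of semantic vector matching: embed a test instance with a
# transformer encoder, then run an Elasticsearch KNN query against the index.
# Model name, pooling, and the "embedding"/"text" field names are assumptions.
import torch
from elasticsearch import Elasticsearch
from transformers import AutoModel, AutoTokenizer

es = Elasticsearch("http://localhost:9200")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def embed(text: str) -> list[float]:
    """Mean-pool the last hidden state over non-padding tokens."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)    # (1, seq_len, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.squeeze(0).tolist()

query_vector = embed("What is the capital of France?")
resp = es.search(
    index="tulu3_training_data",
    knn={"field": "embedding", "query_vector": query_vector,
         "k": 10, "num_candidates": 100},
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("text", "")[:80])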
Usage
Use this module when you need to check a training dataset for contamination against evaluation benchmarks. Run it after indexing training data with index.py. It is essential for maintaining evaluation integrity in instruction tuning research.
Code Reference
Source Location
- Repository: Allenai_Open_instruct
- File: decontamination/search.py
- Lines: 1-374
Signature
def prepare_embedding_model(model_name: str) -> tuple:
    """Load transformer model and tokenizer for vector matching."""

def get_ngram_mapping(string: str, n: int) -> dict:
    """Create n-gram to token index mapping using spaCy."""

def exact_match(es, index_name, query_dataset, fields, search_size) -> tuple:
    """Search for exact phrase matches, return (scores, data, train_indices)."""

def ngram_match(es, index_name, query_dataset, fields, ngram_size, search_size) -> tuple:
    """Search for n-gram overlap, return (scores, data, max_train_scores)."""

def vector_match(es, index_name, query_dataset, fields, model, tokenizer,
                 max_batch_tokens, search_size) -> tuple:
    """Search for semantic similarity via KNN, return (scores, data, max_train_scores)."""

def main() -> None:
    """CLI entry point for contamination search."""
Import
# CLI script, run directly:
# python decontamination/search.py --index_name <name> --match_type exact|ngram|vector
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| index_name | str | Yes | Elasticsearch index to search against |
| match_type | str | Yes | Matching strategy: exact, ngram, or vector |
| query_dataset | Dataset | No | Evaluation dataset (defaults to Tulu 3 suite) |
| match_threshold | float | No | Score threshold for binary contamination |
| decontaminate | bool | No | Whether to output filtered datasets |
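A minimal sketch of how these inputs could surface as CLI flags, inferred from the usage examples further below; the defaults, choices, and required/optional split shown here are assumptions rather than the module's actual argument parser.
# Hedged sketch of the CLI surface implied by the inputs table and usage
# examples; defaults and choices are assumptions, not read from search.py.
import argparse

parser = argparse.ArgumentParser(description="Contamination search over an ES index.")
parser.add_argument("--index_name", required=True, help="Elasticsearch index to search against")
parser.add_argument("--match_type", required=True, choices=["exact", "ngram", "vector"])
parser.add_argument("--ngram_size", type=int, default=13, help="n-gram size for ngram matching")
parser.add_argument("--match_threshold", type=float, default=None,
                    help="score threshold for flagging an instance as contaminated")
parser.add_argument("--decontaminate", action="store_true",
                    help="also write filtered training datasets")
parser.add_argument("--output_dir", default="./contamination_results")
args = parser.parse_args()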
Outputs
| Name | Type | Description |
|---|---|---|
| JSONL files | File | Per-instance contamination scores and matching details |
| TSV report | File | Summary contamination rates per evaluation benchmark |
| Filtered parquets | File | Decontaminated training datasets (when decontaminate mode is enabled) |
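For downstream analysis, a sketch of turning the per-instance JSONL scores into an aggregate contamination rate; the score field name and file layout used here are hypothetical and may not match the actual output schema.
# Hedged sketch of post-processing the per-instance JSONL output into a
# contamination rate. The "score" field name and file layout are hypothetical.
import json
from pathlib import Path

def contamination_rate(jsonl_path: str, threshold: float) -> float:
    """Fraction of evaluation instances whose match score meets the threshold."""
    scores = []
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            scores.append(record.get("score", 0.0))
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)

for path in Path("./contamination_results").glob("*.jsonl"):
    print(path.name, f"{contamination_rate(str(path), threshold=0.7):.2%}")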
Usage Examples
Exact Match Search
# Run exact match contamination search against the full Tulu 3 eval suite
# python decontamination/search.py \
# --index_name tulu3_training_data \
# --match_type exact \
# --output_dir ./contamination_results
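Conceptually, the exact strategy checks each evaluation instance with an Elasticsearch match_phrase query against the training index; a hedged sketch of such a query follows (the text field name is an assumption about how index.py stores documents).
# Hedged sketch of the kind of query the exact strategy issues: a
# match_phrase search for one evaluation prompt. The "text" field name and
# the result handling are assumptions for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
eval_prompt = "What is the capital of France?"

resp = es.search(
    index="tulu3_training_data",
    query={"match_phrase": {"text": eval_prompt}},
    size=10,
)
num_hits = resp["hits"]["total"]["value"]
print(f"{num_hits} training documents contain this prompt as an exact phrase")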
N-gram Match with Decontamination
# Run n-gram matching and produce filtered datasets
# python decontamination/search.py \
# --index_name tulu3_training_data \
# --match_type ngram \
# --ngram_size 13 \
# --match_threshold 0.7 \
# --decontaminate \
# --output_dir ./decontaminated_output
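In decontamination mode, flagged training instances are removed before the filtered datasets are written. A rough sketch of that filtering step with Hugging Face datasets is below; the dataset name, the set of flagged indices, and the output path are illustrative assumptions, not the module's actual bookkeeping.
# Hedged sketch of the decontamination step: drop training rows whose indices
# were flagged as contaminated, then write the filtered split to parquet.
# The dataset name and flagged indices below are placeholders.
from pathlib import Path
from datasets import load_dataset

contaminated_train_indices = {12, 873, 10492}  # hypothetical flagged rows

train = load_dataset("allenai/tulu-3-sft-mixture", split="train")
keep = [i for i in range(len(train)) if i not in contaminated_train_indices]
filtered = train.select(keep)

Path("./decontaminated_output").mkdir(parents=True, exist_ok=True)
filtered.to_parquet("./decontaminated_output/train.parquet")
print(f"kept {len(filtered)} of {len(train)} training instances")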