
Implementation:Allenai Open instruct Contamination Indexer

From Leeroopedia


Knowledge Sources
Domains Data_Quality, Evaluation
Last Updated 2026-02-07 02:00 GMT

Overview

A concrete tool for indexing HuggingFace training datasets into Elasticsearch, supporting both text-based and dense-vector indexing strategies for contamination detection.

Description

The index.py module is the first step of the decontamination pipeline. It reads datasets from HuggingFace Hub (or from a YAML dataset mixer config), extracts messages matching a configurable role filter (e.g., user messages only), and indexes them into Elasticsearch. For text indexing, it creates an index with a custom tulu_analyzer that uses regex-based tokenization splitting on whitespace and select punctuation (preserving math operators and code tokens). For vector indexing, it creates a dense_vector index with 4096-dimensional embeddings using dot_product similarity, encodes documents with an embedding model (default: NV-Embed-v2), and bulk-inserts normalized embeddings.
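To make the text-indexing step concrete, here is a minimal sketch of what an index body with a custom analyzer could look like. The regex pattern and filter list below are assumptions for illustration, not the exact tulu_analyzer definition from index.py:

```python
# Hypothetical sketch of an Elasticsearch index body with a custom analyzer.
# The tokenizer pattern is illustrative only: it splits on whitespace and a
# few punctuation marks while leaving math operators (+, -, *, /, =) and
# code tokens intact, in the spirit of the tulu_analyzer described above.
text_index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "tulu_tokenizer": {
                    "type": "pattern",          # regex-based tokenization
                    "pattern": r"[\s,;!?()]+",  # assumed split pattern
                },
            },
            "analyzer": {
                "tulu_analyzer": {
                    "type": "custom",
                    "tokenizer": "tulu_tokenizer",
                    "filter": ["lowercase"],    # assumed token filter
                },
            },
        },
    },
    "mappings": {
        "properties": {
            "text": {"type": "text", "analyzer": "tulu_analyzer"},
        },
    },
}

# A client would then create the index with something like:
# es.indices.create(index=index_name, body=text_index_body)
```

The point of a pattern tokenizer here is that the default standard analyzer would discard or split exactly the operator and code tokens that matter for detecting contaminated math and coding examples.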

Usage

Use this module before running search.py to build the Elasticsearch index of training data. It supports both text-based and semantic vector indexing depending on the contamination detection strategy needed.

Code Reference

Source Location

Signature

def create_text_index(es: Elasticsearch, index_name: str) -> None:
    """Create text index with custom tulu_analyzer."""

def create_vector_index(es: Elasticsearch, index_name: str) -> None:
    """Create dense_vector index with 4096 dims and dot_product similarity."""

def read_dataset(dataset_name: str, split: str, messages_field: str,
                 query_filter: str, query_field: str) -> list[dict]:
    """Load HF dataset and extract messages matching role filter."""

def index_dataset_text(data_to_index: list, es: Elasticsearch,
                       index_name: str, text_batch_size: int) -> None:
    """Bulk index text data with batching."""

def index_dataset_vectors(data_to_index: list, es: Elasticsearch,
                          index_name: str, model_name: str,
                          max_batch_tokens: int) -> None:
    """Encode text to embeddings and bulk index vectors."""

def main() -> None:
    """CLI entry point for dataset indexing."""
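The role-based message extraction that read_dataset performs can be sketched in plain Python. The helper name, return shape, and toy records below are assumptions; real rows come from the HuggingFace dataset:

```python
def extract_messages(rows, messages_field="messages", query_filter="role:user",
                     query_field="content"):
    """Pull out the text of every message whose role matches the filter.

    Mirrors the key:value role-filter behaviour described above; this is a
    sketch, not the module's actual implementation.
    """
    key, _, value = query_filter.partition(":")
    extracted = []
    for row in rows:
        for message in row.get(messages_field, []):
            if message.get(key) == value:
                extracted.append({"text": message[query_field]})
    return extracted

# Toy rows standing in for HF dataset records:
rows = [
    {"messages": [{"role": "user", "content": "What is 2+2?"},
                  {"role": "assistant", "content": "4"}]},
    {"messages": [{"role": "user", "content": "Define entropy."}]},
]
docs = extract_messages(rows)
# docs -> [{"text": "What is 2+2?"}, {"text": "Define entropy."}]
```

Filtering to user turns only is the typical choice for decontamination, since evaluation prompts overlap with user-side text rather than with model responses.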

Import

# CLI script, run directly:
# python decontamination/index.py --dataset_name <name> --index_name <name> --index_type text|vector

I/O Contract

Inputs

Name            Type  Required  Description
dataset_name    str   Yes       HuggingFace dataset name to index
index_name      str   Yes       Name for the Elasticsearch index
index_type      str   Yes       Type of index: text or vector
messages_field  str   No        Field name containing messages (default: messages)
query_filter    str   No        Role filter in key:value format (default: role:user)

Outputs

Name                 Type      Description
Elasticsearch index  ES index  Populated text or vector index for contamination search

Usage Examples

Text Indexing

# Index a HuggingFace dataset into Elasticsearch for text-based search
python decontamination/index.py \
  --dataset_name allenai/tulu-v3-sft-mixture \
  --index_name tulu3_training_text \
  --index_type text \
  --split train \
  --messages_field messages \
  --query_filter role:user
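Text indexing proceeds in batches (the text_batch_size parameter). A minimal sketch of the batching logic, with the action shape the standard elasticsearch.helpers.bulk convention expects; the helper names here are assumptions:

```python
def batched(items, batch_size):
    """Yield successive fixed-size chunks, as bulk indexing would consume them."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def to_bulk_actions(batch, index_name):
    """Shape one batch of documents into bulk-helper actions.

    Follows the generic {"_index": ..., "_source": ...} action format;
    the real module may add ids or extra fields.
    """
    return [{"_index": index_name, "_source": doc} for doc in batch]

docs = [{"text": f"doc {i}"} for i in range(5)]
batches = list(batched(docs, batch_size=2))
# 5 docs with batch_size=2 -> 3 batches of sizes 2, 2, 1
```

Batching keeps each bulk request bounded in size, which matters when the training mixture contains millions of messages.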

Vector Indexing

# Index using dense vector embeddings for semantic matching
python decontamination/index.py \
  --dataset_name allenai/tulu-v3-sft-mixture \
  --index_name tulu3_training_vectors \
  --index_type vector \
  --model_name nvidia/NV-Embed-v2
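Elasticsearch's dot_product similarity requires unit-length vectors, which is why the module normalizes embeddings before bulk insert. A pure-Python sketch of that normalization (the two-element vector is a toy stand-in for a 4096-dim embedding):

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit L2 norm so dot product equals cosine similarity.

    dot_product similarity in Elasticsearch rejects non-unit vectors, so
    embeddings must be normalized like this before indexing.
    """
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

embedding = [3.0, 4.0]          # toy stand-in for a 4096-dim embedding
unit = l2_normalize(embedding)  # -> [0.6, 0.8]
```

With unit vectors, dot-product scoring and cosine similarity coincide, and dot_product is the cheaper of the two at query time.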
