Implementation: AllenAI open-instruct Contamination Indexer
| Knowledge Sources | Details |
|---|---|
| Domains | Data_Quality, Evaluation |
| Last Updated | 2026-02-07 02:00 GMT |
Overview
Concrete tool that indexes HuggingFace training datasets into Elasticsearch, supporting both text-based and dense-vector indexing strategies for contamination detection.
Description
The index.py module is the first step of the decontamination pipeline. It reads datasets from HuggingFace Hub (or from a YAML dataset mixer config), extracts messages matching a configurable role filter (e.g., user messages only), and indexes them into Elasticsearch. For text indexing, it creates an index with a custom tulu_analyzer that uses regex-based tokenization splitting on whitespace and select punctuation (preserving math operators and code tokens). For vector indexing, it creates a dense_vector index with 4096-dimensional embeddings using dot_product similarity, encodes documents with an embedding model (default: NV-Embed-v2), and bulk-inserts normalized embeddings.
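The analyzer is configured when the text index is created. A minimal sketch of what such settings could look like, where the exact split pattern and token filters are illustrative assumptions, not the module's actual values:

```python
import re

# Sketch of text-index settings in the style described above. The analyzer and
# tokenizer names follow the description; the pattern and filters are assumed.
TEXT_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "tulu_analyzer": {
                    "type": "custom",
                    "tokenizer": "tulu_tokenizer",
                    "filter": ["lowercase"],
                }
            },
            "tokenizer": {
                "tulu_tokenizer": {
                    # Split on whitespace and light punctuation only, so math
                    # operators and code tokens such as "2+2" stay intact.
                    "type": "pattern",
                    "pattern": r"[\s.,;!?]+",
                }
            },
        }
    },
    "mappings": {"properties": {"text": {"type": "text", "analyzer": "tulu_analyzer"}}},
}

# Python-side preview of what such a pattern tokenizer would produce:
pattern = TEXT_INDEX_SETTINGS["settings"]["analysis"]["tokenizer"]["tulu_tokenizer"]["pattern"]
tokens = [t for t in re.split(pattern, "Solve 2+2 = 4, please.") if t]
```

Note that Elasticsearch's pattern tokenizer uses Java regular expressions, so a real pattern may differ slightly from the Python preview above.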
Usage
Use this module before running search.py to build the Elasticsearch index of training data. It supports both text-based and semantic vector indexing depending on the contamination detection strategy needed.
Code Reference
Source Location
- Repository: Allenai_Open_instruct
- File: decontamination/index.py
- Lines: 1-218
Signature
def create_text_index(es: Elasticsearch, index_name: str) -> None:
    """Create text index with custom tulu_analyzer."""

def create_vector_index(es: Elasticsearch, index_name: str) -> None:
    """Create dense_vector index with 4096 dims and dot_product similarity."""

def read_dataset(dataset_name: str, split: str, messages_field: str,
                 query_filter: str, query_field: str) -> list[dict]:
    """Load HF dataset and extract messages matching role filter."""

def index_dataset_text(data_to_index: list, es: Elasticsearch,
                       index_name: str, text_batch_size: int) -> None:
    """Bulk index text data with batching."""

def index_dataset_vectors(data_to_index: list, es: Elasticsearch,
                          index_name: str, model_name: str,
                          max_batch_tokens: int) -> None:
    """Encode text to embeddings and bulk index vectors."""

def main() -> None:
    """CLI entry point for dataset indexing."""
Import
# CLI script, run directly:
# python decontamination/index.py --dataset_name <name> --index_name <name> --index_type text|vector
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset_name | str | Yes | HuggingFace dataset name to index |
| index_name | str | Yes | Name for the Elasticsearch index |
| index_type | str | Yes | Type of index: text or vector |
| messages_field | str | No | Field name containing messages (default: messages) |
| query_filter | str | No | Role filter in format key:value (default: role:user) |
Outputs
| Name | Type | Description |
|---|---|---|
| Elasticsearch index | ES Index | Populated text or vector index for contamination search |
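Bulk insertion into the output index proceeds in fixed-size batches (controlled by text_batch_size on the text path). A minimal chunking sketch, with an illustrative helper name:

```python
def batched(items, batch_size):
    # Yield successive fixed-size chunks, as a bulk indexer might before
    # handing each chunk to the Elasticsearch bulk API.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

batches = list(batched(list(range(10)), 4))
```

Batching keeps each bulk request bounded in size, which avoids request-size limits and memory spikes on large datasets.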
Usage Examples
Text Indexing
# Index a HuggingFace dataset into Elasticsearch for text-based search
python decontamination/index.py \
--dataset_name allenai/tulu-v3-sft-mixture \
--index_name tulu3_training_text \
--index_type text \
--split train \
--messages_field messages \
--query_filter role:user
Vector Indexing
# Index using dense vector embeddings for semantic matching
python decontamination/index.py \
--dataset_name allenai/tulu-v3-sft-mixture \
--index_name tulu3_training_vectors \
--index_type vector \
--model_name nvidia/NV-Embed-v2
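Elasticsearch's dot_product similarity requires unit-length vectors, which is why the module normalizes embeddings before bulk-inserting them. A minimal numpy sketch of that normalization step (the function name is illustrative):

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    # Scale each row to unit L2 norm so that dot product equals cosine
    # similarity, as dense_vector fields with dot_product similarity require.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

v = l2_normalize(np.array([[3.0, 4.0]]))
```

After normalization, a 3-4-5 vector becomes (0.6, 0.8), whose squared components sum to 1.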