Implementation: FlagOpen FlagEmbedding Search Demo Preprocess
| Knowledge Sources | |
|---|---|
| Domains | Search_Demo, Index_Building, Embedding_Generation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
⚠️ LEGACY CODE: This file is in `research/old-examples/` and is superseded by the main FlagEmbedding inference APIs.
Preprocessing pipeline for building BM25 and dense embedding indices from Wikipedia corpus for search demonstrations.
Description
This script prepares a Wikipedia-based search system by creating both sparse and dense retrieval indices:
EmbDataset handles document batching:
- Loads corpus from JSON with document contents
- Tokenizes documents with padding/truncation to 512 tokens
- Processes documents in batches for efficient embedding generation
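The batching behavior above can be sketched as follows. This is a hedged, torch-free sketch, not the actual `EmbDataset` (which subclasses `torch.utils.data.Dataset`); the class name `EmbDatasetSketch`, the `batch_size` default, and the assumption of one JSON object per line with a `contents` field are all illustrative.

```python
# Sketch of an EmbDataset-like batching dataset. Assumptions: the corpus file
# holds one JSON object per line with a "contents" field; the real class
# subclasses torch.utils.data.Dataset and may differ in detail.
import json

class EmbDatasetSketch:
    def __init__(self, tokenizer, path, batch_size=32, max_length=512):
        with open(path, encoding="utf-8") as f:
            self.docs = [json.loads(line)["contents"] for line in f]
        self.tokenizer = tokenizer
        self.batch_size = batch_size
        self.max_length = max_length

    def __len__(self):
        # number of batches, rounding up for the final partial batch
        return (len(self.docs) + self.batch_size - 1) // self.batch_size

    def __getitem__(self, item):
        batch = self.docs[item * self.batch_size:(item + 1) * self.batch_size]
        # pad/truncate each document to max_length tokens
        return self.tokenizer(batch, padding=True, truncation=True,
                              max_length=self.max_length, return_tensors="pt")
```

Each `__getitem__` call returns one already-tokenized batch, so the embedding loop can iterate over batch indices rather than individual documents.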
inference() generates dense embeddings:
- Uses DataParallel for multi-GPU embedding generation
- Extracts [CLS] token embeddings and L2-normalizes them
- Saves embeddings as NumPy memmap for efficient loading
- Processes large corpora in batches with progress tracking
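The post-processing steps above can be illustrated with a NumPy-only sketch: take the [CLS] vector (position 0 of the sequence), L2-normalize it, and append the result to a memmap. The function names, shapes, and memmap layout are assumptions for illustration, not the script's exact code.

```python
# Hedged sketch of the embedding post-processing: [CLS] extraction,
# L2-normalization, and batched writes into a NumPy memmap.
import numpy as np

def postprocess_batch(hidden_states: np.ndarray) -> np.ndarray:
    """hidden_states: [batch, seq_len, hidden_dim] from the encoder."""
    cls = hidden_states[:, 0, :]                      # [CLS] token embedding
    norms = np.linalg.norm(cls, axis=1, keepdims=True)
    return cls / np.clip(norms, 1e-12, None)          # L2-normalize

def write_embeddings(emb_path, num_docs, hidden_dim, batches):
    memmap = np.memmap(emb_path, dtype=np.float32, mode="w+",
                       shape=(num_docs, hidden_dim))
    offset = 0
    for hidden in batches:                            # encoder outputs per batch
        emb = postprocess_batch(hidden)
        memmap[offset:offset + len(emb)] = emb
        offset += len(emb)
    memmap.flush()
```

Because the vectors are unit-normalized, a later dot product between a query embedding and these rows directly yields cosine similarity.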
build_bm25_index() creates Lucene index:
- Converts corpus to Pyserini-compatible JSON format (title + text)
- Builds Lucene index with document vectors and positional information
- Uses 8 threads for parallel indexing
- Stores raw documents for later retrieval
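The corpus-conversion step can be sketched as below: each document becomes a Pyserini-style record with an `id` and a `contents` field concatenating title and text. The record layout, output filename, and function name are assumptions based on the description above.

```python
# Hedged sketch of converting a HuggingFace dataset (with "title" and "text"
# fields) into a Pyserini-compatible JSON-lines collection.
import json
import os

def write_pyserini_collection(dataset, collection_path):
    os.makedirs(collection_path, exist_ok=True)
    out = os.path.join(collection_path, "documents.json")
    with open(out, "w", encoding="utf-8") as f:
        for i, doc in enumerate(dataset):
            record = {"id": str(i),
                      "contents": f"{doc['title']}\n{doc['text']}"}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out
```

The Lucene index itself is typically built over such a collection with Pyserini's indexer, e.g. `python -m pyserini.index.lucene --collection JsonCollection --input <collection_dir> --index <index_dir> --generator DefaultLuceneDocumentGenerator --threads 8 --storePositions --storeDocvectors --storeRaw`, which matches the thread count and stored-fields behavior described above; the exact invocation in the script may differ.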
The script uses the Chinese Wikipedia (Cohere/wikipedia-22-12) by default and creates a complete search infrastructure with both sparse (BM25) and dense (embedding) retrieval capabilities.
Usage
Use this script to set up a search demo with hybrid (BM25 + dense) retrieval over Wikipedia or similar document collections.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/old-examples/search_demo/pre_process.py
- Lines: 1-112
Signature
class EmbDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, path: str)
    def __getitem__(self, item)

def inference(json_path, emb_path, model_path)
def build_bm25_index(dataset, collection_path, index_path)
Import
from research.old_examples.search_demo.pre_process import inference, build_bm25_index
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_path | str | Yes | Path to embedding model |
| dataset | Dataset | Yes | HuggingFace dataset with title and text fields |
| json_path | str | Yes | Path to documents JSON |
| collection_path | str | Yes | Directory for Pyserini collection |
| index_path | str | Yes | Directory for BM25 index |
| emb_path | str | Yes | Path to save embedding .npy file |
Outputs
| Name | Type | Description |
|---|---|---|
| embeddings | .npy file | Dense embeddings, shape [num_docs, hidden_dim] |
| bm25_index | Directory | Lucene index for BM25 search |
| collection | JSON | Documents in Pyserini format |
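A consumer of these outputs can be sketched as follows. Because the embedding file is written as a raw float32 memmap, the caller must already know `num_docs` and `hidden_dim` (768 for bge-base models); the helper names `load_embeddings` and `top_k` are illustrative, not part of the script.

```python
# Hedged sketch of consuming the output embeddings: memory-map the file
# and rank documents by dot product (== cosine similarity, since the
# stored vectors are L2-normalized).
import numpy as np

def load_embeddings(emb_path, num_docs, hidden_dim=768):
    return np.memmap(emb_path, dtype=np.float32, mode="r",
                     shape=(num_docs, hidden_dim))

def top_k(query_emb, doc_embs, k=5):
    scores = doc_embs @ query_emb        # cosine similarity per document
    idx = np.argsort(-scores)[:k]        # highest-scoring documents first
    return idx, scores[idx]
```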
Usage Examples
from datasets import load_dataset
from research.old_examples.search_demo.pre_process import inference, build_bm25_index
import os
# Set up directories
data_path = "./search_demo_data"
dataset_path = os.path.join(data_path, 'dataset')
collection_path = os.path.join(data_path, 'collection')
index_path = os.path.join(data_path, 'index')
emb_path = os.path.join(data_path, 'emb')
os.makedirs(dataset_path, exist_ok=True)
os.makedirs(collection_path, exist_ok=True)
os.makedirs(index_path, exist_ok=True)
os.makedirs(emb_path, exist_ok=True)
# Load Wikipedia dataset
dataset = load_dataset("Cohere/wikipedia-22-12", 'zh', split='train')
dataset.save_to_disk(dataset_path)
# Build BM25 index
build_bm25_index(dataset, collection_path, index_path)
# Creates: collection/documents.json and index/ directory
# Generate dense embeddings
inference(
    json_path=os.path.join(collection_path, 'documents.json'),
    emb_path=os.path.join(emb_path, 'data.npy'),
    model_path="BAAI/bge-base-zh-v1.5"
)
# Creates: emb/data.npy with shape [num_docs, 768]