Implementation:FlagOpen FlagEmbedding Search Demo Preprocess

From Leeroopedia


Knowledge Sources
Domains Search_Demo, Index_Building, Embedding_Generation
Last Updated 2026-02-09 00:00 GMT

Overview

⚠️ LEGACY CODE: This file is in `research/old_examples/` and is superseded by the main FlagEmbedding inference APIs.

Preprocessing pipeline for building BM25 and dense embedding indices from Wikipedia corpus for search demonstrations.

Description

This script prepares a Wikipedia-based search system by creating both sparse and dense retrieval indices:

EmbDataset handles document batching:

  • Loads corpus from JSON with document contents
  • Tokenizes documents with padding/truncation to 512 tokens
  • Processes documents in batches for efficient embedding generation
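The batching pattern above can be sketched as follows. This is a minimal illustration, not the actual class: the JSON-lines layout, the "contents" field name, and the batch_size parameter are assumptions, and the real EmbDataset subclasses torch.utils.data.Dataset and takes a HuggingFace PreTrainedTokenizer.

```python
import json

class EmbDataset:
    """Batches corpus documents for embedding generation (sketch)."""

    def __init__(self, tokenizer, path, max_length=512, batch_size=256):
        with open(path) as f:
            # assumed: one JSON object per line with a "contents" field
            self.docs = [json.loads(line)["contents"] for line in f]
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.batch_size = batch_size

    def __len__(self):
        # number of batches, not number of documents
        return (len(self.docs) + self.batch_size - 1) // self.batch_size

    def __getitem__(self, item):
        batch = self.docs[item * self.batch_size:(item + 1) * self.batch_size]
        # pad/truncate every document to max_length tokens
        return self.tokenizer(batch, padding="max_length", truncation=True,
                              max_length=self.max_length, return_tensors="pt")
```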

inference() generates dense embeddings:

  • Uses DataParallel for multi-GPU embedding generation
  • Extracts [CLS] token embeddings and L2-normalizes them
  • Saves embeddings as NumPy memmap for efficient loading
  • Processes large corpora in batches with progress tracking
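The post-processing steps above can be sketched in plain NumPy, leaving out the model forward pass and DataParallel wrapping. The function name, array shapes, and float32 dtype are assumptions for illustration.

```python
import numpy as np

def write_embeddings(batches, emb_path, num_docs, hidden_dim):
    """Stream per-batch hidden states into a memmap of document embeddings.

    batches yields arrays of shape [batch, seq_len, hidden_dim].
    """
    memmap = np.memmap(emb_path, dtype=np.float32, mode="w+",
                       shape=(num_docs, hidden_dim))
    offset = 0
    for hidden_states in batches:
        cls = hidden_states[:, 0, :]  # keep the [CLS] token embedding
        cls = cls / np.linalg.norm(cls, axis=1, keepdims=True)  # L2-normalize
        memmap[offset:offset + cls.shape[0]] = cls
        offset += cls.shape[0]
    memmap.flush()
    return memmap
```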

build_bm25_index() creates Lucene index:

  • Converts corpus to Pyserini-compatible JSON format (title + text)
  • Builds Lucene index with document vectors and positional information
  • Uses 8 threads for parallel indexing
  • Stores raw documents for later retrieval
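The two indexing steps above can be sketched as below. The JsonCollection document schema ("id" plus a combined title/text "contents" field) is an assumption, and the pyserini.index.lucene command mirrors the flags the text describes (8 threads, positions, document vectors, raw storage).

```python
import json
import os

def to_pyserini_collection(dataset, collection_path):
    """Convert (title, text) records to Pyserini's JsonCollection format."""
    os.makedirs(collection_path, exist_ok=True)
    out = os.path.join(collection_path, "documents.json")
    with open(out, "w") as f:
        for i, doc in enumerate(dataset):
            # JsonCollection expects "id" and "contents" fields per document
            f.write(json.dumps({"id": str(i),
                                "contents": f"{doc['title']}\n{doc['text']}"},
                               ensure_ascii=False) + "\n")
    return out

def build_index_cmd(collection_path, index_path, threads=8):
    """Lucene indexing command; run it with subprocess.run(cmd, check=True)."""
    return ["python", "-m", "pyserini.index.lucene",
            "--collection", "JsonCollection",
            "--input", collection_path,
            "--index", index_path,
            "--generator", "DefaultLuceneDocumentGenerator",
            "--threads", str(threads),
            "--storePositions", "--storeDocvectors", "--storeRaw"]
```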

The script defaults to the Chinese split of the Cohere/wikipedia-22-12 Wikipedia dump and creates a complete search infrastructure with both sparse (BM25) and dense (embedding) retrieval capabilities.

Usage

Use this to set up a complete search demo system with hybrid retrieval capabilities on Wikipedia or similar document collections.

Code Reference

Source Location

Signature

class EmbDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, path: str)
    def __getitem__(self, item)

def inference(json_path, emb_path, model_path)

def build_bm25_index(dataset, collection_path, index_path)

Import

from research.old_examples.search_demo.pre_process import inference, build_bm25_index

I/O Contract

Inputs

Name             Type     Required  Description
model_path       str      Yes       Path to embedding model
dataset          Dataset  Yes       HuggingFace dataset with title and text fields
json_path        str      Yes       Path to documents JSON
collection_path  str      Yes       Directory for Pyserini collection
index_path       str      Yes       Directory for BM25 index
emb_path         str      Yes       Path to save embedding .npy file

Outputs

Name        Type       Description
embeddings  .npy file  Dense embeddings, shape [num_docs, hidden_dim]
bm25_index  Directory  Lucene index for BM25 search
collection  JSON       Documents in Pyserini format
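Note that a file written with np.memmap is raw binary despite the .npy extension, so reading the embeddings back requires supplying the shape. The helper name and the 768-dim default (bge-base) are assumptions for illustration.

```python
import numpy as np

def load_embeddings(emb_path, num_docs, hidden_dim=768):
    # Raw float32 memmap written during preprocessing; the caller must
    # supply the shape because no header is stored in the file.
    return np.memmap(emb_path, dtype=np.float32, mode="r",
                     shape=(num_docs, hidden_dim))
```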

Usage Examples

from datasets import load_dataset
from research.old_examples.search_demo.pre_process import inference, build_bm25_index
import os

# Set up directories
data_path = "./search_demo_data"
dataset_path = os.path.join(data_path, 'dataset')
collection_path = os.path.join(data_path, 'collection')
index_path = os.path.join(data_path, 'index')
emb_path = os.path.join(data_path, 'emb')

os.makedirs(dataset_path, exist_ok=True)
os.makedirs(collection_path, exist_ok=True)
os.makedirs(index_path, exist_ok=True)
os.makedirs(emb_path, exist_ok=True)

# Load Wikipedia dataset
dataset = load_dataset("Cohere/wikipedia-22-12", 'zh', split='train')
dataset.save_to_disk(dataset_path)

# Build BM25 index
build_bm25_index(dataset, collection_path, index_path)
# Creates: collection/documents.json and index/ directory

# Generate dense embeddings
inference(
    json_path=os.path.join(collection_path, 'documents.json'),
    emb_path=os.path.join(emb_path, 'data.npy'),
    model_path="BAAI/bge-base-zh-v1.5"
)
# Creates: emb/data.npy with shape [num_docs, 768]
