Implementation: FlagOpen FlagEmbedding Search Demo Preprocess
| Knowledge Sources | |
|---|---|
| Domains | Search_Demo, Index_Building, Embedding_Generation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
⚠️ LEGACY CODE: This file is in `research/old-examples/` and is superseded by the main FlagEmbedding inference APIs.
Preprocessing pipeline for building BM25 and dense embedding indices from Wikipedia corpus for search demonstrations.
Description
This script prepares a Wikipedia-based search system by creating both sparse and dense retrieval indices:
EmbDataset handles document batching:
- Loads corpus from JSON with document contents
- Tokenizes documents with padding/truncation to 512 tokens
- Processes documents in batches for efficient embedding generation
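The batching behavior above can be sketched as follows. This is a hedged, torch-free sketch, not the actual `EmbDataset` (which subclasses `torch.utils.data.Dataset`); the class name `EmbDatasetSketch`, the `batch_size` default, and the assumption of one JSON object per line with a `contents` field are all illustrative.

```python
# Sketch of an EmbDataset-like batching dataset. Assumptions: the corpus file
# holds one JSON object per line with a "contents" field; the real class
# subclasses torch.utils.data.Dataset and may differ in detail.
import json

class EmbDatasetSketch:
    def __init__(self, tokenizer, path, batch_size=32, max_length=512):
        with open(path, encoding="utf-8") as f:
            self.docs = [json.loads(line)["contents"] for line in f]
        self.tokenizer = tokenizer
        self.batch_size = batch_size
        self.max_length = max_length

    def __len__(self):
        # number of batches, rounding up for the final partial batch
        return (len(self.docs) + self.batch_size - 1) // self.batch_size

    def __getitem__(self, item):
        batch = self.docs[item * self.batch_size:(item + 1) * self.batch_size]
        # pad/truncate each document to max_length tokens
        return self.tokenizer(batch, padding=True, truncation=True,
                              max_length=self.max_length, return_tensors="pt")
```

Each `__getitem__` call returns one already-tokenized batch, so the embedding loop can iterate over batch indices rather than individual documents.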
inference() generates dense embeddings:
- Uses DataParallel for multi-GPU embedding generation
- Extracts [CLS] token embeddings and L2-normalizes them
- Saves embeddings as NumPy memmap for efficient loading
- Processes large corpora in batches with progress tracking
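The post-processing steps above can be illustrated with a NumPy-only sketch: take the [CLS] vector (position 0 of the sequence), L2-normalize it, and append the result to a memmap. The function names, shapes, and memmap layout are assumptions for illustration, not the script's exact code.

```python
# Hedged sketch of the embedding post-processing: [CLS] extraction,
# L2-normalization, and batched writes into a NumPy memmap.
import numpy as np

def postprocess_batch(hidden_states: np.ndarray) -> np.ndarray:
    """hidden_states: [batch, seq_len, hidden_dim] from the encoder."""
    cls = hidden_states[:, 0, :]                      # [CLS] token embedding
    norms = np.linalg.norm(cls, axis=1, keepdims=True)
    return cls / np.clip(norms, 1e-12, None)          # L2-normalize

def write_embeddings(emb_path, num_docs, hidden_dim, batches):
    memmap = np.memmap(emb_path, dtype=np.float32, mode="w+",
                       shape=(num_docs, hidden_dim))
    offset = 0
    for hidden in batches:                            # encoder outputs per batch
        emb = postprocess_batch(hidden)
        memmap[offset:offset + len(emb)] = emb
        offset += len(emb)
    memmap.flush()
```

Because the vectors are unit-normalized, a later dot product between a query embedding and these rows directly yields cosine similarity.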
build_bm25_index() creates Lucene index:
- Converts corpus to Pyserini-compatible JSON format (title + text)
- Builds Lucene index with document vectors and positional information
- Uses 8 threads for parallel indexing
- Stores raw documents for later retrieval
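The corpus-conversion step can be sketched as below: each document becomes a Pyserini-style record with an `id` and a `contents` field concatenating title and text. The record layout, output filename, and function name are assumptions based on the description above.

```python
# Hedged sketch of converting a HuggingFace dataset (with "title" and "text"
# fields) into a Pyserini-compatible JSON-lines collection.
import json
import os

def write_pyserini_collection(dataset, collection_path):
    os.makedirs(collection_path, exist_ok=True)
    out = os.path.join(collection_path, "documents.json")
    with open(out, "w", encoding="utf-8") as f:
        for i, doc in enumerate(dataset):
            record = {"id": str(i),
                      "contents": f"{doc['title']}\n{doc['text']}"}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out
```

The Lucene index itself is typically built over such a collection with Pyserini's indexer, e.g. `python -m pyserini.index.lucene --collection JsonCollection --input <collection_dir> --index <index_dir> --generator DefaultLuceneDocumentGenerator --threads 8 --storePositions --storeDocvectors --storeRaw`, which matches the thread count and stored-fields behavior described above; the exact invocation in the script may differ.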
The script uses the Chinese Wikipedia (Cohere/wikipedia-22-12) by default and creates a complete search infrastructure with both sparse (BM25) and dense (embedding) retrieval capabilities.
Usage
Use this script to set up a search demo with hybrid (BM25 + dense) retrieval over Wikipedia or similar document collections.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/old-examples/search_demo/pre_process.py
- Lines: 1-112
Signature
class EmbDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, path: str)
    def __getitem__(self, item)

def inference(json_path, emb_path, model_path)
def build_bm25_index(dataset, collection_path, index_path)
Import
from research.old_examples.search_demo.pre_process import inference, build_bm25_index
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_path | str | Yes | Path to embedding model |
| dataset | Dataset | Yes | HuggingFace dataset with title and text fields |
| json_path | str | Yes | Path to documents JSON |
| collection_path | str | Yes | Directory for Pyserini collection |
| index_path | str | Yes | Directory for BM25 index |
| emb_path | str | Yes | Path to save embedding .npy file |
Outputs
| Name | Type | Description |
|---|---|---|
| embeddings | .npy file | Dense embeddings, shape [num_docs, hidden_dim] |
| bm25_index | Directory | Lucene index for BM25 search |
| collection | JSON | Documents in Pyserini format |
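A consumer of these outputs can be sketched as follows. Because the embedding file is written as a raw float32 memmap, the caller must already know `num_docs` and `hidden_dim` (768 for bge-base models); the helper names `load_embeddings` and `top_k` are illustrative, not part of the script.

```python
# Hedged sketch of consuming the output embeddings: memory-map the file
# and rank documents by dot product (== cosine similarity, since the
# stored vectors are L2-normalized).
import numpy as np

def load_embeddings(emb_path, num_docs, hidden_dim=768):
    return np.memmap(emb_path, dtype=np.float32, mode="r",
                     shape=(num_docs, hidden_dim))

def top_k(query_emb, doc_embs, k=5):
    scores = doc_embs @ query_emb        # cosine similarity per document
    idx = np.argsort(-scores)[:k]        # highest-scoring documents first
    return idx, scores[idx]
```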
Usage Examples
from datasets import load_dataset
from research.old_examples.search_demo.pre_process import inference, build_bm25_index
import os
# Set up directories
data_path = "./search_demo_data"
dataset_path = os.path.join(data_path, 'dataset')
collection_path = os.path.join(data_path, 'collection')
index_path = os.path.join(data_path, 'index')
emb_path = os.path.join(data_path, 'emb')
os.makedirs(dataset_path, exist_ok=True)
os.makedirs(collection_path, exist_ok=True)
os.makedirs(index_path, exist_ok=True)
os.makedirs(emb_path, exist_ok=True)
# Load Wikipedia dataset
dataset = load_dataset("Cohere/wikipedia-22-12", 'zh', split='train')
dataset.save_to_disk(dataset_path)
# Build BM25 index
build_bm25_index(dataset, collection_path, index_path)
# Creates: collection/documents.json and index/ directory
# Generate dense embeddings
inference(
    json_path=os.path.join(collection_path, 'documents.json'),
    emb_path=os.path.join(emb_path, 'data.npy'),
    model_path="BAAI/bge-base-zh-v1.5"
)
# Creates: emb/data.npy with shape [num_docs, 768]