Implementation:FlagOpen FlagEmbedding BGE Coder FAISS Search
| Knowledge Sources | |
|---|---|
| Domains | Vector Search, FAISS, Code Retrieval |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
FAISS-based vector search utilities for code retrieval with embedding generation and similarity filtering.
Description
This module provides helper functions for building FAISS indices and running vector similarity search over code embeddings. It covers three tasks: creating a flat inner-product FAISS index with optional GPU acceleration, searching large embedding collections in batches, and selecting the top-k similar documents while using Jaccard similarity to filter out near-duplicate results. Embeddings are generated with FlagEmbedding's FlagModel.
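To make the idea of a flat inner-product search concrete, the following is a minimal NumPy-only sketch of the computation such an index performs: score every document against every query, then keep the k best. The helper name `flat_ip_search` is hypothetical and is not part of the module; it only illustrates the technique and is not the module's implementation.

```python
import numpy as np

def flat_ip_search(doc_embeddings: np.ndarray, query_embeddings: np.ndarray, k: int):
    """Brute-force inner-product search (hypothetical helper, NumPy only)."""
    # Inner product between every query and every document: (N_queries x N_docs).
    scores = query_embeddings @ doc_embeddings.T
    # Positions of the k highest-scoring documents per query, best first.
    top_indices = np.argsort(-scores, axis=1)[:, :k]
    # Gather the matching scores so both outputs have shape (N_queries x k).
    top_scores = np.take_along_axis(scores, top_indices, axis=1)
    return top_scores, top_indices

docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)
queries = np.array([[1.0, 0.0]], dtype=np.float32)
scores, indices = flat_ip_search(docs, queries, k=2)
print(indices[0])  # [0 2]
```

Because the vectors here have unit length, the inner product equals cosine similarity, which is why the usage examples normalize embeddings before indexing.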
Usage
Use this module to run similarity search over code embeddings, to find diverse hard negative examples for training (near-duplicates are filtered out), or to build retrieval systems for code-to-code or text-to-code search. It is particularly useful in the BGE-Coder data generation pipeline, where the goal is to find code snippets that are related but not identical.
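The near-duplicate filtering mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the whitespace tokenization, the `max_jaccard` threshold of 0.7, and the helper names `jaccard` and `filter_near_duplicates` are all hypothetical choices made for the example.

```python
from typing import List

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the whitespace-token sets of two strings."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

def filter_near_duplicates(
    query: str,
    candidates: List[str],
    top: int = 1,
    max_jaccard: float = 0.7,  # assumed threshold, chosen for illustration
) -> List[str]:
    """Keep up to `top` candidates that are not near-duplicates of the query
    or of a candidate that was already kept. Candidates are assumed to be
    ordered by descending embedding similarity."""
    kept: List[str] = []
    for cand in candidates:
        if jaccard(query, cand) >= max_jaccard:
            continue  # too close to the query itself
        if any(jaccard(cand, prev) >= max_jaccard for prev in kept):
            continue  # too close to a result we already have
        kept.append(cand)
        if len(kept) == top:
            break
    return kept

query = "def foo(): pass"
candidates = ["def foo(): pass", "def foo(): return 1", "def bar(): pass", "class Baz: pass"]
kept = filter_near_duplicates(query, candidates, top=2)
print(kept)  # ['def foo(): return 1', 'def bar(): pass']
```

The exact copy of the query is skipped, so the results are related snippets rather than duplicates, which is the property that matters when mining hard negatives.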
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/BGE_Coder/data_generation/search.py
- Lines: 1-71
Signature
def create_index(embeddings: np.ndarray, use_gpu: bool = False):
    """Create FAISS flat inner product index from embeddings"""

def search(
    faiss_index: faiss.Index,
    k: int = 100,
    query_embeddings: Optional[np.ndarray] = None,
    load_path: Optional[str] = None
):
    """Search FAISS index and return scores and indices"""

def get_top1(
    small_docs,
    encoder_name,
    docs: List[str],
    top: int = 1
):
    """Get top similar documents with Jaccard filtering"""
Import
from search import create_index, search, get_top1
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| embeddings | np.ndarray | Yes | Document embeddings matrix (N x D) |
| use_gpu | bool | No | Whether to use GPU acceleration (default: False) |
| faiss_index | faiss.Index | Yes | FAISS index for searching |
| k | int | No | Number of nearest neighbors to retrieve (default: 100) |
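| query_embeddings | np.ndarray | No | Query embeddings for search (N_queries x D) (default: None) |
| load_path | str | No | File path given as an alternative to query_embeddings (default: None) |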
| small_docs | List[str] | Yes | Query documents for get_top1 |
| encoder_name | str | Yes | FlagModel encoder name |
| docs | List[str] | Yes | Corpus documents to search |
| top | int | No | Number of filtered documents to return per query (default: 1) |
Outputs
| Name | Type | Description |
|---|---|---|
| index | faiss.Index | Created FAISS index |
| all_scores | np.ndarray | Similarity scores (N_queries x k) |
| all_indices | np.ndarray | Document indices (N_queries x k) |
| return_docs | List[List[str]] | Filtered top documents per query |
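The (N_queries x k) output shapes follow naturally from searching in batches and stacking the per-batch results. The sketch below shows that pattern with a NumPy stand-in for the index; `batched_search`, `numpy_search`, and the `batch_size` parameter are hypothetical names for this example. With a real FAISS index, a callable such as `lambda batch, k: index.search(batch, k)` could be passed instead, since `Index.search` returns a (scores, indices) pair.

```python
import numpy as np

def batched_search(search_batch, query_embeddings, k, batch_size=256):
    """Run `search_batch(batch, k)` over slices of the queries and stack the results."""
    all_scores, all_indices = [], []
    for start in range(0, len(query_embeddings), batch_size):
        batch = query_embeddings[start:start + batch_size]
        scores, indices = search_batch(batch, k)
        all_scores.append(scores)
        all_indices.append(indices)
    # Concatenate along the query axis: (N_queries x k) for both arrays.
    return np.concatenate(all_scores, axis=0), np.concatenate(all_indices, axis=0)

rng = np.random.default_rng(0)
doc_embeddings = rng.standard_normal((20, 4)).astype(np.float32)
query_embeddings = rng.standard_normal((10, 4)).astype(np.float32)

def numpy_search(batch, k):
    # Stand-in for an index search: brute-force inner product, best first.
    scores = batch @ doc_embeddings.T
    indices = np.argsort(-scores, axis=1)[:, :k]
    return np.take_along_axis(scores, indices, axis=1), indices

all_scores, all_indices = batched_search(numpy_search, query_embeddings, k=3, batch_size=4)
print(all_scores.shape, all_indices.shape)  # (10, 3) (10, 3)
```

Batching keeps peak memory bounded by the batch size rather than by the total number of queries, which matters when the query set is large.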
Usage Examples
# Example 1: Create index and search
import numpy as np
from search import create_index, search
# Create document embeddings (100 docs, 768 dim)
doc_embeddings = np.random.randn(100, 768).astype(np.float32)
doc_embeddings = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
# Create FAISS index (pass use_gpu=True only if a GPU build of FAISS and a GPU are available)
index = create_index(doc_embeddings, use_gpu=False)
# Search with query embeddings
query_embeddings = np.random.randn(10, 768).astype(np.float32)
query_embeddings = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
scores, indices = search(index, k=5, query_embeddings=query_embeddings)
print(f"Top 5 scores for first query: {scores[0]}")
print(f"Top 5 indices for first query: {indices[0]}")
# Example 2: Find diverse similar documents
from search import get_top1
small_docs = ["def foo(): pass", "class Bar: pass"]
corpus_docs = ["def foo(): return 1"] * 50 + ["def bar(): pass"] * 50
# Get top 3 diverse matches per query
results = get_top1(
    small_docs=small_docs,
    encoder_name="BAAI/bge-base-en-v1.5",
    docs=corpus_docs,
    top=3
)
for i, similar_docs in enumerate(results):
    print(f"Query {i}: Found {len(similar_docs)} diverse matches")