Implementation:FlagOpen FlagEmbedding BGE Coder FAISS Search

From Leeroopedia


Knowledge Sources
Domains: Vector Search, FAISS, Code Retrieval
Last Updated: 2026-02-09 00:00 GMT

Overview

FAISS-based vector search utilities for code retrieval with embedding generation and similarity filtering.

Description

This module provides helper functions for building FAISS indices and running vector similarity search over code embeddings. It covers three tasks: creating a flat inner-product FAISS index (optionally GPU-accelerated), searching large embedding collections in batches, and returning the top-k similar documents per query with Jaccard similarity filtering to avoid near-duplicate results. Code embeddings are generated with FlagEmbedding's FlagModel.
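To make the search step concrete, the following is a minimal NumPy sketch of what an exact (flat) inner-product search computes, processed in query batches. It is not the module's code and does not call FAISS; the function name `brute_force_ip_search` and the `batch_size` parameter are hypothetical, chosen only for illustration.

```python
import numpy as np

def brute_force_ip_search(doc_emb, query_emb, k=5, batch_size=4):
    """Exact top-k inner-product search, processed in query batches.

    Hypothetical helper: it mirrors the result shape of a flat
    inner-product index (scores and indices of shape N_queries x k).
    """
    k = min(k, doc_emb.shape[0])
    all_scores, all_indices = [], []
    for start in range(0, query_emb.shape[0], batch_size):
        batch = query_emb[start:start + batch_size]
        sims = batch @ doc_emb.T                    # (batch, N) inner products
        top = np.argsort(-sims, axis=1)[:, :k]      # highest scores first
        all_scores.append(np.take_along_axis(sims, top, axis=1))
        all_indices.append(top)
    return np.concatenate(all_scores), np.concatenate(all_indices)

# Toy data: 20 unit-normalized documents, the first 6 reused as queries.
rng = np.random.default_rng(0)
docs = rng.standard_normal((20, 8)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

scores, indices = brute_force_ip_search(docs, docs[:6], k=3)
print(scores.shape, indices.shape)
```

With unit-normalized vectors the inner product equals cosine similarity, which is why the usage examples on this page normalize embeddings before indexing.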

Usage

Use this module to run similarity search over code embeddings, to find diverse hard negative examples for training (by filtering near-duplicates), or to build retrieval systems for code-to-code or text-to-code search. It is particularly useful in the BGE-Coder data generation pipeline for finding code snippets that are related but not identical.
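The near-duplicate filtering mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the whitespace tokenization, the `max_jaccard` threshold of 0.7, and the helper names `jaccard` and `filter_near_duplicates` are all hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated tokens (assumed tokenization)."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def filter_near_duplicates(query, ranked_candidates, top=1, max_jaccard=0.7):
    """Walk candidates in ranked order and keep up to `top` of them,
    skipping any that overlap too heavily with the query or with a
    candidate that was already kept. The threshold is an assumption."""
    kept = []
    for cand in ranked_candidates:
        if jaccard(query, cand) > max_jaccard:
            continue  # too close to the query itself
        if any(jaccard(cand, kept_doc) > max_jaccard for kept_doc in kept):
            continue  # too close to an already selected result
        kept.append(cand)
        if len(kept) == top:
            break
    return kept

query = "def add(a, b): return a + b"
candidates = [
    "def add(a, b): return a + b",   # exact copy of the query, dropped
    "def add(x, y): return x + y",
    "def sub(a, b): return a - b",
    "print('hello')",
]
print(filter_near_duplicates(query, candidates, top=2))
```

For hard negative mining, the point of such a filter is that a candidate nearly identical to the query is likely a true positive rather than a useful negative.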

Code Reference

Source Location

Signature

def create_index(embeddings: np.ndarray, use_gpu: bool = False):
    """Create FAISS flat inner product index from embeddings"""

def search(
    faiss_index: faiss.Index,
    k: int = 100,
    query_embeddings: Optional[np.ndarray] = None,
    load_path: Optional[str] = None
):
    """Search FAISS index and return scores and indices"""

def get_top1(
    small_docs,
    encoder_name,
    docs: List[str],
    top: int = 1
):
    """Get top similar documents with Jaccard filtering"""

Import

from search import create_index, search, get_top1

I/O Contract

Inputs

Name | Type | Required | Description
embeddings | np.ndarray | Yes | Document embeddings matrix (N x D)
use_gpu | bool | No | Whether to use GPU acceleration (default: False)
faiss_index | faiss.Index | Yes | FAISS index for searching
k | int | No | Number of nearest neighbors to retrieve (default: 100)
query_embeddings | np.ndarray | No | Query embeddings for search
small_docs | List[str] | Yes | Query documents for get_top1
encoder_name | str | Yes | FlagModel encoder name
docs | List[str] | Yes | Corpus documents to search
top | int | No | Number of documents to return per query in get_top1 (default: 1)

Outputs

Name | Type | Description
index | faiss.Index | Created FAISS index
all_scores | np.ndarray | Similarity scores (N_queries x k)
all_indices | np.ndarray | Document indices (N_queries x k)
return_docs | List[List[str]] | Filtered top documents per query
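The index and score arrays above are positional: each entry of all_indices is a row number in the corpus that was indexed. A minimal sketch of mapping them back to document text is shown below. The helper name `rows_to_docs` and the sample arrays are hypothetical. FAISS itself pads a row with index -1 when fewer than k neighbors are available; whether this module filters such entries is not stated here, so the sketch skips them defensively.

```python
import numpy as np

corpus = ["def foo(): return 1", "def bar(): pass", "class Baz: pass"]

# Hypothetical search output for 2 queries with k=2.
# The -1 entry stands for a missing neighbor; its score is a placeholder.
all_indices = np.array([[0, 2], [1, -1]])
all_scores = np.array([[0.91, 0.40], [0.88, 0.0]], dtype=np.float32)

def rows_to_docs(indices, scores, docs):
    """Pair each valid index with its document text and score."""
    results = []
    for idx_row, score_row in zip(indices, scores):
        hits = [
            (docs[int(i)], float(s))
            for i, s in zip(idx_row, score_row)
            if i >= 0  # skip padding entries
        ]
        results.append(hits)
    return results

hits = rows_to_docs(all_indices, all_scores, corpus)
for query_id, row in enumerate(hits):
    print(query_id, [doc for doc, _ in row])
```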

Usage Examples

# Example 1: Create index and search
import numpy as np
from search import create_index, search

# Create document embeddings (100 docs, 768 dim)
doc_embeddings = np.random.randn(100, 768).astype(np.float32)
doc_embeddings = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

# Create FAISS index (use_gpu=True needs a GPU-enabled FAISS build;
# pass use_gpu=False on CPU-only machines)
index = create_index(doc_embeddings, use_gpu=True)

# Search with query embeddings
query_embeddings = np.random.randn(10, 768).astype(np.float32)
query_embeddings = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)

scores, indices = search(index, k=5, query_embeddings=query_embeddings)
print(f"Top 5 scores for first query: {scores[0]}")
print(f"Top 5 indices for first query: {indices[0]}")

# Example 2: Find diverse similar documents
from search import get_top1

small_docs = ["def foo(): pass", "class Bar: pass"]
corpus_docs = ["def foo(): return 1"] * 50 + ["def bar(): pass"] * 50

# Get top 3 diverse matches per query
results = get_top1(
    small_docs=small_docs,
    encoder_name="BAAI/bge-base-en-v1.5",
    docs=corpus_docs,
    top=3
)

for i, similar_docs in enumerate(results):
    print(f"Query {i}: Found {len(similar_docs)} diverse matches")
