

Workflow:FlagOpen FlagEmbedding Embedder Inference

From Leeroopedia


Knowledge Sources
Domains: Text_Embeddings, Information_Retrieval, NLP
Last Updated: 2026-02-09 21:30 GMT

Overview

End-to-end process for loading a BGE embedding model and encoding text into dense vector representations for similarity computation and retrieval tasks.

Description

This workflow covers the standard procedure for generating text embeddings with BGE models. It supports several model architectures: encoder-only models (bge-base-en-v1.5, bge-large-en-v1.5), multilingual M3 models with dense, sparse, and ColBERT (multi-vector) retrieval, LLM-based decoder-only models (bge-multilingual-gemma2), and in-context learning models (bge-en-icl). The auto-loading factory selects the embedder class that matches the model type and handles multi-device distribution transparently.

Usage

Execute this workflow when you need to generate dense vector embeddings from text for tasks such as semantic search, passage retrieval, document similarity, clustering, or classification. The workflow applies to any scenario requiring sentence or passage representations using a pre-trained or fine-tuned BGE model.

Execution Steps

Step 1: Install FlagEmbedding

Install the FlagEmbedding package from PyPI or from source. The inference-only installation does not require the finetune dependency extras.

Key considerations:

  • Use pip install -U FlagEmbedding for inference-only usage
  • GPU inference requires a CUDA-enabled PyTorch installation
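
The installation can be checked from Python before loading any model. The following is a minimal sketch that uses only the standard library; the helper names installed_version and cuda_available are hypothetical, and the GPU probe assumes PyTorch is the backend and exposes torch.cuda.is_available().

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def cuda_available():
    """Return True if PyTorch is importable and reports a usable CUDA device."""
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return False
    return torch.cuda.is_available()


if __name__ == "__main__":
    print("FlagEmbedding:", installed_version("FlagEmbedding") or "not installed")
    print("CUDA available:", cuda_available())
```

If the first line prints "not installed", rerun the pip command above in the same environment that runs the script.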

Step 2: Load the Embedding Model

Initialize the embedding model using the auto-loading factory. The factory method determines the model type from the model name and instantiates the appropriate embedder subclass (BaseEmbedder, M3Embedder, BaseLLMEmbedder, or ICLLLMEmbedder). Specify device placement, precision settings, and query instructions at load time.

Key considerations:

  • FlagAutoModel.from_finetuned() auto-detects the model architecture for model names it recognizes
  • For custom or fine-tuned models whose names are not recognized, specify model_class explicitly (encoder-only-base, encoder-only-m3, decoder-only-base, decoder-only-icl)
  • Enable use_fp16 for faster inference with minimal accuracy loss
  • Multi-GPU inference is supported via the devices parameter
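
The loading step might look like the sketch below. The wrapper load_embedder and the MODEL_CLASS_EXAMPLES table are hypothetical helpers introduced for illustration; the "BAAI/" repository prefix and the pairing of each model_class string with an example model are assumptions based on the architectures listed in the Description. The import is deferred so the module can be read and imported without the package installed.

```python
# model_class strings from the list above, paired with illustrative model names.
# The pairings are assumptions, not taken from the library's own mapping.
MODEL_CLASS_EXAMPLES = {
    "encoder-only-base": "BAAI/bge-base-en-v1.5",
    "encoder-only-m3": "BAAI/bge-m3",
    "decoder-only-base": "BAAI/bge-multilingual-gemma2",
    "decoder-only-icl": "BAAI/bge-en-icl",
}


def load_embedder(model_name_or_path, model_class=None,
                  query_instruction=None, devices=None):
    """Load an embedder through the auto-loading factory.

    model_class may stay None for recognized model names; pass one of the
    MODEL_CLASS_EXAMPLES keys for a custom or fine-tuned checkpoint.
    """
    from FlagEmbedding import FlagAutoModel  # lazy import: requires the package

    return FlagAutoModel.from_finetuned(
        model_name_or_path,
        model_class=model_class,
        query_instruction_for_retrieval=query_instruction,
        use_fp16=True,      # half precision for faster inference
        devices=devices,    # e.g. "cuda:0" or ["cuda:0", "cuda:1"]
    )


# Example calls (weights are downloaded on first use):
# model = load_embedder(
#     "BAAI/bge-base-en-v1.5",
#     query_instruction="Represent this sentence for searching relevant passages: ",
#     devices=["cuda:0"],
# )
# custom = load_embedder("./my-finetuned-bge",   # hypothetical local path
#                        model_class="encoder-only-base")
```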

Step 3: Encode Text into Embeddings

Feed sentences or documents to the model to produce embedding vectors. For asymmetric retrieval (a short query matched against longer passages), encode queries with encode_queries(), which automatically prepends the query instruction, and encode passages with encode_corpus(). For symmetric tasks, use the generic encode() method.

Key considerations:

  • encode_queries() adds the query_instruction_for_retrieval prefix automatically
  • encode_corpus() encodes passages without an instruction prefix by default
  • For M3 models, specify return_dense, return_sparse, and return_colbert_vecs to select retrieval methods
  • For ICL models, provide few-shot examples via the examples_for_task parameter
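
The encoding calls can be sketched as two small wrappers. The function names embed_for_retrieval and embed_m3 are hypothetical; they accept any already-loaded embedder object that exposes the methods named above, and the M3 wrapper assumes encode() returns a dictionary keyed by dense_vecs, lexical_weights, and colbert_vecs.

```python
def embed_for_retrieval(model, queries, passages):
    """Encode queries and passages with the asymmetric retrieval methods."""
    q_emb = model.encode_queries(queries)   # query instruction is prepended here
    p_emb = model.encode_corpus(passages)   # passages carry no prefix by default
    return q_emb, p_emb


def embed_m3(model, sentences):
    """Request all three M3 output types in a single encode() call."""
    out = model.encode(
        sentences,
        return_dense=True,
        return_sparse=True,
        return_colbert_vecs=True,
    )
    # Assumed output keys: dense_vecs, lexical_weights, colbert_vecs.
    return out["dense_vecs"], out["lexical_weights"], out["colbert_vecs"]


# Example with a loaded model (not executed here):
# q_emb, p_emb = embed_for_retrieval(
#     model,
#     ["what is a dense embedding?"],
#     ["A dense embedding maps text to a fixed-size vector.",
#      "Sparse retrieval relies on term weights."],
# )
```

Keeping the model as a parameter means the same wrappers work for any of the embedder classes, provided the class supports the methods being called.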

Step 4: Compute Similarity Scores

Calculate pairwise similarity between query and passage embeddings using matrix multiplication (inner product). When the embeddings are L2-normalized, the inner product equals cosine similarity. For M3 models, combine dense, sparse (lexical matching), and ColBERT scores for hybrid retrieval.

Key considerations:

  • With normalize_embeddings=True, dot product equals cosine similarity
  • M3 hybrid scoring combines dense_vecs, lexical_weights, and colbert_vecs
  • Use compute_lexical_matching_score() for sparse similarity with M3
  • Scores can be used to rank passages for a given query
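
The scoring arithmetic can be illustrated in plain Python with toy vectors. This is a sketch of the ideas only, not the library's implementation: lexical_overlap_score and hybrid_score are hypothetical helpers, the hybrid weights are placeholders, and for real M3 outputs compute_lexical_matching_score() should be preferred. With array embeddings, the score matrix is typically written as q_emb @ p_emb.T.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def score_matrix(query_vecs, passage_vecs):
    """Inner-product scores: one row per query, one column per passage."""
    return [[dot(q, p) for p in passage_vecs] for q in query_vecs]


def rank_passages(scores_for_query):
    """Return passage indices ordered from highest to lowest score."""
    return sorted(range(len(scores_for_query)),
                  key=lambda i: scores_for_query[i], reverse=True)


def lexical_overlap_score(weights_a, weights_b):
    """Sum of weight products over tokens present in both sparse vectors."""
    return sum(w * weights_b[t] for t, w in weights_a.items() if t in weights_b)


def hybrid_score(dense, sparse, colbert, weights=(0.4, 0.2, 0.4)):
    """Weighted average of the three scores; the weights are placeholders."""
    wd, ws, wc = weights
    return (wd * dense + ws * sparse + wc * colbert) / (wd + ws + wc)


# Toy 2-dimensional "embeddings"; unit length makes dot product = cosine.
queries = [l2_normalize([1.0, 0.0])]
passages = [l2_normalize(v) for v in ([3.0, 4.0], [1.0, 0.1], [0.0, 2.0])]
scores = score_matrix(queries, passages)
ranking = rank_passages(scores[0])  # passage indices, best match first
```

Here the second passage points almost in the same direction as the query, so it ranks first, and the orthogonal third passage ranks last.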
