Workflow:FlagOpen FlagEmbedding Embedder Inference
| Knowledge Sources | |
|---|---|
| Domains | Text_Embeddings, Information_Retrieval, NLP |
| Last Updated | 2026-02-09 21:30 GMT |
Overview
End-to-end process for loading a BGE embedding model and encoding text into dense vector representations for similarity computation and retrieval tasks.
Description
This workflow covers the standard procedure for using BGE embedding models to generate text embeddings. It supports multiple model architectures including encoder-only models (bge-base/large-en-v1.5), multilingual M3 models with dense+sparse+ColBERT retrieval, LLM-based decoder-only models (bge-multilingual-gemma2), and in-context learning models (bge-en-icl). The auto-loading factory pattern detects the model type and instantiates the appropriate embedder class, handling multi-device distribution transparently.
Usage
Execute this workflow when you need to generate dense vector embeddings from text for tasks such as semantic search, passage retrieval, document similarity, clustering, or classification. The workflow applies to any scenario requiring sentence or passage representations using a pre-trained or fine-tuned BGE model.
Execution Steps
Step 1: Install FlagEmbedding
Install the FlagEmbedding package from PyPI or from source. The inference-only installation does not require the finetune dependency extras.
Key considerations:
- Use pip install -U FlagEmbedding for inference-only usage
- GPU support requires a CUDA-compatible PyTorch installation
Step 2: Load the Embedding Model
Initialize the embedding model using the auto-loading factory. The factory method resolves the model type from the model name and instantiates the appropriate embedder subclass (BaseEmbedder, M3Embedder, BaseLLMEmbedder, or ICLLLMEmbedder). Specify device placement, precision settings, and query instructions at load time.
Key considerations:
- FlagAutoModel.from_finetuned() auto-detects the model architecture
- For custom or fine-tuned models, specify model_class explicitly (encoder-only-base, encoder-only-m3, decoder-only-base, decoder-only-icl)
- Enable use_fp16 for faster inference with minimal accuracy loss
- Multi-GPU inference is supported via the devices parameter
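A minimal sketch of this loading step, assuming the FlagEmbedding package is installed and the model weights can be downloaded. The model name and instruction string follow the public BGE examples; LOAD_KWARGS and load_embedder are hypothetical names used only for illustration, and the single-GPU device list is an assumption.

```python
# Keyword arguments passed to the factory at load time.
LOAD_KWARGS = {
    # Prefix that encode_queries() prepends to each query.
    "query_instruction_for_retrieval": (
        "Represent this sentence for searching relevant passages: "
    ),
    "use_fp16": True,       # half precision: faster, slight accuracy cost
    "devices": ["cuda:0"],  # assumption: one GPU; list several for multi-GPU
}


def load_embedder(model_name="BAAI/bge-base-en-v1.5", **overrides):
    """Load an embedder through the auto-loading factory (hypothetical helper)."""
    # Imported lazily so the heavy dependency is only needed when called.
    from FlagEmbedding import FlagAutoModel

    kwargs = {**LOAD_KWARGS, **overrides}
    return FlagAutoModel.from_finetuned(model_name, **kwargs)
```

For a custom or fine-tuned checkpoint, a call such as load_embedder("path/to/checkpoint", model_class="encoder-only-base") would pass the architecture explicitly; the path here is a placeholder.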
Step 3: Encode Text into Embeddings
Feed sentences or documents to the model to produce embedding vectors. For asymmetric retrieval (short query to long passage), use encode_queries() for queries, which automatically prepends the query instruction, and encode_corpus() for passages, which adds no prefix. For symmetric tasks, use the generic encode() method.
Key considerations:
- encode_queries() adds the query_instruction_for_retrieval prefix automatically
- encode_corpus() encodes passages without instruction prefixes
- For M3 models, specify return_dense, return_sparse, and return_colbert_vecs to select retrieval methods
- For ICL models, provide few-shot examples via examples_for_task parameter
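The encoding calls above can be sketched as two small helpers. This assumes model is an embedder already loaded through the factory; embed_for_retrieval and embed_m3 are hypothetical names, and the dictionary keys in embed_m3 are assumed to match the M3 output names used elsewhere in this workflow.

```python
def embed_for_retrieval(model, queries, passages):
    """Encode queries and passages with the asymmetric retrieval methods."""
    # encode_queries() prepends the query instruction configured at load time.
    query_embeddings = model.encode_queries(queries)
    # encode_corpus() encodes passages as-is, with no instruction prefix.
    passage_embeddings = model.encode_corpus(passages)
    return query_embeddings, passage_embeddings


def embed_m3(model, sentences):
    """Request all three M3 representations in one encode() call."""
    output = model.encode(
        sentences,
        return_dense=True,
        return_sparse=True,
        return_colbert_vecs=True,
    )
    # Assumption: the M3 embedder returns a dict with these three keys.
    return output["dense_vecs"], output["lexical_weights"], output["colbert_vecs"]
```

Keeping the two retrieval calls in one helper makes it harder to encode queries with the passage method by accident, which would silently drop the instruction prefix.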
Step 4: Compute Similarity Scores
Calculate pairwise similarity between query and passage embeddings using matrix multiplication (inner product). Normalized embeddings support cosine similarity via dot product. For M3 models, combine dense, sparse (lexical matching), and ColBERT scores for hybrid retrieval.
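The dense case can be illustrated with NumPy. The arrays below are toy stand-ins for model outputs, not real embeddings, and l2_normalize is a hypothetical helper that mimics what normalized output would look like.

```python
import numpy as np

# Toy stand-ins for model outputs: 2 queries and 3 passages, dimension 4.
query_embeddings = np.array([[1.0, 0.0, 0.0, 0.0],
                             [0.0, 3.0, 4.0, 0.0]])
passage_embeddings = np.array([[2.0, 0.0, 0.0, 0.0],
                               [0.0, 0.6, 0.8, 0.0],
                               [0.0, 0.0, 0.0, 5.0]])


def l2_normalize(x):
    """Scale each row to unit length so dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


q = l2_normalize(query_embeddings)
p = l2_normalize(passage_embeddings)

# One matrix product gives every query-passage score: shape (2, 3).
scores = q @ p.T

# Passage indices per query, sorted from highest to lowest score.
ranking = np.argsort(-scores, axis=1)
```

Here the first query points in the same direction as the first passage and the second query as the second passage, so those pairs receive the top score in their rows.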
Key considerations:
- With normalize_embeddings=True, dot product equals cosine similarity
- M3 hybrid scoring combines dense_vecs, lexical_weights, and colbert_vecs
- Use compute_lexical_matching_score() for sparse similarity with M3
- Scores can be used to rank passages for a given query
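To make the sparse and hybrid scores concrete, here is a pure-Python sketch. It assumes the lexical matching score is the sum of weight products over tokens shared by both texts, and that the hybrid score is a weighted average of the three components. Both functions are illustrative stand-ins rather than the library's implementation, the token keys and weights are made up, and the mixing weights are hypothetical.

```python
def lexical_matching_score(weights_a, weights_b):
    """Sum the products of weights for tokens present in both mappings."""
    return sum(
        w * weights_b[token]
        for token, w in weights_a.items()
        if token in weights_b
    )


def hybrid_score(dense, sparse, colbert, weights=(0.4, 0.2, 0.4)):
    """Weighted average of the three scores; the weights are illustrative."""
    w_dense, w_sparse, w_colbert = weights
    total = w_dense + w_sparse + w_colbert
    return (w_dense * dense + w_sparse * sparse + w_colbert * colbert) / total


# Made-up lexical weights for one query and one passage.
query_weights = {"what": 0.1, "bge": 0.3, "m3": 0.5}
passage_weights = {"bge": 0.2, "m3": 0.4, "model": 0.1}

# Only "bge" and "m3" are shared: 0.3 * 0.2 + 0.5 * 0.4.
sparse = lexical_matching_score(query_weights, passage_weights)

# Dense and ColBERT scores here are placeholder values.
combined = hybrid_score(dense=0.8, sparse=sparse, colbert=0.7)
```

In practice the model's own compute_lexical_matching_score() should be used on the lexical_weights it returns; the sketch only shows what such a score measures and how the three signals can be blended into one ranking value.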