Workflow:FlagOpen FlagEmbedding Embedder Inference

Knowledge Sources	FlagEmbedding BGE Documentation
Domains	Text_Embeddings, Information_Retrieval, NLP
Last Updated	2026-02-09 21:30 GMT

Overview

End-to-end process for loading a BGE embedding model and encoding text into dense vector representations for similarity computation and retrieval tasks.

Description

This workflow covers the standard procedure for generating text embeddings with BGE models. It supports several model architectures: encoder-only models (bge-base-en-v1.5, bge-large-en-v1.5), multilingual M3 models that combine dense, sparse, and ColBERT retrieval, LLM-based decoder-only models (bge-multilingual-gemma2), and in-context learning models (bge-en-icl). The auto-loading factory selects the embedder class that matches the model and handles multi-device distribution transparently.

Usage

Execute this workflow when you need to generate dense vector embeddings from text for tasks such as semantic search, passage retrieval, document similarity, clustering, or classification. The workflow applies to any scenario requiring sentence or passage representations using a pre-trained or fine-tuned BGE model.

Execution Steps

Step 1: Install FlagEmbedding

Install the FlagEmbedding package from PyPI or from source. The inference-only installation does not require the finetune dependency extras.

Key considerations:

Use pip install -U FlagEmbedding for inference-only usage
GPU inference requires a CUDA-enabled PyTorch build
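
A quick way to confirm the environment after installation is to query the installed package version and, if PyTorch is present, ask it whether a CUDA device is visible. The sketch below is illustrative only; the helper names installed_version and cuda_available are hypothetical and not part of FlagEmbedding.

```python
from importlib import metadata, util
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def cuda_available() -> bool:
    """Return True only if PyTorch is installed and can see a CUDA device."""
    if util.find_spec("torch") is None:
        return False  # PyTorch itself is missing
    import torch  # imported lazily so the check works without PyTorch

    return torch.cuda.is_available()


if __name__ == "__main__":
    print("FlagEmbedding:", installed_version("FlagEmbedding") or "not installed")
    print("CUDA available:", cuda_available())
```

If the second line reports False on a GPU machine, the usual cause is a CPU-only PyTorch build.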

Step 2: Load the Embedding Model

Initialize the embedding model using the auto-loading factory. The factory method infers the model type from the model name and instantiates the appropriate embedder subclass (BaseEmbedder, M3Embedder, BaseLLMEmbedder, or ICLLLMEmbedder). Specify device placement, precision settings, and query instructions at load time.

Key considerations:

FlagAutoModel.from_finetuned() auto-detects the model architecture for recognized BGE model names
For custom or fine-tuned models whose names are not recognized, specify model_class explicitly (encoder-only-base, encoder-only-m3, decoder-only-base, decoder-only-icl)
Enable use_fp16 for faster inference with minimal accuracy loss
Multi-GPU inference is supported via the devices parameter
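
The loading step can be sketched as follows. The wrapper functions build_load_kwargs and load_embedder are hypothetical helpers written for this illustration; only FlagAutoModel.from_finetuned() and the keyword names use_fp16, query_instruction_for_retrieval, devices, and model_class come from the library. The import is deferred so that the option handling can be inspected without the package installed.

```python
from typing import Any, Dict, List, Optional


def build_load_kwargs(
    use_fp16: bool = True,
    query_instruction: Optional[str] = None,
    devices: Optional[List[str]] = None,
    model_class: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect only the options that were actually set (hypothetical helper)."""
    kwargs: Dict[str, Any] = {"use_fp16": use_fp16}
    if query_instruction is not None:
        kwargs["query_instruction_for_retrieval"] = query_instruction
    if devices is not None:
        kwargs["devices"] = devices  # e.g. ["cuda:0", "cuda:1"]
    if model_class is not None:
        kwargs["model_class"] = model_class  # e.g. "encoder-only-base"
    return kwargs


def load_embedder(model_name_or_path: str, **options: Any):
    """Load an embedder through the auto-loading factory."""
    from FlagEmbedding import FlagAutoModel  # deferred: needs the package

    return FlagAutoModel.from_finetuned(
        model_name_or_path, **build_load_kwargs(**options)
    )


# Example call (downloads model weights on first use):
# model = load_embedder(
#     "BAAI/bge-base-en-v1.5",
#     query_instruction="Represent this sentence for searching relevant passages: ",
#     use_fp16=True,
# )
```

For a locally fine-tuned checkpoint stored under an arbitrary directory name, passing model_class (for example "encoder-only-base") avoids relying on name-based detection.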

Step 3: Encode Text into Embeddings

Feed sentences or documents to the model to produce embedding vectors. For asymmetric retrieval (short query to long passage), encode the two sides separately: encode_queries() prepends the query instruction automatically, while encode_corpus() encodes passages as they are. For symmetric tasks, use the generic encode() method.

Key considerations:

encode_queries() adds the query_instruction_for_retrieval prefix automatically
encode_corpus() encodes passages without an instruction prefix by default
For M3 models, specify return_dense, return_sparse, and return_colbert_vecs to select retrieval methods
For ICL models, provide few-shot examples via examples_for_task parameter
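
The asymmetric encoding pattern can be sketched as below. The function embed_for_retrieval is a hypothetical wrapper that works with any object exposing encode_queries() and encode_corpus(), such as a loaded BGE embedder. The helper prefix_queries only illustrates what prepending an instruction does to the input text; plain concatenation is an assumption here, since this document does not state the library's exact format string.

```python
from typing import Any, List, Sequence, Tuple


def prefix_queries(instruction: str, queries: Sequence[str]) -> List[str]:
    """Illustration only: show the text a query becomes once prefixed.

    Assumes plain concatenation of instruction and query.
    """
    return [f"{instruction}{query}" for query in queries]


def embed_for_retrieval(
    model: Any, queries: Sequence[str], passages: Sequence[str]
) -> Tuple[Any, Any]:
    """Encode queries and passages with their dedicated methods.

    `model` is any object with encode_queries() and encode_corpus().
    """
    query_vecs = model.encode_queries(list(queries))    # instruction added by the model
    passage_vecs = model.encode_corpus(list(passages))  # passages encoded as given
    return query_vecs, passage_vecs


# Example with a loaded embedder (not executed here):
# q_vecs, p_vecs = embed_for_retrieval(
#     model,
#     ["what is a panda?"],
#     ["The giant panda is a bear species endemic to China.", "Paris is in France."],
# )
```

Keeping the two calls separate matters because using the query method for passages would attach the query instruction to every document.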

Step 4: Compute Similarity Scores

Calculate pairwise similarity between query and passage embeddings with a single matrix multiplication (inner product). When the embeddings are L2-normalized, the inner product equals cosine similarity. For M3 models, combine dense, sparse (lexical matching), and ColBERT scores for hybrid retrieval.
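
As a minimal sketch of the dense case, the NumPy code below normalizes toy vectors, computes the query-by-passage score matrix, and ranks passages for one query. The two-dimensional vectors are made-up stand-ins for real embeddings, and the function names are hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (guarding against zero rows)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def similarity_matrix(query_vecs: np.ndarray, passage_vecs: np.ndarray) -> np.ndarray:
    """Inner product of every query with every passage: shape (n_q, n_p)."""
    return query_vecs @ passage_vecs.T


def rank_passages(scores_row: np.ndarray) -> np.ndarray:
    """Passage indices sorted from highest to lowest score."""
    return np.argsort(-scores_row)


# Toy stand-ins for embeddings; real ones come from the encoding step.
queries = l2_normalize(np.array([[1.0, 0.0], [1.0, 1.0]]))
passages = l2_normalize(np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]))

scores = similarity_matrix(queries, passages)  # cosine similarities
best_for_first_query = rank_passages(scores[0])
```

Because both sides are unit length, every entry of scores lies in [-1, 1] and can be read directly as a cosine similarity.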

Key considerations:

With normalize_embeddings=True, dot product equals cosine similarity
M3 hybrid scoring combines dense_vecs, lexical_weights, and colbert_vecs
Use compute_lexical_matching_score() for sparse similarity with M3
Scores can be used to rank passages for a given query
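
The hybrid scoring idea can be sketched in plain Python. The lexical score below sums the products of weights for tokens shared by query and passage, which is the usual definition of a sparse lexical match. The weighted average used to combine the three scores, along with the function names and the example weights, is an assumption for illustration and not necessarily the library's exact formula. With a real M3 model, compute_lexical_matching_score() would supply the sparse score.

```python
from typing import Dict, Sequence


def lexical_matching_score(
    query_weights: Dict[str, float], passage_weights: Dict[str, float]
) -> float:
    """Sum of weight products over tokens present in both inputs."""
    return sum(
        weight * passage_weights[token]
        for token, weight in query_weights.items()
        if token in passage_weights
    )


def hybrid_score(
    dense: float,
    sparse: float,
    colbert: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> float:
    """Weighted average of dense, sparse, and ColBERT scores (assumed scheme)."""
    w_dense, w_sparse, w_colbert = weights
    total = w_dense + w_sparse + w_colbert
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return (w_dense * dense + w_sparse * sparse + w_colbert * colbert) / total


# Toy lexical weights keyed by token; real ones come from an M3 model's
# lexical_weights output.
sparse = lexical_matching_score({"panda": 0.5, "bear": 0.2}, {"panda": 0.4, "china": 0.9})
combined = hybrid_score(dense=0.6, sparse=sparse, colbert=0.7, weights=(0.4, 0.2, 0.4))
```

Tuning the three weights on a validation set is a common way to balance semantic matching against exact term overlap.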
