Implementation:FlagOpen FlagEmbedding AbsEmbedder Encode

From Leeroopedia
Revision as of 14:58, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/FlagOpen_FlagEmbedding_AbsEmbedder_Encode.md)


Field | Value
type | API Doc
source | Repo: FlagOpen/FlagEmbedding (https://github.com/FlagOpen/FlagEmbedding)
domains | NLP, Information_Retrieval

Overview

The text-encoding interface of the FlagEmbedding library, which converts text into embeddings. This page covers three methods: encode(), encode_queries(), and encode_corpus().

Description

The AbsEmbedder base class defines the encoding interface for all FlagEmbedding embedder implementations. It provides three public methods:

  • encode_queries() -- encodes query strings with the configured retrieval instruction prefix
  • encode_corpus() -- encodes corpus/passage strings; no instruction is prepended by default
  • encode() -- general-purpose encoding with optional instruction parameter
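The relationship between the three methods listed above can be sketched as follows. This is a minimal stand-in, not the library's code: the class `ToyEmbedder`, the helper `apply_instruction`, and the `"{}{}"` default template are assumptions made for illustration, and the toy `encode()` returns the prepared strings instead of vectors so that the prefixing is visible.

```python
from typing import List, Optional, Union


def apply_instruction(instruction_format: str, instruction: str, sentence: str) -> str:
    """Fill a two-slot template with the instruction and the sentence (hypothetical helper)."""
    return instruction_format.format(instruction, sentence)


class ToyEmbedder:
    """Illustrative stand-in for an embedder; it returns prepared text, not vectors."""

    def __init__(self, query_instruction: Optional[str] = None,
                 query_instruction_format: str = "{}{}") -> None:
        self.query_instruction = query_instruction
        self.query_instruction_format = query_instruction_format

    def encode(self, sentences: Union[List[str], str],
               instruction: Optional[str] = None,
               instruction_format: Optional[str] = None) -> List[str]:
        # Accept a single string or a list of strings.
        if isinstance(sentences, str):
            sentences = [sentences]
        # Without an instruction, the text is passed through unchanged.
        if instruction is None:
            return list(sentences)
        template = instruction_format or "{}{}"
        return [apply_instruction(template, instruction, s) for s in sentences]

    def encode_queries(self, queries: Union[List[str], str]) -> List[str]:
        # Query side: delegate to encode() with the configured instruction.
        return self.encode(queries,
                           instruction=self.query_instruction,
                           instruction_format=self.query_instruction_format)

    def encode_corpus(self, corpus: Union[List[str], str]) -> List[str]:
        # Passage side: delegate to encode() with no instruction.
        return self.encode(corpus)


embedder = ToyEmbedder(
    query_instruction="Represent this sentence for searching relevant passages: "
)
prepared_queries = embedder.encode_queries(["What is text embedding?"])
prepared_passages = embedder.encode_corpus(["Text embedding maps text to vectors."])
print(prepared_queries[0])
print(prepared_passages[0])
```

In this sketch only the query side is modified, which is why queries and passages should go through their dedicated methods rather than a single shared call.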

When running on multiple GPUs, encoding is automatically parallelized across devices using process pools. For M3 models, encoding returns a dictionary containing dense vectors, lexical weights, and ColBERT vectors.
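The idea of spreading one encoding call over several workers can be illustrated with a small sketch. It is an assumption-laden toy rather than the library's implementation: `fake_encode`, `split_into_shards`, and `encode_parallel` are hypothetical names, the "embedding" is a pair of simple text statistics, and a thread pool is swapped in for the per-device process pool to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor
from math import ceil
from typing import List


def fake_encode(shard: List[str]) -> List[List[float]]:
    """Stand-in encoder: one 2-dimensional 'embedding' per sentence."""
    return [[float(len(s)), float(s.count(" ") + 1)] for s in shard]


def split_into_shards(sentences: List[str], num_workers: int) -> List[List[str]]:
    """Split sentences into contiguous shards of nearly equal size."""
    shard_size = max(1, ceil(len(sentences) / num_workers))
    return [sentences[i:i + shard_size] for i in range(0, len(sentences), shard_size)]


def encode_parallel(sentences: List[str], num_workers: int) -> List[List[float]]:
    """Encode each shard on its own worker and stitch the results back together."""
    shards = split_into_shards(sentences, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() yields results in shard order, so output order matches input order.
        shard_outputs = list(pool.map(fake_encode, shards))
    return [vector for output in shard_outputs for vector in output]


sentences = ["Hello world", "Text embedding is useful", "Dense retrieval"]
embeddings = encode_parallel(sentences, num_workers=2)
print(len(embeddings))
```

The property worth noting is that results are concatenated in shard order, so the caller sees one embedding per input sentence in the original order regardless of how many workers were used.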

Usage

Use via a model instance obtained from FlagAutoModel.from_finetuned(). Call encode_queries() for query-side encoding, encode_corpus() for passage-side encoding, or encode() for general encoding with optional instructions.
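Once queries and passages are encoded, a typical next step is to score each passage against a query. The sketch below assumes the embeddings are comparable by inner product and uses hand-written stand-in vectors in place of real model output; `inner_product` and `rank_passages` are hypothetical helper names, not part of FlagEmbedding.

```python
from typing import List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_passages(query_vec: Sequence[float],
                  passage_vecs: Sequence[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs sorted from best to worst match."""
    scores = [(idx, inner_product(query_vec, p)) for idx, p in enumerate(passage_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Stand-in vectors; real ones would come from encode_queries() and encode_corpus().
query_embedding = [1.0, 0.0]
corpus_embeddings = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

ranking = rank_passages(query_embedding, corpus_embeddings)
print(ranking)
```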

Code Reference

Source Location: Repository FlagOpen/FlagEmbedding, File: FlagEmbedding/abc/inference/AbsEmbedder.py, Lines: L159-285

  • encode_queries: L159-191
  • encode_corpus: L193-228
  • encode: L230-285

Signature for encode:

def encode(
    self,
    sentences: Union[List[str], str],
    batch_size: Optional[int] = None,
    max_length: Optional[int] = None,
    convert_to_numpy: Optional[bool] = None,
    instruction: Optional[str] = None,
    instruction_format: Optional[str] = None,
    **kwargs: Any
) -> Union[torch.Tensor, np.ndarray]:

Signature for encode_queries:

def encode_queries(
    self,
    queries: Union[List[str], str],
    batch_size: Optional[int] = None,
    max_length: Optional[int] = None,
    convert_to_numpy: Optional[bool] = None,
    **kwargs: Any
) -> Union[torch.Tensor, np.ndarray]:

Signature for encode_corpus:

def encode_corpus(
    self,
    corpus: Union[List[str], str, List[Dict]],
    batch_size: Optional[int] = None,
    max_length: Optional[int] = None,
    convert_to_numpy: Optional[bool] = None,
    **kwargs: Any
) -> Union[torch.Tensor, np.ndarray]:

Import:

from FlagEmbedding import FlagAutoModel
# Access encode methods via model instance

I/O Contract

Inputs (encode):

Parameter | Type | Required | Description
sentences | Union[List[str], str] | Yes | Text string or list of text strings to encode
batch_size | Optional[int] | No | Batch size for encoding (defaults to the model's configured value)
max_length | Optional[int] | No | Maximum token length; longer inputs are truncated
convert_to_numpy | Optional[bool] | No | Whether to return a numpy array; when omitted, the model's configured default applies
instruction | Optional[str] | No | Instruction string to prepend to each sentence
instruction_format | Optional[str] | No | Template for combining the instruction with the sentence
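How batch_size and max_length interact can be shown with a small sketch. It is only an illustration under stated assumptions: `truncate` and `iter_batches` are hypothetical helpers, and whitespace splitting stands in for the model's real tokenizer, which counts tokens differently.

```python
from typing import Iterator, List


def truncate(sentence: str, max_length: int) -> str:
    """Keep at most max_length whitespace-separated tokens (crude tokenizer stand-in)."""
    return " ".join(sentence.split()[:max_length])


def iter_batches(sentences: List[str], batch_size: int, max_length: int) -> Iterator[List[str]]:
    """Yield consecutive batches of truncated sentences."""
    for start in range(0, len(sentences), batch_size):
        yield [truncate(s, max_length) for s in sentences[start:start + batch_size]]


batches = list(iter_batches(["a b c d", "e f", "g h i"], batch_size=2, max_length=3))
print(batches)
```

A larger batch_size means fewer forward passes at the cost of more memory per pass, while max_length bounds the cost of any single input.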

Inputs (encode_queries):

Parameter | Type | Required | Description
queries | Union[List[str], str] | Yes | Query string or list of query strings
batch_size | Optional[int] | No | Batch size for encoding
max_length | Optional[int] | No | Maximum token length for truncation
convert_to_numpy | Optional[bool] | No | Convert output to numpy array

Inputs (encode_corpus):

Parameter | Type | Required | Description
corpus | Union[List[str], str, List[Dict]] | Yes | Corpus string(s) or list of dicts with title/text keys
batch_size | Optional[int] | No | Batch size for encoding
max_length | Optional[int] | No | Maximum token length for truncation
convert_to_numpy | Optional[bool] | No | Convert output to numpy array

Outputs:

Name | Type | Description
return (standard models) | np.ndarray | Array of shape (n, dim), where n is the number of input sentences and dim is the embedding dimension
return (M3 models) | Dict | Dictionary with keys dense_vecs (np.ndarray), lexical_weights (List[Dict]), and colbert_vecs (List[np.ndarray])
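Because the return type differs between standard and M3 models, calling code that should work with both may want a small adapter. The sketch below is one possible approach, not part of the library: `extract_dense` is a hypothetical helper, and plain lists stand in for the numpy arrays a real model would return.

```python
from typing import Any


def extract_dense(output: Any) -> Any:
    """Return the dense embeddings from either output shape described above."""
    if isinstance(output, dict):
        # M3-style output: the dense vectors live under the 'dense_vecs' key.
        return output["dense_vecs"]
    # Standard output: the value already is the embedding array.
    return output


# Stand-in values shaped like the two documented return types.
standard_output = [[0.1, 0.2], [0.3, 0.4]]
m3_output = {
    "dense_vecs": [[0.1, 0.2]],
    "lexical_weights": [{"101": 0.5}],
    "colbert_vecs": [[[0.1, 0.2]]],
}

print(extract_dense(standard_output))
print(extract_dense(m3_output))
```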

Usage Examples

Example 1: Basic encoding with encode()

from FlagEmbedding import FlagAutoModel

model = FlagAutoModel.from_finetuned('BAAI/bge-base-en-v1.5')

sentences = ["Hello world", "Text embedding is useful"]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 768)

Example 2: Encoding queries with encode_queries()

from FlagEmbedding import FlagAutoModel

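# Pass the retrieval instruction explicitly so encode_queries() has a prefix to apply.
model = FlagAutoModel.from_finetuned(
    'BAAI/bge-base-en-v1.5',
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
)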

queries = ["What is text embedding?", "How does retrieval work?"]
query_embeddings = model.encode_queries(queries)
print(query_embeddings.shape)  # (2, 768)

Example 3: Encoding corpus with encode_corpus()

from FlagEmbedding import FlagAutoModel

model = FlagAutoModel.from_finetuned('BAAI/bge-base-en-v1.5')

passages = [
    "Text embedding maps text to vectors.",
    "Dense retrieval uses neural networks for search."
]
corpus_embeddings = model.encode_corpus(passages)
print(corpus_embeddings.shape)  # (2, 768)

Example 4: M3 model encoding with multiple output types

from FlagEmbedding import FlagAutoModel

model = FlagAutoModel.from_finetuned('BAAI/bge-m3')

sentences = ["BGE M3 supports multiple retrieval methods"]
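# Request all three output types explicitly; sparse and ColBERT outputs
# may not be populated unless they are asked for.
output = model.encode(sentences, return_dense=True, return_sparse=True, return_colbert_vecs=True)
# For M3 models, output is a dict:
# output['dense_vecs']      -> np.ndarray of dense embeddings
# output['lexical_weights'] -> list of sparse weight dicts
# output['colbert_vecs']    -> list of ColBERT token-level arrays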
