Implementation:FlagOpen FlagEmbedding AbsEmbedder Encode
| Field | Value |
|---|---|
| type | API Doc |
| source | Repo: FlagOpen/FlagEmbedding https://github.com/FlagOpen/FlagEmbedding |
| domains | NLP, Information_Retrieval |
Overview
Concrete tool, provided by the FlagEmbedding library, for encoding text into embeddings. This page covers three methods: encode(), encode_queries(), and encode_corpus().
Description
The AbsEmbedder base class defines the encoding interface for all FlagEmbedding embedder implementations. It provides three public methods:
- encode_queries() -- encodes query strings with the configured retrieval instruction prefix
- encode_corpus() -- encodes corpus/passage strings; no instruction is prefixed unless a passage instruction has been configured
- encode() -- general-purpose encoding with optional instruction parameter
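The practical difference between the query path and the corpus path is whether an instruction is combined with each input before tokenization. A minimal sketch of that step, assuming the template is a Python format string with two positional slots (instruction first, then sentence) such as "{}{}". The helper name apply_instruction and the sample instruction text are illustrative and not part of the FlagEmbedding API:

```python
from typing import List, Optional


def apply_instruction(
    sentences: List[str],
    instruction: Optional[str] = None,
    instruction_format: str = "{}{}",
) -> List[str]:
    """Combine an instruction with each sentence (hypothetical helper)."""
    if instruction is None:
        # Corpus-style path: inputs pass through unchanged.
        return list(sentences)
    # Query-style path: fill the template with (instruction, sentence).
    return [instruction_format.format(instruction, s) for s in sentences]


queries = ["What is text embedding?"]
prefixed = apply_instruction(
    queries,
    instruction="Represent this sentence for searching relevant passages: ",
)
unchanged = apply_instruction(queries)
```

Here prefixed holds the query with the instruction in front of it, while unchanged is the original text, which mirrors how a passage with no configured instruction would be fed to the model.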
When running on multiple GPUs, encoding is automatically parallelized across devices using process pools. For M3 models, encoding returns a dictionary containing dense vectors, lexical weights, and ColBERT vectors.
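One way to picture the multi-device path: the input list is split into contiguous chunks, each chunk is encoded by a separate worker, and the results are concatenated in the original order. The sketch below uses a thread pool and a stand-in fake_encode function so that it runs without a GPU or a model. The chunking rule and every name in it are assumptions for illustration, not the library's implementation:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_into_chunks(items: List[str], num_workers: int) -> List[List[str]]:
    """Split items into at most num_workers contiguous, near-equal chunks."""
    chunk_size = max(1, -(-len(items) // num_workers))  # ceiling division
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def fake_encode(chunk: List[str]) -> List[List[float]]:
    """Stand-in for a per-device encoder: one 1-dimensional 'embedding' per text."""
    return [[float(len(text))] for text in chunk]


def encode_parallel(items: List[str], num_workers: int) -> List[List[float]]:
    """Encode chunks concurrently and stitch the rows back together in order."""
    chunks = split_into_chunks(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() yields results in submission order, so row order is preserved.
        results = list(pool.map(fake_encode, chunks))
    return [row for chunk_result in results for row in chunk_result]


texts = ["a", "bb", "ccc", "dddd", "eeeee"]
embeddings = encode_parallel(texts, num_workers=2)
```

The property that matters to callers is that row i of the result still corresponds to input i, regardless of which worker produced it.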
Usage
Use via a model instance obtained from FlagAutoModel.from_finetuned(). Call encode_queries() for query-side encoding, encode_corpus() for passage-side encoding, or encode() for general encoding with optional instructions.
Code Reference
Source Location: Repository FlagOpen/FlagEmbedding, File: FlagEmbedding/abc/inference/AbsEmbedder.py, Lines: L159-285
- encode_queries: L159-191
- encode_corpus: L193-228
- encode: L230-285
Signature for encode:
def encode(
    self,
    sentences: Union[List[str], str],
    batch_size: Optional[int] = None,
    max_length: Optional[int] = None,
    convert_to_numpy: Optional[bool] = None,
    instruction: Optional[str] = None,
    instruction_format: Optional[str] = None,
    **kwargs: Any
) -> Union[torch.Tensor, np.ndarray]:
Signature for encode_queries:
def encode_queries(
    self,
    queries: Union[List[str], str],
    batch_size: Optional[int] = None,
    max_length: Optional[int] = None,
    convert_to_numpy: Optional[bool] = None,
    **kwargs: Any
) -> Union[torch.Tensor, np.ndarray]:
Signature for encode_corpus:
def encode_corpus(
    self,
    corpus: Union[List[str], str, List[Dict]],
    batch_size: Optional[int] = None,
    max_length: Optional[int] = None,
    convert_to_numpy: Optional[bool] = None,
    **kwargs: Any
) -> Union[torch.Tensor, np.ndarray]:
Import:
from FlagEmbedding import FlagAutoModel
# Access encode methods via model instance
I/O Contract
Inputs (encode):
| Parameter | Type | Required | Description |
|---|---|---|---|
| sentences | Union[List[str], str] | Yes | Text string or list of text strings to encode |
| batch_size | Optional[int] | No | Batch size for encoding (defaults to model default) |
| max_length | Optional[int] | No | Maximum token length for truncation |
| convert_to_numpy | Optional[bool] | No | Whether to convert the output to a NumPy array; when None, the model's configured setting applies (NumPy output by default) |
| instruction | Optional[str] | No | Instruction string to prepend to each sentence |
| instruction_format | Optional[str] | No | Template for formatting the instruction with the sentence |
Inputs (encode_queries):
| Parameter | Type | Required | Description |
|---|---|---|---|
| queries | Union[List[str], str] | Yes | Query string or list of query strings |
| batch_size | Optional[int] | No | Batch size for encoding |
| max_length | Optional[int] | No | Maximum token length for truncation |
| convert_to_numpy | Optional[bool] | No | Convert output to numpy array |
Inputs (encode_corpus):
| Parameter | Type | Required | Description |
|---|---|---|---|
| corpus | Union[List[str], str, List[Dict]] | Yes | Corpus string(s) or list of dicts with title/text keys |
| batch_size | Optional[int] | No | Batch size for encoding |
| max_length | Optional[int] | No | Maximum token length for truncation |
| convert_to_numpy | Optional[bool] | No | Convert output to numpy array |
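When the corpus is given as a list of dicts, each record has to be reduced to a single string before it can be encoded. The exact rule is not stated on this page, so the sketch below assumes that title and text are joined with a space and that title may be absent. flatten_corpus is a hypothetical helper, not a library function:

```python
from typing import Dict, List, Union


def flatten_corpus(corpus: Union[str, List[str], List[Dict[str, str]]]) -> List[str]:
    """Turn any accepted corpus input into a flat list of strings (assumed rule)."""
    if isinstance(corpus, str):
        return [corpus]
    flat: List[str] = []
    for item in corpus:
        if isinstance(item, dict):
            # Assumption: "<title> <text>", with the title optional.
            flat.append(f"{item.get('title', '')} {item['text']}".strip())
        else:
            flat.append(item)
    return flat


docs = [
    {"title": "BGE", "text": "A family of embedding models."},
    {"text": "No title here."},
]
passages = flatten_corpus(docs)
```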
Outputs:
| Name | Type | Description |
|---|---|---|
| return (standard models) | Union[torch.Tensor, np.ndarray] | Array of shape (n, dim), where n is the number of input sentences and dim is the embedding dimension; np.ndarray by default, torch.Tensor when convert_to_numpy is False |
| return (M3 models) | Dict | Dictionary with keys: dense_vecs (np.ndarray), lexical_weights (List[Dict]), colbert_vecs (List[np.ndarray]); the sparse and ColBERT entries are filled only when those outputs are requested |
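Because the return value is either an array or a dict depending on the model, calling code that should work with both can normalize the result first. A minimal sketch, using plain Python lists as stand-ins for real arrays; get_dense is a hypothetical helper name:

```python
from typing import Any


def get_dense(output: Any) -> Any:
    """Return the dense embeddings from either return shape."""
    if isinstance(output, dict):
        # M3-style output: dense vectors live under the 'dense_vecs' key.
        return output["dense_vecs"]
    # Standard output: the value already is the (n, dim) array.
    return output


standard_output = [[0.1, 0.2], [0.3, 0.4]]  # stand-in for an (n, dim) array
m3_output = {
    "dense_vecs": [[0.1, 0.2]],
    "lexical_weights": [{"101": 0.5}],
    "colbert_vecs": [[[0.1, 0.2]]],
}
dense_a = get_dense(standard_output)
dense_b = get_dense(m3_output)
```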
Usage Examples
Example 1: Basic encoding with encode()
from FlagEmbedding import FlagAutoModel
model = FlagAutoModel.from_finetuned('BAAI/bge-base-en-v1.5')
sentences = ["Hello world", "Text embedding is useful"]
embeddings = model.encode(sentences)
print(embeddings.shape) # (2, 768)
Example 2: Encoding queries with encode_queries()
from FlagEmbedding import FlagAutoModel
model = FlagAutoModel.from_finetuned('BAAI/bge-base-en-v1.5')
queries = ["What is text embedding?", "How does retrieval work?"]
query_embeddings = model.encode_queries(queries)
print(query_embeddings.shape) # (2, 768)
Example 3: Encoding corpus with encode_corpus()
from FlagEmbedding import FlagAutoModel
model = FlagAutoModel.from_finetuned('BAAI/bge-base-en-v1.5')
passages = [
    "Text embedding maps text to vectors.",
    "Dense retrieval uses neural networks for search."
]
corpus_embeddings = model.encode_corpus(passages)
print(corpus_embeddings.shape) # (2, 768)
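Query and corpus embeddings are usually compared with an inner product to rank passages. A minimal sketch with small hand-written vectors in place of real model output, so the numbers are illustrative only. rank_passages is a hypothetical helper; with real embeddings the same scores would typically be computed as a single matrix product:

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_passages(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs sorted from best to worst."""
    scores = [(i, dot(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


query_vec = [1.0, 0.0]
passage_vecs = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
ranking = rank_passages(query_vec, passage_vecs)
best_index = ranking[0][0]
```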
Example 4: M3 model encoding with multiple output types
from FlagEmbedding import FlagAutoModel
model = FlagAutoModel.from_finetuned('BAAI/bge-m3')
sentences = ["BGE M3 supports multiple retrieval methods"]
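# Sparse and ColBERT outputs are requested explicitly here (M3-specific keyword arguments)
output = model.encode(sentences, return_dense=True, return_sparse=True, return_colbert_vecs=True)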
# For M3 models, output is a dict:
# output['dense_vecs'] -> np.ndarray of dense embeddings
# output['lexical_weights'] -> list of sparse weight dicts
# output['colbert_vecs'] -> list of ColBERT token-level arrays
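The sparse output can be used for term-level matching. As a sketch, assuming each entry of lexical_weights is a dict mapping token ids to weights, a score between two texts can be taken as the sum, over the tokens they share, of the product of the two weights. The function below is a plain-Python illustration with made-up weights, not a call into the library:

```python
from typing import Dict


def lexical_matching_score(w1: Dict[str, float], w2: Dict[str, float]) -> float:
    """Sum of weight products over tokens present in both dicts."""
    score = 0.0
    for token, weight in w1.items():
        if token in w2:
            score += weight * w2[token]
    return score


# Made-up token ids and weights, for illustration only.
query_weights = {"101": 0.5, "202": 0.2}
passage_weights = {"101": 0.4, "202": 0.1, "303": 0.9}
score = lexical_matching_score(query_weights, passage_weights)
```

Tokens that appear in only one of the two texts ("303" here) contribute nothing, which is what makes this score behave like a learned form of exact-term matching.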