Implementation:Neuml Txtai WordVectors
| Knowledge Sources | |
|---|---|
| Domains | Word_Embeddings, Text_Encoding |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Static word embeddings with SIF-style weighting, multiprocessing-based indexing, and SQLite-backed vector storage.
Description
The WordVectors class implements text vectorization using pre-trained static word embeddings (e.g., GloVe, word2vec-style models in the staticvectors format). Unlike transformer-based encoders that produce contextual embeddings, WordVectors builds a single vector per text by aggregating individual word vectors using either a weighted average (when a scoring method provides token weights, similar to Smooth Inverse Frequency weighting) or a simple mean when no weights are available.
The class provides two model detection static methods. ismodel(path) checks whether a path points to a WordVectors model by examining either the SQLite database format or the config.json on Hugging Face Hub for "model_type": "staticvectors". isdatabase(path) checks specifically for a SQLite database file used as the word vectors storage format.
The index method implements a multiprocessing pipeline for efficient batch vectorization. It uses Python's multiprocessing.Pool with the module-level create() initializer and transform() worker function. Each subprocess lazily loads its own WordVectors instance via global parameters and transforms documents in parallel. Embeddings are streamed to a temporary NumPy file on disk to control memory usage, with configurable batch sizes.
The encode method handles the core vectorization: tokenizing text (if needed), looking up word vectors via self.model.embeddings(tokens), and computing a weighted average (using the scoring method's token weights) or mean.
Usage
Use this vector backend for lightweight, fast text vectorization when transformer-based models are not required. It is particularly suitable for large-scale indexing where speed and memory efficiency are priorities. The WordVectors class is loaded automatically when the configured model path points to a static vectors model.
Code Reference
Source Location
- Repository: txtai
- File: src/python/txtai/vectors/dense/words.py
- Lines: L1-211
Class Definition
class WordVectors(Vectors):
    """
    Builds vectors using weighted word embeddings.
    """
Constructor Signature
def __init__(self, config, scoring, models):
The constructor checks for staticvectors availability (raising ImportError if missing), then delegates to the parent Vectors.__init__ which handles configuration parsing and calls self.loadmodel(path).
Import
from txtai.vectors.dense import WordVectors
Module-Level Multiprocessing Functions
Two module-level functions support multiprocessing by managing global state in worker subprocesses:
create(config, scoring)
def create(config, scoring):
Pool initializer that stores model parameters (config, scoring, None) in the global PARAMETERS variable and resets VECTORS to None for lazy loading.
transform(document)
def transform(document):
Pool worker function that lazily creates a WordVectors instance from global PARAMETERS on first call, then transforms the document into an embedding vector. Returns (document_id, embedding).
# Lazy initialization pattern
global VECTORS
if not VECTORS:
    VECTORS = WordVectors(*PARAMETERS)
return (document[0], VECTORS.transform(document))
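The initializer and worker pattern described above can be sketched in isolation. This is a minimal illustration, not the txtai source: ToyModel, MODEL, run and the scale parameter are hypothetical names standing in for the real WordVectors instance and its configuration.

```python
from multiprocessing import Pool

# Module-level globals: each worker process holds its own copy
PARAMETERS, MODEL = None, None


class ToyModel:
    """Hypothetical stand-in for a model that is expensive to load."""

    def __init__(self, scale):
        self.scale = scale

    def transform(self, document):
        # Toy "embedding": the text length multiplied by scale
        return [self.scale * len(document[1])]


def create(scale):
    """Pool initializer: stores parameters only and defers model loading."""
    global PARAMETERS, MODEL
    PARAMETERS, MODEL = (scale,), None


def transform(document):
    """Pool worker: builds the model on first call, then reuses it."""
    global MODEL
    if not MODEL:
        MODEL = ToyModel(*PARAMETERS)
    return (document[0], MODEL.transform(document))


def run(documents, processes=2, chunksize=2):
    """Maps (id, text) documents across worker processes in input order."""
    with Pool(processes, initializer=create, initargs=(2.0,)) as pool:
        return list(pool.imap(transform, documents, chunksize))
```

On platforms that start workers with the spawn method, run() should be called under an `if __name__ == "__main__":` guard. Deferring construction to the first transform() call means only the small parameter tuple is passed to each worker, and each worker loads its model once.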
Static Methods
ismodel(path)
@staticmethod
def ismodel(path):
Checks if path is a WordVectors model by:
- Checking if the path is a SQLite database via isdatabase().
- Downloading config.json from Hugging Face Hub and checking for "model_type": "staticvectors".
Returns True if either check passes.
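The config.json half of this check can be sketched without network access. The helper below is hypothetical (is_static_vectors_config is not a txtai function) and assumes the file contents have already been read into a string; the actual method downloads the file from the Hub first.

```python
import json


def is_static_vectors_config(text):
    """Returns True if a config.json payload declares model_type staticvectors."""
    try:
        config = json.loads(text)
    except (TypeError, ValueError):
        # Not valid JSON (or not a string at all)
        return False

    return isinstance(config, dict) and config.get("model_type") == "staticvectors"
```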
isdatabase(path)
@staticmethod
def isdatabase(path):
Returns True if the path is a string, staticvectors is available, and Database.isdatabase(path) confirms it is a SQLite database.
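One common way to recognize a SQLite file is to compare its first 16 bytes with the header string defined by the SQLite file format. The sketch below uses that approach as an assumption; how Database.isdatabase performs its check is not stated here, and looks_like_sqlite and make_demo_database are hypothetical helpers.

```python
import os
import sqlite3
import tempfile

# Every SQLite 3 database file starts with this 16-byte header
SQLITE_HEADER = b"SQLite format 3\x00"


def looks_like_sqlite(path):
    """Returns True if path is a string naming a file with a SQLite header."""
    if not isinstance(path, str) or not os.path.isfile(path):
        return False

    with open(path, "rb") as handle:
        return handle.read(len(SQLITE_HEADER)) == SQLITE_HEADER


def make_demo_database():
    """Creates a small throwaway SQLite database and returns its path."""
    path = os.path.join(tempfile.mkdtemp(), "vectors.db")
    connection = sqlite3.connect(path)
    connection.execute("CREATE TABLE vectors (token TEXT, data BLOB)")
    connection.commit()
    connection.close()
    return path
```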
Key Instance Methods
loadmodel(path)
def loadmodel(self, path):
    return StaticVectors(path)
Loads a StaticVectors model from the given path. Called by the parent Vectors.__init__.
encode(data, category=None)
def encode(self, data, category=None):
Core vectorization method. For each element in data:
- Tokenize strings via Tokenizer.tokenize(); fall back to the raw string if tokenization yields an empty list.
- Generate weights using self.scoring.weights(tokens) if a scoring method is available.
- Compute a weighted average of word vectors if weights are available and non-zero, or a simple mean otherwise.
Returns a NumPy array of shape (len(data), dimensions) with float32 dtype.
if weights and [x for x in weights if x > 0]:
    embedding = np.average(self.lookup(tokens), weights=np.array(weights, dtype=np.float32), axis=0)
else:
    embedding = np.mean(self.lookup(tokens), axis=0)
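The encode steps can be tried end to end with a toy embedding table. This is a sketch under stated assumptions: TABLE, UNKNOWN, pool and the weigher argument are hypothetical, and lowercased whitespace splitting stands in for Tokenizer.tokenize(); only the weighted-average branch mirrors the snippet above.

```python
import numpy as np

# Hypothetical 3-dimensional embedding table
TABLE = {
    "machine": np.array([1.0, 0.0, 0.0], dtype=np.float32),
    "learning": np.array([0.0, 1.0, 0.0], dtype=np.float32),
}
UNKNOWN = np.zeros(3, dtype=np.float32)


def lookup(tokens):
    """Returns a (len(tokens), 3) array, using zeros for unknown tokens."""
    return np.array([TABLE.get(token, UNKNOWN) for token in tokens], dtype=np.float32)


def pool(tokens, weights=None):
    """Weighted average when any weight is positive, simple mean otherwise."""
    vectors = lookup(tokens)
    if weights and [x for x in weights if x > 0]:
        return np.average(vectors, weights=np.array(weights, dtype=np.float32), axis=0)
    return np.mean(vectors, axis=0)


def encode(data, weigher=None):
    """Encodes strings or pre-tokenized lists into a (len(data), 3) array."""
    embeddings = []
    for item in data:
        # Stand-in tokenizer: lowercase and split on whitespace
        tokens = item.lower().split() if isinstance(item, str) else item
        # Fall back to the raw input if tokenization produced nothing
        tokens = tokens if tokens else [item]
        weights = weigher(tokens) if weigher else None
        embeddings.append(pool(tokens, weights))
    return np.array(embeddings, dtype=np.float32)
```

With weights [3, 1] for "machine learning", the first word contributes three quarters of the pooled vector; with all-zero weights the code falls through to the plain mean, which avoids dividing by a zero weight sum.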
index(documents, batchsize=500, checkpoint=None)
def index(self, documents, batchsize=500, checkpoint=None):
Builds an embeddings index using multiprocessing. The parallelism level is controlled by the parallel config key (defaults to os.cpu_count()). If parallel is falsy, falls back to single-process indexing via the parent class.
The multiprocessing pipeline:
- Creates a Pool with create as the initializer.
- Uses pool.imap(transform, documents, self.encodebatch) for ordered lazy mapping.
- Streams batches of embeddings to a temporary .npy file to control memory.
- Returns (ids, dimensions, batches, stream_path).
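The streaming step can be illustrated separately from the pool. The sketch below assumes each batch is appended to one temporary file with np.save and read back with the same number of np.load calls; stream_batches and read_batches are hypothetical helper names, not txtai functions.

```python
import tempfile

import numpy as np


def stream_batches(embeddings, batchsize):
    """Writes embeddings to a temporary .npy file in batches.

    Returns (dimensions, batches, path).
    """
    dimensions, batches, batch = None, 0, []

    with tempfile.NamedTemporaryFile(mode="wb", suffix=".npy", delete=False) as output:
        path = output.name

        def flush(rows):
            # Each call appends one self-describing array to the file
            array = np.array(rows, dtype=np.float32)
            np.save(output, array, allow_pickle=False)
            return array.shape[1]

        for embedding in embeddings:
            batch.append(embedding)
            if len(batch) == batchsize:
                dimensions, batches, batch = flush(batch), batches + 1, []

        # Write the final partial batch, if any
        if batch:
            dimensions, batches = flush(batch), batches + 1

    return dimensions, batches, path


def read_batches(path, batches):
    """Yields the stored arrays back in the order they were written."""
    with open(path, "rb") as stream:
        for _ in range(batches):
            yield np.load(stream, allow_pickle=False)
```

Only one batch is held in memory at a time while writing, which is why the batch count has to be returned alongside the path: the reader needs it to know how many arrays to load.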
lookup(tokens)
def lookup(self, tokens):
Queries word vectors for a list of tokens by calling self.model.embeddings(tokens). Returns a 2D array of shape (len(tokens), dimensions).
tokens(data)
def tokens(self, data):
    return data
Skips tokenization rules, returning data unchanged. Tokenization is handled inside encode() instead.
Inheritance Chain
WordVectors -> Vectors (with Recovery mixin)
The Vectors base class provides configuration parsing, model loading, batch encoding, instruction handling, dimensionality truncation, scalar quantization, and the single-process index() fallback.
Configuration
| Key | Type | Default | Description |
|---|---|---|---|
| path | str | Required | Path to static vectors model (Hugging Face Hub ID or local path). |
| parallel | bool or int | True | Multiprocessing parallelism. True uses os.cpu_count(). An integer sets the exact process count. False/0 disables multiprocessing. |
| encodebatch | int | 32 | Chunk size passed to pool.imap as its third argument. |
| tokenize | bool | None | Enable optional string tokenization rules. |
| dimensions | int | (auto-detected) | Number of embedding dimensions. |
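
The parallel setting described in the table can be resolved to a process count as follows. This is a sketch of the documented behavior rather than the txtai source; resolve_processes is a hypothetical helper, and a None return stands for the single-process fallback.

```python
import os


def resolve_processes(config):
    """Maps the parallel config value to a process count, or None if disabled."""
    parallel = config.get("parallel", True)

    # False, 0 and None all disable multiprocessing
    if not parallel:
        return None

    # bool is a subclass of int in Python, so test for True explicitly
    if parallel is True:
        return os.cpu_count()

    return int(parallel)
```

For example, {"path": "/path/to/vectors.db", "parallel": 4, "encodebatch": 32} would use four worker processes, while omitting parallel uses one process per CPU.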
Usage Examples
Basic Word Vector Encoding
from txtai.embeddings import Embeddings
# Create embeddings with a static word vectors model
embeddings = Embeddings({"path": "neuml/txtai-intro"})
embeddings.index([
    (0, "Machine learning classification", None),
    (1, "Database query optimization", None),
    (2, "Natural language understanding", None)
])
results = embeddings.search("text classification", limit=2)
for uid, score in results:
    print(f"ID: {uid}, Score: {score:.4f}")
Direct WordVectors Usage
from txtai.vectors.dense import WordVectors
# Check if a path is a word vectors model
if WordVectors.ismodel("neuml/txtai-intro"):
    print("Valid WordVectors model")
# Check if a path is a SQLite database
if WordVectors.isdatabase("/path/to/vectors.db"):
    print("SQLite word vectors database")