Implementation:Neuml Txtai WordVectors

Knowledge Sources	txtai, txtai Documentation
Domains	Word_Embeddings, Text_Encoding
Last Updated	2026-02-09 00:00 GMT

Overview

Static word embeddings with SIF-style weighting, multiprocessing-based indexing, and SQLite-backed vector storage.

Description

The WordVectors class implements text vectorization using pre-trained static word embeddings (e.g., GloVe, word2vec-style models in the staticvectors format). Unlike transformer-based encoders that produce contextual embeddings, WordVectors builds a single vector per text by aggregating individual word vectors using either a weighted average (when a scoring method provides token weights, similar to Smooth Inverse Frequency weighting) or a simple mean when no weights are available.

The class provides two static methods for model detection. ismodel(path) checks whether a path points to a WordVectors model: either the path is a SQLite database, or its config.json on the Hugging Face Hub contains "model_type": "staticvectors". isdatabase(path) checks only whether the path is a SQLite database file, the legacy storage format for word vectors.

The index method implements a multiprocessing pipeline for efficient batch vectorization. It uses Python's multiprocessing.Pool with the module-level create() initializer and transform() worker function. Each subprocess lazily loads its own WordVectors instance via global parameters and transforms documents in parallel. Embeddings are streamed to a temporary NumPy file on disk to control memory usage, with configurable batch sizes.

The encode method handles the core vectorization: tokenizing text (if needed), looking up word vectors via self.model.embeddings(tokens), and computing a weighted average (using the scoring method's token weights) or mean.

Usage

Use this vector backend for lightweight, fast text vectorization when transformer-based models are not required. It is particularly suitable for large-scale indexing where speed and memory efficiency are priorities. The WordVectors class is loaded automatically when the configured model path points to a static vectors model.

Code Reference

Source Location

Repository: txtai
File: src/python/txtai/vectors/dense/words.py
Lines: L1-211

Class Definition

class WordVectors(Vectors):
    """
    Builds vectors using weighted word embeddings.
    """

Constructor Signature

def __init__(self, config, scoring, models):

The constructor first checks that the staticvectors package is available and raises ImportError if it is missing. It then delegates to the parent Vectors.__init__, which handles configuration parsing and calls self.loadmodel(path).

Import

from txtai.vectors.dense import WordVectors

Module-Level Multiprocessing Functions

Two module-level functions support multiprocessing by managing global state in worker subprocesses:

create(config, scoring)

def create(config, scoring):

Pool initializer that stores model parameters (config, scoring, None) in the global PARAMETERS variable and resets VECTORS to None for lazy loading.

transform(document)

def transform(document):

Pool worker function that lazily creates a WordVectors instance from global PARAMETERS on first call, then transforms the document into an embedding vector. Returns (document_id, embedding).

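def transform(document):
    # Lazy initialization pattern: each worker builds its model on first call
    global VECTORS
    if not VECTORS:
        VECTORS = WordVectors(*PARAMETERS)

    # Return the document id alongside its embedding
    return (document[0], VECTORS.transform(document))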
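
The initializer and worker pair can be sketched without txtai as shown below. ToyVectors is a hypothetical stand-in for WordVectors, and the two functions are called directly in a single process to keep the example small. With multiprocessing.Pool(initializer=create, initargs=(config, scoring)), each worker process would run create once and build its own instance on its first transform call.

```python
# Module-level state; in a real pool each worker process holds its own copy
PARAMETERS, VECTORS = None, None


class ToyVectors:
    """Hypothetical stand-in for WordVectors; counts how often it is constructed."""

    created = 0

    def __init__(self, config, scoring, models):
        ToyVectors.created += 1
        self.dimensions = config["dimensions"]

    def transform(self, document):
        # Fake embedding: the text length repeated across all dimensions
        return [float(len(document[1]))] * self.dimensions


def create(config, scoring):
    """Pool initializer: store constructor arguments, defer model creation."""
    global PARAMETERS, VECTORS
    PARAMETERS, VECTORS = (config, scoring, None), None


def transform(document):
    """Pool worker: build the model on first use, then embed the document."""
    global VECTORS
    if not VECTORS:
        VECTORS = ToyVectors(*PARAMETERS)

    return (document[0], VECTORS.transform(document))


create({"dimensions": 2}, None)
results = [transform(doc) for doc in [(0, "abc", None), (1, "hello", None)]]
```

The model is constructed once even though transform runs twice, which is the point of deferring construction to the first call rather than doing it in the initializer.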

Static Methods

ismodel(path)

@staticmethod
def ismodel(path):

Checks if path is a WordVectors model by:

Checking if the path is a SQLite database via isdatabase().
Downloading config.json from Hugging Face Hub and checking for "model_type": "staticvectors".

Returns True if either check passes.
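
The second check can be illustrated with a minimal sketch that inspects a local config.json instead of downloading one from the Hub. The helper name is_static_vectors_config is hypothetical and is not part of txtai. Only the "model_type": "staticvectors" comparison comes from the description above.

```python
import json
import os
import tempfile


def is_static_vectors_config(path):
    """Return True if the JSON file at path declares model_type "staticvectors"."""
    try:
        with open(path, encoding="utf-8") as handle:
            config = json.load(handle)
    except (OSError, ValueError):
        # Missing file or invalid JSON: not a static vectors model
        return False

    return isinstance(config, dict) and config.get("model_type") == "staticvectors"


# Write two sample configs to a temporary directory and check each one
with tempfile.TemporaryDirectory() as directory:
    static = os.path.join(directory, "static.json")
    other = os.path.join(directory, "other.json")
    with open(static, "w", encoding="utf-8") as handle:
        json.dump({"model_type": "staticvectors"}, handle)
    with open(other, "w", encoding="utf-8") as handle:
        json.dump({"model_type": "bert"}, handle)

    static_result = is_static_vectors_config(static)
    other_result = is_static_vectors_config(other)
    missing_result = is_static_vectors_config(os.path.join(directory, "missing.json"))
```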

isdatabase(path)

@staticmethod
def isdatabase(path):

Returns True if the path is a string, staticvectors is available, and Database.isdatabase(path) confirms it is a SQLite database.
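
One common way to detect a SQLite file is to compare its first 16 bytes against the magic string defined by the SQLite file format. The sketch below takes that approach. It is an assumption about how such a check can be written, not a copy of Database.isdatabase, and the helper name looks_like_sqlite is hypothetical.

```python
import os
import sqlite3
import tempfile

# Every SQLite 3 database file starts with this 16-byte magic string
SQLITE_HEADER = b"SQLite format 3\x00"


def looks_like_sqlite(path):
    """Return True if path is a string naming a file that starts with the SQLite header."""
    if not isinstance(path, str) or not os.path.isfile(path):
        return False

    with open(path, "rb") as handle:
        return handle.read(len(SQLITE_HEADER)) == SQLITE_HEADER


with tempfile.TemporaryDirectory() as directory:
    # A real database: the header is written once a table is committed
    database = os.path.join(directory, "vectors.db")
    connection = sqlite3.connect(database)
    connection.execute("CREATE TABLE vectors (token TEXT, data BLOB)")
    connection.commit()
    connection.close()

    # A plain text file for comparison
    text = os.path.join(directory, "notes.txt")
    with open(text, "w", encoding="utf-8") as handle:
        handle.write("not a database")

    database_result = looks_like_sqlite(database)
    text_result = looks_like_sqlite(text)
    nonstring_result = looks_like_sqlite(None)
```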

Key Instance Methods

loadmodel(path)

def loadmodel(self, path):
    return StaticVectors(path)

Loads a StaticVectors model from the given path. Called by the parent Vectors.__init__.

encode(data, category=None)

def encode(self, data, category=None):

Core vectorization method. For each element in data:

Tokenize strings via Tokenizer.tokenize(); if tokenization yields an empty list, fall back to a single-element list containing the raw string.
Generate weights using self.scoring.weights(tokens) if a scoring method is available.
Compute a weighted average of word vectors if weights are available and non-zero, or a simple mean otherwise.

Returns a NumPy array of shape (len(data), dimensions) with float32 dtype.

if weights and [x for x in weights if x > 0]:
    embedding = np.average(self.lookup(tokens), weights=np.array(weights, dtype=np.float32), axis=0)
else:
    embedding = np.mean(self.lookup(tokens), axis=0)
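
The branch above can be tried in isolation with toy data. In this minimal sketch, TOY_VECTORS and the module-level lookup function are hypothetical stand-ins for a real model and for self.lookup. Mapping unknown tokens to zeros is an assumption of the toy, not a statement about staticvectors.

```python
import numpy as np

# Hypothetical 2-dimensional word vectors standing in for a real model
TOY_VECTORS = {
    "fast": np.array([1.0, 0.0], dtype=np.float32),
    "search": np.array([0.0, 1.0], dtype=np.float32),
}


def lookup(tokens):
    """Return a (len(tokens), dimensions) array; unknown tokens map to zeros in this toy."""
    return np.array([TOY_VECTORS.get(token, np.zeros(2, dtype=np.float32)) for token in tokens])


def aggregate(tokens, weights=None):
    """Weighted average when any weight is positive, plain mean otherwise."""
    if weights and [x for x in weights if x > 0]:
        return np.average(lookup(tokens), weights=np.array(weights, dtype=np.float32), axis=0)

    return np.mean(lookup(tokens), axis=0)


weighted = aggregate(["fast", "search"], [3.0, 1.0])  # (3 * [1, 0] + 1 * [0, 1]) / 4
mean = aggregate(["fast", "search"])                  # no weights: simple mean
fallback = aggregate(["fast", "search"], [0.0, 0.0])  # all-zero weights: simple mean
```

The check for a positive weight matters because np.average raises ZeroDivisionError when the weights sum to zero, so all-zero weights are routed to the mean instead.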

index(documents, batchsize=500, checkpoint=None)

def index(self, documents, batchsize=500, checkpoint=None):

Builds an embeddings index using multiprocessing. The parallelism level is controlled by the parallel config key (defaults to os.cpu_count()). If parallel is falsy, falls back to single-process indexing via the parent class.
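
The mapping from the parallel setting to a process count can be sketched as a small helper. The name resolve_processes is hypothetical, and the logic is inferred from the behavior described here rather than copied from the txtai source.

```python
import os


def resolve_processes(parallel=True):
    """Map the parallel setting to a process count; 0 means single-process indexing."""
    # True selects one process per CPU; bool is checked first because True == 1
    if parallel and isinstance(parallel, bool):
        return os.cpu_count()

    # Integers are used as-is; False and 0 both become 0
    return int(parallel)
```

A result of 0 corresponds to the fallback to the parent class's single-process index().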

The multiprocessing pipeline:

Creates a Pool with create as the initializer.
Uses pool.imap(transform, documents, self.encodebatch) for ordered lazy mapping.
Streams batches of embeddings to a temporary .npy file to control memory.
Returns (ids, dimensions, batches, stream_path).
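
The streaming step can be sketched with NumPy alone. In the example below, each batch is appended to one temporary .npy file with np.save and read back with repeated np.load calls on the same file handle. The helper names stream_batches and read_batches are hypothetical, and the multiprocessing part of the pipeline is left out so the sketch stays self-contained.

```python
import os
import tempfile

import numpy as np


def stream_batches(vectors, batchsize):
    """Save vectors to a temporary .npy file in batches; return (path, batches, dimensions)."""
    batches, dimensions, batch = 0, None, []

    with tempfile.NamedTemporaryFile(mode="wb", suffix=".npy", delete=False) as output:
        for vector in vectors:
            batch.append(vector)
            if len(batch) == batchsize:
                # Only one batch is held in memory at a time
                np.save(output, np.array(batch, dtype=np.float32), allow_pickle=False)
                batches, dimensions, batch = batches + 1, len(batch[0]), []

        # Write the final partial batch, if any
        if batch:
            np.save(output, np.array(batch, dtype=np.float32), allow_pickle=False)
            batches, dimensions = batches + 1, len(batch[0])

    return output.name, batches, dimensions


def read_batches(path, batches):
    """Load each saved batch in order and stack them into one array."""
    with open(path, "rb") as handle:
        return np.vstack([np.load(handle) for _ in range(batches)])


# Five 3-dimensional vectors written in batches of two
vectors = ([float(i)] * 3 for i in range(5))
path, batches, dimensions = stream_batches(vectors, batchsize=2)
embeddings = read_batches(path, batches)
os.remove(path)
```

Because the input is a generator and each batch is flushed to disk as soon as it fills, peak memory depends on the batch size rather than on the number of documents.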

lookup(tokens)

def lookup(self, tokens):

Queries word vectors for a list of tokens by calling self.model.embeddings(tokens). Returns a 2D array of shape (len(tokens), dimensions).

tokens(data)

def tokens(self, data):
    return data

Skips tokenization rules, returning data unchanged. Tokenization is handled inside encode() instead.

Inheritance Chain

WordVectors -> Vectors (with Recovery mixin)

The Vectors base class provides configuration parsing, model loading, batch encoding, instruction handling, dimensionality truncation, scalar quantization, and the single-process index() fallback.

Configuration

Key	Type	Default	Description
path	str	Required	Path to static vectors model (Hugging Face Hub ID or local path).
parallel	bool or int	True	Multiprocessing parallelism. True uses os.cpu_count() processes. An integer sets the exact process count. False or 0 disables multiprocessing.
encodebatch	int	32	Chunk size passed to pool.imap when distributing documents to worker processes.
tokenize	bool	None	Enable optional string tokenization rules.
dimensions	int	(auto-detected)	Number of embedding dimensions.

Usage Examples

Basic Word Vector Encoding

from txtai.embeddings import Embeddings

# Create embeddings with a static word vectors model
embeddings = Embeddings({"path": "neuml/glove-6B"})

embeddings.index([
    (0, "Machine learning classification", None),
    (1, "Database query optimization", None),
    (2, "Natural language understanding", None)
])

results = embeddings.search("text classification", limit=2)
for uid, score in results:
    print(f"ID: {uid}, Score: {score:.4f}")

Direct WordVectors Usage

from txtai.vectors.dense import WordVectors

# Check if a path is a word vectors model
if WordVectors.ismodel("neuml/glove-6B"):
    print("Valid WordVectors model")

# Check if a path is a SQLite database
if WordVectors.isdatabase("/path/to/vectors.db"):
    print("SQLite word vectors database")

Related Pages

Implements Principle

Principle:Neuml_Txtai_Word_Embedding_Vectorization

Requires Environment

Environment:Neuml_Txtai_Python_Vectors_Environment
