
Implementation:Neuml Txtai WordVectors

From Leeroopedia


Knowledge Sources
Domains Word_Embeddings, Text_Encoding
Last Updated 2026-02-09 00:00 GMT

Overview

Static word embeddings with SIF-style weighting, multiprocessing-based indexing, and SQLite-backed vector storage.

Description

The WordVectors class implements text vectorization using pre-trained static word embeddings (e.g., GloVe, word2vec-style models in the staticvectors format). Unlike transformer-based encoders that produce contextual embeddings, WordVectors builds a single vector per text by aggregating individual word vectors using either a weighted average (when a scoring method provides token weights, similar to Smooth Inverse Frequency weighting) or a simple mean when no weights are available.

The class provides two model detection static methods. ismodel(path) checks whether a path points to a WordVectors model by examining either the SQLite database format or the config.json on Hugging Face Hub for "model_type": "staticvectors". isdatabase(path) checks specifically for a SQLite database file used as the word vectors storage format.

The index method implements a multiprocessing pipeline for efficient batch vectorization. It uses Python's multiprocessing.Pool with the module-level create() initializer and transform() worker function. Each subprocess lazily loads its own WordVectors instance via global parameters and transforms documents in parallel. Embeddings are streamed to a temporary NumPy file on disk to control memory usage, with configurable batch sizes.

The encode method handles the core vectorization: tokenizing text (if needed), looking up word vectors via self.model.embeddings(tokens), and computing a weighted average (using the scoring method's token weights) or mean.

Usage

Use this vector backend for lightweight, fast text vectorization when transformer-based models are not required. It is particularly suitable for large-scale indexing where speed and memory efficiency are priorities. The WordVectors class is loaded automatically when the configured model path points to a static vectors model.

Code Reference

Source Location

  • Repository: txtai
  • File: src/python/txtai/vectors/dense/words.py
  • Lines: L1-211

Class Definition

class WordVectors(Vectors):
    """
    Builds vectors using weighted word embeddings.
    """

Constructor Signature

def __init__(self, config, scoring, models):

The constructor checks for staticvectors availability (raising ImportError if missing), then delegates to the parent Vectors.__init__ which handles configuration parsing and calls self.loadmodel(path).

Import

from txtai.vectors.dense import WordVectors

Module-Level Multiprocessing Functions

Two module-level functions support multiprocessing by managing global state in worker subprocesses:

create(config, scoring)

def create(config, scoring):

Pool initializer that stores model parameters (config, scoring, None) in the global PARAMETERS variable and resets VECTORS to None for lazy loading.

transform(document)

def transform(document):

Pool worker function that lazily creates a WordVectors instance from global PARAMETERS on first call, then transforms the document into an embedding vector. Returns (document_id, embedding).

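def transform(document):
    # Lazy initialization pattern: build the model once per worker process
    global VECTORS
    if not VECTORS:
        VECTORS = WordVectors(*PARAMETERS)

    # Pair the document id with its embedding
    return (document[0], VECTORS.transform(document))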

Static Methods

ismodel(path)

@staticmethod
def ismodel(path):

Checks if path is a WordVectors model by:

  1. Checking if the path is a SQLite database via isdatabase().
  2. Downloading config.json from Hugging Face Hub and checking for "model_type": "staticvectors".

Returns True if either check passes.
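
The second check can be illustrated on a local file. This is a minimal sketch, assuming config.json has already been downloaded; the helper name isstaticvectors is hypothetical and the Hugging Face Hub download step is left out.

```python
import json
import os
import tempfile


def isstaticvectors(configpath):
    """Hypothetical helper: True if a local config.json declares model_type "staticvectors"."""
    try:
        with open(configpath, encoding="utf-8") as f:
            config = json.load(f)
    except (OSError, ValueError):
        # Missing file or invalid JSON: not a static vectors model
        return False

    return isinstance(config, dict) and config.get("model_type") == "staticvectors"


# Write a toy config.json and run the check against it
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"model_type": "staticvectors"}, f)

    found = isstaticvectors(path)
    missing = isstaticvectors(os.path.join(directory, "absent.json"))
```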

isdatabase(path)

@staticmethod
def isdatabase(path):

Returns True if the path is a string, staticvectors is available, and Database.isdatabase(path) confirms it is a SQLite database.
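
One common way to detect a SQLite file is to compare its first 16 bytes with the magic string defined by the SQLite file format. The sketch below takes that approach; it is an assumption for illustration and not necessarily how Database.isdatabase is implemented. The helper name issqlite and the table layout are hypothetical.

```python
import os
import sqlite3
import tempfile

# Every SQLite 3 database file starts with this 16-byte header
SQLITE_HEADER = b"SQLite format 3\x00"


def issqlite(path):
    """Hypothetical helper: True if path is a file that starts with the SQLite header."""
    if not isinstance(path, str) or not os.path.isfile(path):
        return False

    with open(path, "rb") as f:
        return f.read(16) == SQLITE_HEADER


with tempfile.TemporaryDirectory() as directory:
    # Create a small database so the header is written to disk
    dbpath = os.path.join(directory, "vectors.db")
    connection = sqlite3.connect(dbpath)
    connection.execute("CREATE TABLE vectors (token TEXT PRIMARY KEY, data BLOB)")
    connection.commit()
    connection.close()

    # A plain text file for comparison
    textpath = os.path.join(directory, "notes.txt")
    with open(textpath, "w", encoding="utf-8") as f:
        f.write("not a database")

    isdb = issqlite(dbpath)
    istext = issqlite(textpath)
    isnone = issqlite(None)
```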

Key Instance Methods

loadmodel(path)

def loadmodel(self, path):
    return StaticVectors(path)

Loads a StaticVectors model from the given path. Called by the parent Vectors.__init__.

encode(data, category=None)

def encode(self, data, category=None):

Core vectorization method. For each element in data:

  1. Tokenize strings via Tokenizer.tokenize(); if tokenization yields an empty list, fall back to a single-element list containing the raw string.
  2. Generate weights using self.scoring.weights(tokens) if a scoring method is available.
  3. Compute a weighted average of word vectors if weights are available and non-zero, or a simple mean otherwise.

Returns a NumPy array of shape (len(data), dimensions) with float32 dtype.

if weights and [x for x in weights if x > 0]:
    embedding = np.average(self.lookup(tokens), weights=np.array(weights, dtype=np.float32), axis=0)
else:
    embedding = np.mean(self.lookup(tokens), axis=0)
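
The branch above can be tried on toy data. This is a minimal sketch, assuming three made-up 2-dimensional word vectors and hand-picked weights; the function name aggregate is hypothetical, while the condition and NumPy calls follow the snippet above.

```python
import numpy as np


def aggregate(vectors, weights=None):
    """Weighted average when any weight is positive, otherwise a simple mean."""
    if weights and [x for x in weights if x > 0]:
        return np.average(vectors, weights=np.array(weights, dtype=np.float32), axis=0)

    return np.mean(vectors, axis=0)


# Toy word vectors, one row per token
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype=np.float32)

weighted = aggregate(vectors, [3.0, 1.0, 0.0])  # first token dominates
mean = aggregate(vectors)                       # no weights available
fallback = aggregate(vectors, [0.0, 0.0, 0.0])  # all-zero weights use the mean
```

The all-zero case matters because np.average raises ZeroDivisionError when the weights sum to zero, which the positive-weight check avoids.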

index(documents, batchsize=500, checkpoint=None)

def index(self, documents, batchsize=500, checkpoint=None):

Builds an embeddings index using multiprocessing. The parallelism level is controlled by the parallel config key (defaults to os.cpu_count()). If parallel is falsy, falls back to single-process indexing via the parent class.

The multiprocessing pipeline:

  1. Creates a Pool with create as the initializer.
  2. Uses pool.imap(transform, documents, self.encodebatch) for ordered lazy mapping.
  3. Streams batches of embeddings to a temporary .npy file to control memory.
  4. Returns (ids, dimensions, batches, stream_path).
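
Steps 3 and 4 can be sketched as follows. This is a minimal illustration, assuming embeddings arrive as (id, vector) pairs from some upstream iterator; the function names stream and load are hypothetical and the pool itself is left out. Arrays are appended to one file with repeated np.save calls and read back with the same number of np.load calls.

```python
import os
import tempfile

import numpy as np


def stream(embeddings, batchsize):
    """Writes (id, vector) pairs to a temporary .npy file, one array per batch."""
    ids, dimensions, batches, batch = [], None, 0, []

    with tempfile.NamedTemporaryFile(mode="wb", suffix=".npy", delete=False) as output:
        for uid, embedding in embeddings:
            ids.append(uid)
            batch.append(embedding)
            dimensions = len(embedding)

            # Flush a full batch so only batchsize vectors stay in memory
            if len(batch) == batchsize:
                np.save(output, np.array(batch, dtype=np.float32), allow_pickle=False)
                batches, batch = batches + 1, []

        # Flush the final partial batch
        if batch:
            np.save(output, np.array(batch, dtype=np.float32), allow_pickle=False)
            batches += 1

        path = output.name

    return ids, dimensions, batches, path


def load(path, batches):
    """Reads back the arrays written by stream() and stacks them."""
    with open(path, "rb") as f:
        return np.concatenate([np.load(f, allow_pickle=False) for _ in range(batches)])


pairs = [(i, np.full(4, i, dtype=np.float32)) for i in range(5)]
ids, dimensions, batches, path = stream(iter(pairs), batchsize=2)
data = load(path, batches)
os.remove(path)
```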

lookup(tokens)

def lookup(self, tokens):

Queries word vectors for a list of tokens by calling self.model.embeddings(tokens). Returns a 2D array of shape (len(tokens), dimensions).

tokens(data)

def tokens(self, data):
    return data

Skips tokenization rules, returning data unchanged. Tokenization is handled inside encode() instead.

Inheritance Chain

WordVectors -> Vectors -> (with Recovery mixin)

The Vectors base class provides configuration parsing, model loading, batch encoding, instruction handling, dimensionality truncation, scalar quantization, and the single-process index() fallback.

Configuration

Key | Type | Default | Description
path | str | Required | Path to static vectors model (Hugging Face Hub ID or local path).
parallel | bool or int | True | Multiprocessing parallelism. True uses os.cpu_count(). An integer sets the exact process count. False/0 disables multiprocessing.
encodebatch | int | 32 | Chunk size passed to pool.imap, i.e. how many documents are sent to a worker at a time.
tokenize | bool | None | Enable optional string tokenization rules.
dimensions | int | (auto-detected) | Number of embedding dimensions.
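
How the parallel key maps to a process count can be sketched as below. This is a minimal illustration of the rules in the table, not the library's code; the function name processes is hypothetical and the path value is a placeholder.

```python
import os


def processes(config):
    """Hypothetical helper: resolve the parallel setting to a worker count (0 = disabled)."""
    parallel = config.get("parallel", True)

    # Booleans: True means one worker per CPU, False disables multiprocessing
    if isinstance(parallel, bool):
        return os.cpu_count() if parallel else 0

    # Integers set the exact process count
    return int(parallel)


config = {"path": "/path/to/vectors.db", "parallel": 4, "encodebatch": 32}
workers = processes(config)
```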

Usage Examples

Basic Word Vector Encoding

from txtai.embeddings import Embeddings

# Create embeddings with a static word vectors model
embeddings = Embeddings({"path": "neuml/txtai-intro"})

embeddings.index([
    (0, "Machine learning classification", None),
    (1, "Database query optimization", None),
    (2, "Natural language understanding", None)
])

results = embeddings.search("text classification", limit=2)
for uid, score in results:
    print(f"ID: {uid}, Score: {score:.4f}")

Direct WordVectors Usage

from txtai.vectors.dense import WordVectors

# Check if a path is a word vectors model
if WordVectors.ismodel("neuml/txtai-intro"):
    print("Valid WordVectors model")

# Check if a path is a SQLite database
if WordVectors.isdatabase("/path/to/vectors.db"):
    print("SQLite word vectors database")
