
Implementation:Neuml Txtai Embeddings Upsert

From Leeroopedia


Knowledge Sources
Domains Semantic_Search, NLP
Last Updated 2026-02-09 00:00 GMT

Overview

A concrete tool from the txtai library for incrementally updating an existing embeddings index.

Description

The Embeddings.upsert method adds new documents to or updates existing documents within a live embeddings index without performing a full rebuild. If the index is empty (i.e., count() returns 0), the method falls through to a standard index() call. Otherwise, it creates a Transform and Stream configured for UPSERT action, which causes the database layer to delete any existing records with matching IDs before inserting the new versions. New document vectors are appended to the existing ANN index rather than rebuilding it. The method also updates all auxiliary structures: ID mappings, sparse scoring index, subindexes, and graph.
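
The delete-then-insert flow described above can be modeled with plain Python structures (a toy sketch, not txtai source; the `ann` list and `ids` dict stand in for the real ANN index and database layer):

```python
def upsert_sketch(ann, ids, documents):
    """Toy model of Embeddings.upsert semantics: records with matching
    IDs are deleted first, and new vectors are appended to the ANN
    structure rather than triggering a rebuild."""
    for uid, data in documents:
        if uid in ids:
            ann[ids[uid]] = None              # tombstone the old vector
        ids[uid] = len(ann)                   # point the ID at the new slot
        ann.append(("vector", data))          # append, never rebuild

ann, ids = [], {}
upsert_sketch(ann, ids, [(0, "intro ml"), (1, "deep learning")])
upsert_sketch(ann, ids, [(1, "updated deep learning"), (2, "rl robotics")])
print(len(ids))   # 3 live documents
```

Note how updating ID 1 leaves a tombstoned slot behind rather than compacting the vector store; this mirrors why upserts are cheap compared to a full index rebuild.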

Usage

Use this method to add new documents or update existing documents in an already-built index. This is the preferred approach for maintaining a live search index that receives continuous updates. It avoids the computational cost of a full rebuild while keeping the index up to date. For initial index creation or full corpus replacement, use index() instead.

Code Reference

Source Location

  • Repository: txtai
  • File: src/python/txtai/embeddings/base.py
  • Lines: L155-201

Signature

def upsert(self, documents, checkpoint=None):

Import

from txtai.embeddings import Embeddings

I/O Contract

Inputs

  • documents (iterable, required) — Documents to insert or update. Accepts the same formats as index(): (id, data, tags) tuples, (id, data) tuples, dict objects, or plain str values. Documents whose IDs already exist in the index have their old entries deleted and replaced with the new versions; documents with new IDs are appended.
  • checkpoint (str or None, optional) — Path to a checkpoint directory for restart capability on large upsert operations. When provided, progress is saved periodically so the operation can resume from the last checkpoint if interrupted. Defaults to None (no checkpointing).
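
The accepted document shapes can be illustrated with a small normalizer (a hypothetical helper, not part of txtai; it maps each format to a uniform `(id, data, tags)` tuple, and the sequential IDs assigned to bare values are an assumption of this sketch):

```python
def normalize(documents):
    """Hypothetical normalizer (not txtai code) mapping each accepted
    input shape to a uniform (id, data, tags) tuple."""
    out = []
    for autoid, doc in enumerate(documents):
        if isinstance(doc, tuple) and len(doc) == 3:      # (id, data, tags)
            out.append(doc)
        elif isinstance(doc, tuple) and len(doc) == 2:    # (id, data)
            out.append((doc[0], doc[1], None))
        elif isinstance(doc, dict):                       # dict object
            out.append((doc.get("id", autoid), doc, None))
        else:                                             # plain str value
            out.append((autoid, doc, None))
    return out

rows = normalize([
    (0, "tuple with tags", None),
    (1, "tuple without tags"),
    {"id": 2, "text": "dict object"},
    "plain string value",
])
```

All four shapes end up in the same three-field form, which is why they can be mixed freely in a single upsert() call.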

Outputs

This method returns None. It operates via side effects:

  • self.ann — new vectors are appended, not rebuilt from scratch
  • self.ids — updated with new ID mappings (when content storage is disabled)
  • self.scoring — upserted if a sparse scoring index is configured
  • self.indexes — subindexes are upserted if configured
  • self.graph — upserted if configured

If the index was empty before the call, the method delegates to self.index() and the side effects match a full index build.

Usage Examples

Basic Example

from txtai.embeddings import Embeddings

# Create and build initial index
embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})

embeddings.index([
    (0, "Introductory machine learning concepts", None),
    (1, "Advanced deep learning techniques", None)
])

print(embeddings.count())
# Output: 2

# Upsert: add a new document and update an existing one
embeddings.upsert([
    (1, "Updated: state-of-the-art deep learning methods", None),  # Updates existing ID 1
    (2, "Reinforcement learning for robotics", None)               # New document
])

print(embeddings.count())
# Output: 3

# Verify the update took effect
results = embeddings.search("deep learning", limit=1)
print(results[0]["text"])
# Output: Updated: state-of-the-art deep learning methods

Example: Upsert on Empty Index

from txtai.embeddings import Embeddings

# Create an embeddings instance without indexing
embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})

# Upsert on an empty index triggers a full index() call internally
embeddings.upsert([
    (0, "First document in the collection", None),
    (1, "Second document in the collection", None)
])

print(embeddings.count())
# Output: 2

Example: Continuous Updates

from txtai.embeddings import Embeddings

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})

# Initial index
embeddings.index([
    (0, "Breaking news: market rally continues", None),
    (1, "Weather forecast: sunny skies ahead", None)
])

# Simulate continuous updates as new articles arrive
batch_1 = [
    (2, "Tech stocks surge on earnings reports", None),
    (3, "New climate report raises concerns", None)
]
embeddings.upsert(batch_1)

batch_2 = [
    (4, "Sports: championship finals tonight", None),
    (0, "Updated: market rally pauses after midday", None)  # Updates article 0
]
embeddings.upsert(batch_2)

print(embeddings.count())
# Output: 5

# Save the incrementally updated index
embeddings.save("/data/live_index")

Example: Large Upsert with Checkpoint

from txtai.embeddings import Embeddings

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})

# Build initial small index
embeddings.index([(i, f"Initial document {i}", None) for i in range(100)])

# Large upsert with checkpoint for reliability
def new_documents():
    for i in range(100, 100000):
        yield (i, f"New document {i} with detailed content for search", None)

embeddings.upsert(new_documents(), checkpoint="/tmp/upsert_checkpoint")

print(embeddings.count())
# Output: 100000

Related Pages

Implements Principle

Requires Environment
