Implementation: Neuml Txtai Embeddings Upsert
| Knowledge Sources | |
|---|---|
| Domains | Semantic_Search, NLP |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A concrete tool, provided by the txtai library, for incrementally updating an existing embeddings index.
Description
The Embeddings.upsert method adds new documents to or updates existing documents within a live embeddings index without performing a full rebuild. If the index is empty (i.e., count() returns 0), the method falls through to a standard index() call. Otherwise, it creates a Transform and Stream configured for UPSERT action, which causes the database layer to delete any existing records with matching IDs before inserting the new versions. New document vectors are appended to the existing ANN index rather than rebuilding it. The method also updates all auxiliary structures: ID mappings, sparse scoring index, subindexes, and graph.
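The delete-then-insert semantics described above can be sketched with a toy, dependency-free model. This is not the real txtai implementation; ToyIndex, its records dict, and its method bodies are illustrative stand-ins for the database, ANN index, and auxiliary structures:

```python
class ToyIndex:
    """Hypothetical stand-in for an embeddings index (illustration only)."""

    def __init__(self):
        # id -> text; stands in for the database rows plus ANN vectors
        self.records = {}

    def index(self, documents):
        # Full rebuild: everything is replaced
        self.records = {uid: text for uid, text in documents}

    def upsert(self, documents):
        if not self.records:
            # Empty index: fall through to a standard index() call
            self.index(documents)
            return
        for uid, text in documents:
            # Delete-then-insert: assignment replaces any record with a
            # matching ID; new IDs are appended
            self.records[uid] = text

    def count(self):
        return len(self.records)

idx = ToyIndex()
idx.upsert([(0, "first"), (1, "second")])      # empty, behaves like index()
idx.upsert([(1, "second v2"), (2, "third")])   # updates ID 1, appends ID 2
print(idx.count())       # 3
print(idx.records[1])    # second v2
```

The key point the sketch captures is that upsert never touches records whose IDs are absent from the incoming batch, which is what makes it cheaper than a full rebuild.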
Usage
Use this method to add new documents or update existing documents in an already-built index. This is the preferred approach for maintaining a live search index that receives continuous updates. It avoids the computational cost of a full rebuild while keeping the index up to date. For initial index creation or full corpus replacement, use index() instead.
Code Reference
Source Location
- Repository: txtai
- File: src/python/txtai/embeddings/base.py
- Lines: 155-201
Signature
def upsert(self, documents, checkpoint=None):
Import
from txtai.embeddings import Embeddings
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| documents | iterable | Yes | An iterable of documents to insert or update. Accepts the same formats as index(): (id, data, tags) tuples, (id, data) tuples, dict objects, or plain str values. Documents whose IDs already exist in the index will have their old entries deleted and replaced with the new versions. Documents with new IDs are appended. |
| checkpoint | str or None | No | Path to a checkpoint directory for restart capability on large upsert operations. When provided, progress is periodically saved so the operation can be resumed from the last checkpoint if interrupted. Defaults to None (no checkpointing). |
Outputs
| Name | Type | Description |
|---|---|---|
| (none) | None | This method returns None. It operates via side effects: new vectors are appended to self.ann (not rebuilt from scratch), self.ids is updated with new ID mappings (when content is disabled), self.scoring is upserted if a sparse index is configured, self.indexes subindexes are upserted if configured, and self.graph is upserted if configured. If the index was empty before the call, the method delegates to self.index() and the side effects are the same as a full index build. |
Usage Examples
Basic Example
from txtai.embeddings import Embeddings
# Create and build initial index
embeddings = Embeddings({
"path": "sentence-transformers/all-MiniLM-L6-v2",
"content": True
})
embeddings.index([
(0, "Introductory machine learning concepts", None),
(1, "Advanced deep learning techniques", None)
])
print(embeddings.count())
# Output: 2
# Upsert: add a new document and update an existing one
embeddings.upsert([
(1, "Updated: state-of-the-art deep learning methods", None), # Updates existing ID 1
(2, "Reinforcement learning for robotics", None) # New document
])
print(embeddings.count())
# Output: 3
# Verify the update took effect
results = embeddings.search("deep learning", limit=1)
print(results[0]["text"])
# Output: Updated: state-of-the-art deep learning methods
Example: Upsert on Empty Index
from txtai.embeddings import Embeddings
# Create an embeddings instance without indexing
embeddings = Embeddings({
"path": "sentence-transformers/all-MiniLM-L6-v2",
"content": True
})
# Upsert on an empty index triggers a full index() call internally
embeddings.upsert([
(0, "First document in the collection", None),
(1, "Second document in the collection", None)
])
print(embeddings.count())
# Output: 2
Example: Continuous Updates
from txtai.embeddings import Embeddings
embeddings = Embeddings({
"path": "sentence-transformers/all-MiniLM-L6-v2",
"content": True
})
# Initial index
embeddings.index([
(0, "Breaking news: market rally continues", None),
(1, "Weather forecast: sunny skies ahead", None)
])
# Simulate continuous updates as new articles arrive
batch_1 = [
(2, "Tech stocks surge on earnings reports", None),
(3, "New climate report raises concerns", None)
]
embeddings.upsert(batch_1)
batch_2 = [
(4, "Sports: championship finals tonight", None),
(0, "Updated: market rally pauses after midday", None) # Updates article 0
]
embeddings.upsert(batch_2)
print(embeddings.count())
# Output: 5
# Save the incrementally updated index
embeddings.save("/data/live_index")
Example: Large Upsert with Checkpoint
from txtai.embeddings import Embeddings
embeddings = Embeddings({
"path": "sentence-transformers/all-MiniLM-L6-v2",
"content": True
})
# Build initial small index
embeddings.index([(i, f"Initial document {i}", None) for i in range(100)])
# Large upsert with checkpoint for reliability
def new_documents():
for i in range(100, 100000):
yield (i, f"New document {i} with detailed content for search", None)
embeddings.upsert(new_documents(), checkpoint="/tmp/upsert_checkpoint")
print(embeddings.count())
# Output: 100000