Principle: NeuML txtai Semantic Search
| Knowledge Sources | |
|---|---|
| Domains | Semantic_Search, NLP, Information_Retrieval |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Semantic search is the retrieval of documents based on the meaning of a query rather than exact keyword matching, using vector similarity to find conceptually related results.
Description
Traditional keyword search relies on lexical overlap between query terms and document terms. Semantic search overcomes this limitation by representing both queries and documents as vectors in a shared embedding space, where proximity corresponds to semantic similarity. A query about "automobile repair" will match documents about "car maintenance" even though they share no common keywords, because their vector representations are close in the embedding space.
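The notion of proximity in a shared embedding space can be illustrated with hand-made toy vectors (the three-dimensional vectors below are invented for illustration; a real system obtains them from an embedding model):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "embeddings": conceptually related phrases point in similar directions,
# even though the phrases share no keywords.
automobile_repair = [0.9, 0.1, 0.0]
car_maintenance = [0.8, 0.2, 0.1]
chocolate_cake = [0.0, 0.1, 0.9]

# cosine(automobile_repair, car_maintenance) is close to 1 (related),
# cosine(automobile_repair, chocolate_cake) is close to 0 (unrelated).
```

A real embedding model produces vectors with hundreds of dimensions, but the geometry is the same: semantic relatedness shows up as angular closeness.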
The search process begins by encoding the query text with the same embedding model used during indexing. The resulting query vector is then compared against the indexed document vectors using an Approximate Nearest Neighbor (ANN) algorithm, which returns the top-k most similar documents along with their similarity scores. Because ANN data structures have sublinear search complexity, this process typically completes in milliseconds even for indexes containing millions of documents.
Semantic search can be extended in several ways. Hybrid search combines dense vector similarity with sparse keyword scores (e.g., BM25) using a configurable weighting parameter, capturing both semantic and lexical relevance signals. SQL-like filtering allows results to be constrained by metadata predicates (e.g., date ranges, categories) applied after or during the similarity search. Graph-based search returns not just matching documents but the network of relationships around them. These extensions make semantic search applicable to a wide range of information retrieval scenarios.
Usage
Use semantic search when you need to retrieve documents by conceptual meaning rather than exact keyword matching. It is particularly effective when users express their information needs in natural language, when documents use varied terminology to describe the same concepts, or when the search domain requires understanding synonymy and paraphrasing. Hybrid search should be preferred when both exact term matching and semantic understanding are important.
Theoretical Basis
1. Query Encoding: The query string q is transformed into a vector v_q using the same embedding model M used during indexing:
v_q = M(q), where v_q in R^d
This ensures that queries and documents exist in the same vector space and are directly comparable.
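The requirement that queries and documents share one vector space can be made concrete with a stand-in encoder (the toy embed function below is invented purely for illustration; in practice M is a trained embedding model):

```python
def embed(text):
    """Toy stand-in for the embedding model M: maps any text to a fixed
    3-dimensional vector (vowel count, consonant count, character length).
    A real M is a trained neural model producing hundreds of dimensions."""
    lower = text.lower()
    vowels = sum(c in "aeiou" for c in lower)
    consonants = sum(c.isalpha() and c not in "aeiou" for c in lower)
    return [float(vowels), float(consonants), float(len(text))]

# The SAME encoder runs at index time (on documents) and at query time
# (on queries), so every vector lives in the same space R^3 and
# similarities between them are meaningful.
doc_vec = embed("car maintenance")
query_vec = embed("automobile repair")
```

If indexing and querying used different encoders, the dimensions would carry different meanings and similarity scores would be meaningless.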
2. Approximate Nearest Neighbor Retrieval: Given v_q and an index of n document vectors, the ANN algorithm returns the top-k approximate nearest neighbors:
results = ANN.query(v_q, k) = {(id_1, s_1), ..., (id_k, s_k)}
where s_i = sim(v_q, v_{id_i}) and s_1 >= s_2 >= ... >= s_k. The similarity function is typically cosine similarity (equivalent to dot product for normalized vectors).
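The retrieval step above can be sketched with an exact brute-force scan (ANN libraries such as Faiss or HNSW return approximately the same ranked list in sublinear time; the index contents here are invented toy data):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def topk(query_vec, index, k):
    """Exact nearest-neighbor retrieval: score every indexed vector and
    keep the k highest. This is the O(n) baseline that ANN structures
    approximate in sublinear time."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy index of three document vectors
index = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.0, 0.9, 0.1],
    "doc3": [0.8, 0.2, 0.1],
}
results = topk([1.0, 0.0, 0.0], index, k=2)  # [(id, score), ...], descending
```

The returned list matches the (id, score) tuple format described above, with s_1 >= s_2 >= ... >= s_k.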
3. Hybrid Scoring: When both dense and sparse indexes are available, the final score is a weighted combination:
s_hybrid(q, d) = w * s_dense(q, d) + (1 - w) * s_sparse(q, d)
where w in [0, 1] (default 0.5) controls the balance. Setting w = 1 yields pure dense search; w = 0 yields pure sparse search.
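The weighted combination is straightforward to express directly; one caveat worth noting is that sparse scores such as BM25 are unbounded while cosine similarity lies in [-1, 1], so real systems typically normalize both score distributions before mixing:

```python
def hybrid_score(s_dense, s_sparse, w=0.5):
    """Weighted combination of dense (vector) and sparse (e.g., BM25)
    relevance scores. w = 1 -> pure dense search; w = 0 -> pure sparse.
    Assumes both inputs have already been normalized to a common range."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be in [0, 1]")
    return w * s_dense + (1 - w) * s_sparse

# w = 1 keeps only the dense signal; w = 0 keeps only the sparse signal
dense_only = hybrid_score(0.8, 0.4, w=1.0)
sparse_only = hybrid_score(0.8, 0.4, w=0.0)
balanced = hybrid_score(0.8, 0.4, w=0.5)
```

Tuning w lets a deployment trade off exact term matching against semantic generalization without re-indexing.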
4. SQL Filtering: When a document database is present, queries can include SQL predicates:
SELECT id, text, score FROM txtai WHERE similar("query text") AND metadata_field = value
The SQL engine combines the similarity ranking with relational filtering to produce a final result set.
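How similarity ranking composes with relational filtering can be sketched as filtering during the scan in plain Python (the function and field names below are invented for illustration; a real SQL engine may push the predicate before or after the ANN lookup depending on its query plan):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def filtered_search(query_vec, index, metadata, predicate, k):
    """Similarity search constrained by a metadata predicate: only
    documents passing the predicate are scored and ranked."""
    scored = [
        (doc_id, cosine(query_vec, vec))
        for doc_id, vec in index.items()
        if predicate(metadata[doc_id])
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy index and metadata (invented values)
index = {"a": [0.9, 0.1], "b": [0.8, 0.3], "c": [0.1, 0.9]}
metadata = {
    "a": {"category": "repair"},
    "b": {"category": "cooking"},
    "c": {"category": "repair"},
}
results = filtered_search(
    [1.0, 0.0], index, metadata,
    lambda m: m["category"] == "repair", k=5,
)
```

Here "b" is excluded by the category predicate even though it scores highly on similarity, mirroring how a WHERE clause constrains the similar() ranking.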
5. Result Types: The output format depends on the index configuration:
- Index-only (no content storage): Returns [(id, score), ...] tuples
- Index + database (content enabled): Returns [{"id": ..., "text": ..., "score": ...}, ...] dicts
- Graph mode: Returns a subgraph object containing matching nodes and their relationships
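The index settings that select among these result types can be sketched as a configuration fragment (the key names path, content, hybrid, and graph follow txtai's configuration options; the model path shown is an example value, not a requirement):

```python
# Illustrative txtai-style configuration fragment
config = {
    "path": "sentence-transformers/all-MiniLM-L6-v2",  # embedding model
    "content": True,  # store text -> search returns dicts with "text" field
    "hybrid": True,   # combine dense and sparse (BM25) scoring
    "graph": True,    # build a graph -> graph-based search available
}
```

With content disabled, only (id, score) tuples can be returned, since the index stores vectors but not the original text.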