Principle: NeuML txtai Graph Network
| Knowledge Sources | |
|---|---|
| Domains | Knowledge_Graph, Graph_Analysis |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Graph Network is txtai's graph layer that constructs relationship networks from indexed documents, using community detection to automatically discover topics and enabling graph-based traversal of semantically connected content.
Description
Beyond flat vector search and SQL filtering, txtai provides a graph layer that models relationships between indexed documents as a network. When graph mode is enabled, txtai builds a graph where each document is a node and edges represent semantic relationships -- typically, documents that are sufficiently similar (above a configurable similarity threshold) or that share explicit metadata links. This transforms the document collection from a flat searchable index into a navigable knowledge graph where users can explore connections, discover related topics, and traverse semantic neighborhoods.
The graph layer is implemented through a Graph base class with two concrete backends:
- NetworkX -- In-process graph computation suitable for datasets up to a few hundred thousand documents. Provides the full NetworkX algorithm library for analysis.
- Apache AGE -- PostgreSQL-backed graph storage suitable for larger deployments that need persistent, queryable graph storage with Cypher query support and concurrent access.
Both backends implement the same operations: addNodes (create document nodes), addEdges (create relationship edges), search (find subgraphs matching a query), communities (detect topic clusters), and centrality (rank node importance).
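The shared interface can be sketched in a few lines with NetworkX. This is an illustrative sketch following the operation names in the text, not txtai's actual source; the class name and data shapes are assumptions.

```python
# Minimal sketch of a NetworkX-backed graph implementing the shared
# operations described above (illustrative; not txtai's implementation).
import networkx as nx
from networkx.algorithms.community import louvain_communities

class NetworkXGraph:
    def __init__(self):
        self.graph = nx.Graph()

    def addNodes(self, nodes):
        # nodes: list of (id, attributes) pairs, one per document
        self.graph.add_nodes_from(nodes)

    def addEdges(self, edges):
        # edges: list of (source, target, weight) similarity triples
        self.graph.add_weighted_edges_from(edges)

    def communities(self):
        # Detect topic clusters via community detection
        return louvain_communities(self.graph, seed=0)

    def centrality(self):
        # Rank node importance via PageRank
        return nx.pagerank(self.graph)

g = NetworkXGraph()
g.addNodes([(0, {"text": "doc a"}), (1, {"text": "doc b"}), (2, {"text": "doc c"})])
g.addEdges([(0, 1, 0.9), (1, 2, 0.8)])
```

An Apache AGE backend would expose the same methods but translate each call into Cypher statements executed against PostgreSQL.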
A key feature of the graph layer is automatic topic extraction via community detection. After the graph is constructed, a community detection algorithm (typically Louvain or Label Propagation) partitions the graph into densely connected clusters. Each cluster represents a topic -- a group of semantically related documents. The most central document in each community (by PageRank or degree centrality) serves as the topic representative. These automatically discovered topics can be used for faceted browsing, content organization, and exploratory search without any manual taxonomy creation.
The graph is constructed incrementally as documents are indexed. When new documents are added, the system computes their nearest neighbors from the existing index, creates edges to sufficiently similar existing documents, and assigns the new nodes to communities. When documents are deleted, their nodes and all incident edges are removed, and affected communities are optionally re-computed. This incremental approach avoids rebuilding the entire graph for each index update.
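The incremental update step can be sketched as follows. This is a simplified stand-in, assuming a plain dict of embeddings and brute-force cosine similarity where txtai would consult its ANN index; the function name and parameters are hypothetical.

```python
# Sketch of incremental graph construction: each new document gets edges
# to its nearest existing neighbors above a similarity threshold.
import numpy as np
import networkx as nx

def add_document(graph, embeddings, uid, vector, minscore=0.7, k=3):
    """Add a node and link it to similar existing documents."""
    graph.add_node(uid)
    if embeddings:
        ids = list(embeddings)
        matrix = np.array([embeddings[i] for i in ids])
        # Cosine similarity against all existing embeddings
        # (a real system would use the ANN index here instead)
        scores = matrix @ vector / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector))
        # Keep only the top-k neighbors that clear the threshold
        for idx in np.argsort(-scores)[:k]:
            if scores[idx] >= minscore:
                graph.add_edge(uid, ids[idx], weight=float(scores[idx]))
    embeddings[uid] = vector

graph, embeddings = nx.Graph(), {}
add_document(graph, embeddings, "a", np.array([1.0, 0.0]))
add_document(graph, embeddings, "b", np.array([0.9, 0.1]))  # links to "a"
add_document(graph, embeddings, "c", np.array([0.0, 1.0]))  # no edges
```

Deletion is the mirror image: `graph.remove_node(uid)` drops the node and all incident edges in one call, after which affected communities can be re-detected.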
Usage
Enable the graph layer when you need to explore relationships between documents, discover topical clusters within a corpus, or provide graph-based navigation in a search interface. The graph is particularly valuable for research corpora, knowledge bases, and document collections where understanding connections between items is as important as finding individual items. Disable the graph for simple search-and-retrieve workloads where relational structure is not needed, as graph construction adds indexing overhead proportional to the number of edges created.
Theoretical Basis
1. Graph Construction from Embeddings: Given a set of n document embeddings, the graph is constructed by computing pairwise similarity scores and creating edges between documents whose similarity exceeds a threshold t. To avoid an O(n^2) all-pairs computation, the ANN index is used to find the top-k nearest neighbors for each document, and edges are created only for pairs within this neighbor set that exceed the threshold. The resulting graph is sparse, with average degree proportional to k rather than n, keeping memory usage linear in the number of documents.
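The construction scheme above can be demonstrated end to end on a small embedding matrix. The exhaustive similarity matrix below stands in for the ANN top-k lookup, which is only practical here because n is tiny; the function name and defaults are assumptions.

```python
# Sparse graph construction from embeddings: top-k neighbors per document,
# thresholded at t, so average degree is ~k rather than n.
import numpy as np
import networkx as nx

def build_graph(vectors, k=2, t=0.5):
    # Normalize rows so dot products are cosine similarities
    norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = norm @ norm.T
    np.fill_diagonal(sims, -1.0)  # exclude self-matches
    graph = nx.Graph()
    graph.add_nodes_from(range(len(vectors)))
    for i, row in enumerate(sims):
        # Only the k nearest neighbors are candidates for edges
        for j in np.argsort(-row)[:k]:
            if row[j] >= t:
                graph.add_edge(i, int(j), weight=float(row[j]))
    return graph

# Two pairs of similar documents; cross-pair similarity is low
vectors = np.array([[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 0.9]])
g = build_graph(vectors)
```

With the threshold at 0.5, only the two within-pair edges survive, leaving a sparse two-component graph.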
2. Community Detection Algorithms: The Louvain algorithm optimizes modularity by iteratively moving nodes between communities to maximize the density of intra-community edges relative to inter-community edges. Modularity is defined as:
Q = (1/(2m)) * SUM_ij[(A_ij - k_i*k_j/(2m)) * delta(c_i, c_j)]
where A_ij is the adjacency matrix, k_i is the degree of node i, m is the total number of edges, and delta(c_i, c_j) is 1 if nodes i and j are in the same community and 0 otherwise. The algorithm runs in O(n log n) time. Label Propagation is a faster alternative that assigns each node the label held by the majority of its neighbors, converging in near-linear time but producing less stable results across runs.
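Both algorithms are available in NetworkX and can be compared on a graph with an obvious community structure, e.g. two cliques joined by a single bridge edge:

```python
# Community detection on two 4-cliques connected by one bridge edge.
import networkx as nx
from networkx.algorithms.community import (
    louvain_communities,
    label_propagation_communities,
    modularity,
)

graph = nx.barbell_graph(4, 0)  # nodes 0-3 and 4-7, bridge edge 3-4

# Louvain: maximizes modularity Q; seed fixes the random tie-breaking
communities = louvain_communities(graph, seed=0)

# Label Propagation: faster, but results can vary between runs
lp = list(label_propagation_communities(graph))

print(sorted(sorted(c) for c in communities))
print(round(modularity(graph, communities), 3))
```

On this graph, Louvain recovers the two cliques as communities; each clique would then be a topic, with its most central member as the representative.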
3. Centrality Metrics: Several centrality measures determine node importance within the graph:
- PageRank -- Models importance as the stationary distribution of a random walk with damping factor d (typically 0.85), favoring nodes linked to by other important nodes
- Degree centrality -- Counts the number of edges per node, normalized by n-1
- Betweenness centrality -- Measures how often a node lies on the shortest path between other nodes, identifying bridge documents that connect different topics
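The three measures above can be computed directly with NetworkX. The small example graph below is chosen so that two nodes act as a bridge between clusters:

```python
# The three centrality measures on a graph where nodes 2 and 3 bridge
# two triangles: {0,1,2} and {3,4,5}, joined by edge 2-3.
import networkx as nx

graph = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)])

pagerank = nx.pagerank(graph, alpha=0.85)        # damping factor d = 0.85
degree = nx.degree_centrality(graph)             # edges per node / (n - 1)
betweenness = nx.betweenness_centrality(graph)   # shortest-path bridging

# Every shortest path between the two triangles passes through nodes 2
# and 3, so they score highest on betweenness: they are the "bridge
# documents" connecting the two topics.
print(max(betweenness, key=betweenness.get))
```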
4. Topic-Graph Duality: The graph layer establishes a duality between topics and subgraphs. Each community (subgraph) defines a topic, and each topic corresponds to a densely connected region of the graph. This duality enables two query modes: (a) topic-first -- retrieve the community containing a query document and return all related documents, and (b) graph-first -- traverse edges from a seed document to explore the local neighborhood, regardless of topic boundaries.
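The two query modes can be sketched against the same two-clique graph; the function names here are illustrative, not txtai API:

```python
# Topic-first vs. graph-first retrieval on two 4-cliques joined by a
# bridge edge (illustrative sketch, not txtai's API).
import networkx as nx
from networkx.algorithms.community import louvain_communities

graph = nx.barbell_graph(4, 0)  # nodes 0-3 and 4-7, bridge edge 3-4
communities = louvain_communities(graph, seed=0)

def topic_first(node):
    # All documents in the seed's community (topic), however distant
    return next(c for c in communities if node in c)

def graph_first(node, hops=1):
    # Local neighborhood within `hops` edges, ignoring topic boundaries
    return set(nx.ego_graph(graph, node, radius=hops))

print(sorted(topic_first(3)))   # stays inside one topic
print(sorted(graph_first(3)))   # crosses the bridge into the other topic
```

Node 4 illustrates the difference: it is excluded from node 3's topic but included in its one-hop neighborhood, since traversal follows edges regardless of community membership.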
5. Graph Persistence and Querying: With the NetworkX backend, the graph is serialized alongside the embeddings index using Python's pickle format, producing a single file that can be loaded back into memory. With the Apache AGE backend, the graph is stored as a labeled property graph in PostgreSQL, enabling Cypher queries for complex traversal patterns:
MATCH (a)-[:SIMILAR_TO]->(b) WHERE a.topic = "NLP" RETURN b
The AGE backend also supports concurrent read access from multiple application instances.