Principle:Neuml Txtai Production Deployment

Overview

Deploying txtai APIs to production involves selecting an appropriate hosting strategy based on scalability, latency, and cost requirements. txtai supports three primary deployment modes: Docker containerization for standard server deployments, AWS Lambda for serverless execution, and distributed clustering for horizontally sharding large indexes across multiple nodes. Each mode builds on the same YAML configuration and FastAPI application foundation.

Theoretical Foundation

Containerization

Docker containerization is the recommended approach for production txtai deployments. The container encapsulates:

The Python runtime and all dependencies
The txtai library and its model dependencies
Pre-cached model weights (downloaded at build time)
The YAML configuration file

By caching models in the container image during the build phase, txtai containers achieve fast cold starts -- the container can begin serving requests immediately without downloading multi-gigabyte model files at runtime.

The txtai Docker architecture layers two images, with the API image built on top of a shared base image:

Base image (docker/base/Dockerfile): installs Python, system libraries, PyTorch, and txtai with all dependencies
API image (docker/api/Dockerfile): extends the base image with a specific configuration, pre-caches models, and sets the uvicorn entrypoint

This separation allows the heavyweight base image to be reused across multiple application configurations.

Serverless Deployment

For workloads with variable traffic patterns, txtai supports AWS Lambda deployment through the Mangum ASGI adapter. Mangum translates AWS Lambda events into ASGI requests that FastAPI can process, enabling the same txtai application to run in a serverless context.

The serverless architecture involves:

The Lambda function handler wraps the FastAPI application with Mangum
AWS Lambda Runtime Interface Client (awslambdaric) manages the Lambda lifecycle
The start() function manually triggers the FastAPI lifespan handler (since Lambda does not use a standard ASGI server)
The container image includes pre-cached models to minimize cold start time

Key considerations for serverless txtai deployment:

Cold start latency: model loading adds significant startup time; pre-caching in the container mitigates this
Memory limits: Lambda functions must be configured with sufficient memory for model inference
Execution timeout: long-running operations (e.g., large index builds) may exceed Lambda's timeout
Statelessness: the embeddings index must be stored externally (e.g., S3) since Lambda instances are ephemeral

Distributed Search Sharding

For indexes that exceed a single node's capacity, txtai implements distributed clustering. A cluster aggregates multiple txtai API instances (shards) into a single logical embeddings index:

Write operations distribute documents across shards by hashing each document ID (Adler-32 for string IDs) and taking the result modulo the number of shards
Read operations fan out queries to all shards in parallel and aggregate results
Count operations sum counts across all shards

The sharding strategy ensures:

Even distribution: documents are spread roughly evenly across shards based on their ID hash
Deterministic placement: the same document ID always maps to the same shard
Parallel execution: queries run concurrently on all shards via async HTTP requests
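
The placement rule can be sketched in a few lines of Python. This is a minimal illustration, assuming string IDs are hashed with zlib.adler32 and reduced modulo the shard count; the helper name shard_for and the sample IDs are hypothetical, and the actual Cluster class may treat integer or missing IDs differently.

```python
import zlib


def shard_for(uid: str, shard_count: int) -> int:
    """Map a string document ID to a shard index (hypothetical helper)."""
    # Adler-32 returns an unsigned 32-bit integer; the modulo selects a shard
    return zlib.adler32(uid.encode("utf-8")) % shard_count


# Hypothetical document IDs placed on a three-shard cluster
ids = ["doc-1", "doc-2", "doc-3", "doc-4"]
placement = {uid: shard_for(uid, 3) for uid in ids}
print(placement)
```

Because the shard index depends on the shard count, adding or removing a shard changes where most IDs land, so under this scheme a resized cluster would generally need its documents re-indexed.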

Deployment Architecture

Docker Deployment

The standard Docker deployment uses uvicorn as the ASGI server:

+----------------------+
|  Docker Container    |
|  +----------------+  |
|  | uvicorn        |  |
|  |  +----------+  |  |
|  |  | FastAPI  |  |  |
|  |  | txtai    |  |  |
|  |  +----------+  |  |
|  +----------------+  |
|  config.yml          |
|  cached models       |
+----------------------+

The container exposes a single HTTP port and serves the full txtai API. Scaling is achieved by running multiple container instances behind a load balancer.
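
Once a container is running, any HTTP client can query it. The sketch below uses only the Python standard library and assumes the container is reachable at http://localhost:8000 and that the configured index serves a /search endpoint accepting query and limit parameters; the helper names search_url and search are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def search_url(base: str, query: str, limit: int = 3) -> str:
    """Build the search URL for a running txtai API (hypothetical helper)."""
    params = urlencode({"query": query, "limit": limit})
    return f"{base.rstrip('/')}/search?{params}"


def search(base: str, query: str, limit: int = 3):
    """Send the query and decode the JSON response body."""
    with urlopen(search_url(base, query, limit), timeout=10) as response:
        return json.load(response)


# Assumed address of a locally published container port
print(search_url("http://localhost:8000", "feel good story", 1))
```

Behind a load balancer, the same call would target the balancer's address instead of an individual container.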

Serverless Deployment

The AWS Lambda deployment replaces uvicorn with Mangum:

+-----------------------------+
|  Lambda Container           |
|  +------------------------+ |
|  | awslambdaric           | |
|  |  +-------------------+ | |
|  |  | Mangum            | | |
|  |  |  +--------------+ | | |
|  |  |  | FastAPI      | | | |
|  |  |  | txtai        | | | |
|  |  |  +--------------+ | | |
|  |  +-------------------+ | |
|  +------------------------+ |
|  config.yml                 |
|  cached models              |
+-----------------------------+

API Gateway or Lambda Function URLs route HTTP requests to the Lambda function, which processes them through the same FastAPI application.

Cluster Deployment

Distributed clustering uses a coordinator pattern:

                   +-------------------+
                   |  Coordinator API  |
                   |  (Cluster class)  |
                   +--------+----------+
                            |
              +-------------+-------------+
              |             |             |
     +--------v--+  +-------v---+  +------v----+
     |  Shard 0  |  |  Shard 1  |  |  Shard 2  |
     |  txtai    |  |  txtai    |  |  txtai    |
     |  API      |  |  API      |  |  API      |
     +-----------+  +-----------+  +-----------+

The coordinator does not hold any embeddings data itself. It distributes writes and fans out reads to the underlying shards, then aggregates results.
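
The fan-out and aggregation steps can be illustrated without any network calls. In the sketch below, in-memory dictionaries stand in for shards, and search_shard stands in for the HTTP request a coordinator would send; every name and score here is hypothetical and chosen for illustration, not taken from the Cluster class.

```python
import asyncio

# Hypothetical in-memory shards: each maps a document ID to a fixed score
SHARDS = [
    {"a": 0.91, "b": 0.40},
    {"c": 0.75},
    {"d": 0.88, "e": 0.10},
]


async def search_shard(shard: dict, query: str, limit: int) -> list:
    """Stand-in for an HTTP search request to a single shard."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    ranked = sorted(shard.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


async def cluster_search(query: str, limit: int) -> list:
    """Query every shard concurrently, then merge the results by score."""
    per_shard = await asyncio.gather(
        *(search_shard(shard, query, limit) for shard in SHARDS)
    )
    merged = [hit for hits in per_shard for hit in hits]
    merged.sort(key=lambda item: item[1], reverse=True)
    return merged[:limit]


def cluster_count() -> int:
    """Sum the per-shard document counts."""
    return sum(len(shard) for shard in SHARDS)


results = asyncio.run(cluster_search("example query", 3))
print(results, cluster_count())
```

Each shard returns up to limit hits, so the coordinator always has enough candidates to produce a correct global top-limit list after merging.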

Configuration Patterns

Docker Configuration

# config.yml for Docker deployment
path: /data/index
writable: true

embeddings:
  path: sentence-transformers/all-MiniLM-L6-v2
  content: true

Cluster Configuration

# config.yml for coordinator node
cluster:
  shards:
    - http://shard-0:8000
    - http://shard-1:8000
    - http://shard-2:8000

Lambda Configuration

The Lambda deployment uses the same YAML configuration as Docker, but the handler is different:

from mangum import Mangum
from txtai.api import app, start

# Manually trigger lifespan startup
start()

# Wrap FastAPI app for Lambda
handler = Mangum(app, lifespan="off")

Model Caching Strategy

A critical production deployment technique is pre-caching models in the container image:

# Cache models during build (without loading index data)
RUN python -c "from txtai.api import API; API('config.yml', False)"

This line in the Dockerfile:

Instantiates the API class with loaddata=False
Downloads all referenced models (transformers, tokenizers, etc.)
Stores them in the container's filesystem (typically in ~/.cache)
Does not load or create any index data

The resulting container image is larger but starts much faster because no network downloads are needed at runtime.

Design Rationale

Why Multiple Deployment Modes

Different deployment contexts have different requirements:

Requirement	Docker	Lambda	Cluster
Low latency	Yes (persistent process)	No (cold starts)	Yes (parallel shards)
Cost efficiency at low traffic	No (always running)	Yes (pay per request)	No (multiple nodes)
Large index support	Limited by node memory	Very limited	Yes (horizontal scaling)
Operational simplicity	High	Medium	Lower (multiple nodes)
Auto-scaling	Via orchestrator	Built-in	Manual shard management

Why Adler-32 for Sharding

The Cluster class uses Adler-32 (via zlib.adler32) to hash string document IDs to shard indices. Adler-32 was chosen because:

It is fast -- significantly faster than cryptographic hashes
It provides reasonable distribution for typical document ID patterns
It is deterministic -- the same ID always maps to the same shard
It is available in Python's standard library with no additional dependencies
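
The distribution claim is easy to check empirically for a given ID scheme. The following sketch counts how many of 10,000 hypothetical IDs of the form doc-N land on each of three shards, assuming the hash-modulo placement described earlier; real ID patterns may distribute differently, so it is worth running this kind of check against representative IDs.

```python
import zlib
from collections import Counter

SHARD_COUNT = 3  # assumed cluster size for this illustration

# Hypothetical sequential document IDs
ids = [f"doc-{i}" for i in range(10_000)]

# Tally how many IDs map to each shard index
counts = Counter(zlib.adler32(uid.encode("utf-8")) % SHARD_COUNT for uid in ids)

for shard in range(SHARD_COUNT):
    print(f"shard {shard}: {counts[shard]} documents")
```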
