Principle:Neuml Txtai Production Deployment

From Leeroopedia


Overview

Deploying txtai APIs to production involves selecting an appropriate hosting strategy based on scalability, latency, and cost requirements. txtai supports three primary deployment modes: Docker containerization for standard server deployments, AWS Lambda for serverless execution, and distributed clustering for horizontally sharding large indexes across multiple nodes. Each mode builds on the same YAML configuration and FastAPI application foundation.

Theoretical Foundation

Containerization

Docker containerization is the recommended approach for production txtai deployments. The container encapsulates:

  • The Python runtime and all dependencies
  • The txtai library and its model dependencies
  • Pre-cached model weights (downloaded at build time)
  • The YAML configuration file

By caching models in the container image during the build phase, txtai containers avoid downloading multi-gigabyte model files at runtime. This shortens cold starts: a new container only has to load the weights from local disk before it can serve requests.

The txtai Docker architecture uses a layered image pattern, in which an application image extends a shared base image:

  1. Base image (docker/base/Dockerfile): installs Python, system libraries, PyTorch, and txtai with all dependencies
  2. API image (docker/api/Dockerfile): extends the base image with a specific configuration, pre-caches models, and sets the uvicorn entrypoint

This separation allows the heavyweight base image to be reused across multiple application configurations.

Serverless Deployment

For workloads with variable traffic patterns, txtai supports AWS Lambda deployment through the Mangum ASGI adapter. Mangum translates AWS Lambda events into ASGI requests that FastAPI can process, enabling the same txtai application to run in a serverless context.

The serverless architecture involves:

  1. The Lambda function handler wraps the FastAPI application with Mangum
  2. AWS Lambda Runtime Interface Client (awslambdaric) manages the Lambda lifecycle
  3. The start() function manually triggers the FastAPI lifespan handler (since Lambda does not use a standard ASGI server)
  4. The container image includes pre-cached models to minimize cold start time
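To make the adapter's role concrete, the following is a simplified sketch of the kind of translation an ASGI adapter performs. It is not Mangum's actual code: the function name `event_to_scope` is hypothetical, and the sketch assumes an event in the API Gateway HTTP API payload format 2.0 (the shape also used by Lambda Function URLs).

```python
from typing import Any, Dict


def event_to_scope(event: Dict[str, Any]) -> Dict[str, Any]:
    """Build a minimal ASGI HTTP scope from a payload-format-2.0 Lambda event.

    Illustrative only: a real adapter also handles request bodies,
    base64 encoding, cookies, and other event sources.
    """
    http = event["requestContext"]["http"]
    # ASGI expects headers as (name, value) pairs of bytes with lowercase names
    headers = [
        (name.lower().encode("latin-1"), value.encode("latin-1"))
        for name, value in event.get("headers", {}).items()
    ]
    return {
        "type": "http",
        "asgi": {"version": "3.0"},
        "http_version": "1.1",
        "method": http["method"],
        "path": event.get("rawPath", "/"),
        "query_string": event.get("rawQueryString", "").encode("latin-1"),
        "headers": headers,
    }


# Hypothetical event, shaped like an API Gateway HTTP API request
event = {
    "rawPath": "/search",
    "rawQueryString": "query=hello&limit=1",
    "headers": {"Accept": "application/json"},
    "requestContext": {"http": {"method": "GET"}},
}
scope = event_to_scope(event)
```

The resulting scope is what an ASGI application such as FastAPI receives for each request, regardless of whether it came from uvicorn or from a Lambda adapter.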

Key considerations for serverless txtai deployment:

  • Cold start latency: model loading adds significant startup time; pre-caching in the container image removes the download step, though the weights must still be loaded into memory
  • Memory limits: Lambda functions must be configured with sufficient memory for model inference (Lambda allows up to 10,240 MB per function)
  • Execution timeout: long-running operations (e.g., large index builds) may exceed Lambda's maximum timeout of 15 minutes
  • Statelessness: the embeddings index must be stored externally (e.g., S3) since Lambda instances are ephemeral

Distributed Search Sharding

For indexes that exceed a single node's capacity, txtai implements distributed clustering. A cluster aggregates multiple txtai API instances (shards) into a single logical embeddings index:

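  • Write operations distribute documents across shards using hash-based sharding (the Adler-32 checksum of the document ID, reduced to a shard index). This is a plain hash-to-index mapping rather than ring-based consistent hashing, so changing the number of shards changes where existing IDs map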
  • Read operations fan out queries to all shards in parallel and aggregate results
  • Count operations sum counts across all shards

The sharding strategy ensures:

  • Even distribution: documents are spread roughly evenly across shards based on their ID hash, provided the IDs are reasonably varied
  • Deterministic placement: the same document ID always maps to the same shard
  • Parallel execution: queries run concurrently on all shards via async HTTP requests
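The placement rule can be sketched in a few lines. This is a minimal illustration, assuming the shard index is the Adler-32 value of the UTF-8 encoded ID modulo the shard count; the helper names `shard_for` and `partition` are hypothetical and are not part of the txtai API.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List


def shard_for(uid: str, shard_count: int) -> int:
    """Map a string document ID to a shard index (Adler-32 modulo shard count)."""
    return zlib.adler32(uid.encode("utf-8")) % shard_count


def partition(uids: Iterable[str], shard_count: int) -> Dict[int, List[str]]:
    """Group document IDs by the shard that would receive them."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for uid in uids:
        groups[shard_for(uid, shard_count)].append(uid)
    return dict(groups)


# Hypothetical document IDs distributed over three shards
groups = partition(["doc-1", "doc-2", "doc-3", "doc-4"], 3)
```

Because the index depends on the shard count, adding or removing a shard under this scheme changes where most IDs map, so existing data would need to be redistributed.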

Deployment Architecture

Docker Deployment

The standard Docker deployment uses uvicorn as the ASGI server:

+---------------------+
|  Docker Container   |
|  +---------------+  |
|  | uvicorn       |  |
|  |  +---------+  |  |
|  |  | FastAPI |  |  |
|  |  | txtai   |  |  |
|  |  +---------+  |  |
|  +---------------+  |
|  | config.yml    |  |
|  | cached models |  |
|  +---------------+  |
+---------------------+

The container exposes a single HTTP port and serves the full txtai API. Scaling is achieved by running multiple container instances behind a load balancer.
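As an illustration of what a client sees, the sketch below builds and sends a request to the search endpoint using only the standard library. The base URL is a placeholder for wherever the container or load balancer is reachable, and the helper names are hypothetical; the sketch assumes the API exposes GET /search with query and limit parameters.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def search_url(base_url: str, query: str, limit: int = 3) -> str:
    """Build the URL for a GET /search request."""
    params = urlencode({"query": query, "limit": limit})
    return f"{base_url.rstrip('/')}/search?{params}"


def search(base_url: str, query: str, limit: int = 3):
    """Call the search endpoint and decode the JSON response."""
    with urlopen(search_url(base_url, query, limit), timeout=10) as response:
        return json.load(response)


# Placeholder address; replace with the container or load balancer address
url = search_url("http://localhost:8000/", "feel good story", limit=1)
```

Since every container instance serves the same API from the same configuration, the client does not need to know which instance behind the load balancer handles a given request.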

Serverless Deployment

The AWS Lambda deployment replaces uvicorn with Mangum:

+-----------------------------+
|  Lambda Container           |
|  +------------------------+ |
|  | awslambdaric           | |
|  |  +-------------------+ | |
|  |  | Mangum            | | |
|  |  |  +--------------+ | | |
|  |  |  | FastAPI      | | | |
|  |  |  | txtai        | | | |
|  |  |  +--------------+ | | |
|  |  +-------------------+ | |
|  +------------------------+ |
|  | config.yml             | |
|  | cached models          | |
|  +------------------------+ |
+-----------------------------+

API Gateway or Lambda Function URLs route HTTP requests to the Lambda function, which processes them through the same FastAPI application.

Cluster Deployment

Distributed clustering uses a coordinator pattern:

                   +-------------------+
                   |  Coordinator API  |
                   |  (Cluster class)  |
                   +--------+----------+
                            |
              +-------------+-------------+
              |             |             |
     +--------v--+  +-------v---+  +------v----+
     |  Shard 0  |  |  Shard 1  |  |  Shard 2  |
     |  txtai    |  |  txtai    |  |  txtai    |
     |  API      |  |  API      |  |  API      |
     +-----------+  +-----------+  +-----------+

The coordinator does not hold any embeddings data itself. It distributes writes and fans out reads to the underlying shards, then aggregates results.
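The fan-out and aggregation step can be sketched with asyncio. This is an illustration under stated assumptions rather than the Cluster class itself: each shard is modeled as an async callable returning (id, score) pairs, `fake_shard` stands in for an HTTP call to a shard, and results are merged by keeping the highest scores overall.

```python
import asyncio
import heapq
from typing import Awaitable, Callable, List, Tuple

Result = Tuple[str, float]  # (document id, score)
Shard = Callable[[str, int], Awaitable[List[Result]]]


async def cluster_search(shards: List[Shard], query: str, limit: int) -> List[Result]:
    """Query every shard concurrently and keep the top `limit` results overall."""
    per_shard = await asyncio.gather(*(shard(query, limit) for shard in shards))
    merged = [result for results in per_shard for result in results]
    return heapq.nlargest(limit, merged, key=lambda result: result[1])


def fake_shard(results: List[Result]) -> Shard:
    """Stand-in for an HTTP call to one shard's search endpoint."""

    async def search(query: str, limit: int) -> List[Result]:
        await asyncio.sleep(0)  # yield control, as a network call would
        return sorted(results, key=lambda r: r[1], reverse=True)[:limit]

    return search


shards = [
    fake_shard([("a", 0.91), ("b", 0.40)]),
    fake_shard([("c", 0.75)]),
    fake_shard([("d", 0.88), ("e", 0.10)]),
]
top = asyncio.run(cluster_search(shards, "example query", limit=2))
```

Each shard is asked for `limit` results so that the merged top `limit` is correct even if all of the best matches happen to live on one shard.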

Configuration Patterns

Docker Configuration

# config.yml for Docker deployment
path: /data/index
writable: true

embeddings:
  path: sentence-transformers/all-MiniLM-L6-v2
  content: true

Cluster Configuration

# config.yml for coordinator node
cluster:
  shards:
    - http://shard-0:8000
    - http://shard-1:8000
    - http://shard-2:8000

Lambda Configuration

The Lambda deployment uses the same YAML configuration as Docker, but replaces the uvicorn entrypoint with a Mangum handler:

from mangum import Mangum
from txtai.api import app, start

# Manually trigger lifespan startup
start()

# Wrap FastAPI app for Lambda
handler = Mangum(app, lifespan="off")

Model Caching Strategy

A critical production deployment technique is pre-caching models in the container image:

# Cache models during build (without loading index data)
RUN python -c "from txtai.api import API; API('config.yml', False)"

This line in the Dockerfile:

  1. Instantiates the API class with loaddata=False
  2. Downloads all referenced models (transformers, tokenizers, etc.)
  3. Stores them in the container's filesystem (typically in ~/.cache)
  4. Does not load or create any index data

The resulting container image is larger but starts much faster because no network downloads are needed at runtime.
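One way to sanity-check that the build step actually populated the cache is to total the files under the cache directory inside the built image. The helper below is a hypothetical sketch, not part of txtai, and the cache location passed to it is an assumption that depends on the libraries in use.

```python
from pathlib import Path
from typing import Dict


def cache_sizes(cache_root: Path) -> Dict[str, int]:
    """Return the total file size in bytes for each top-level entry under cache_root."""
    sizes: Dict[str, int] = {}
    if not cache_root.is_dir():
        return sizes
    for entry in sorted(cache_root.iterdir()):
        if entry.is_dir():
            # Skip symlinks so files referenced from several places are counted once
            sizes[entry.name] = sum(
                path.stat().st_size
                for path in entry.rglob("*")
                if path.is_file() and not path.is_symlink()
            )
        else:
            sizes[entry.name] = entry.stat().st_size
    return sizes
```

Calling `cache_sizes(Path.home() / ".cache")` in a container built with the caching step should report entries of substantial size; an empty or near-empty result suggests the models would still be downloaded at runtime.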

Design Rationale

Why Multiple Deployment Modes

Different deployment contexts have different requirements:

| Requirement                    | Docker                   | Lambda                | Cluster                  |
|--------------------------------|--------------------------|-----------------------|--------------------------|
| Low latency                    | Yes (persistent process) | No (cold starts)      | Yes (parallel shards)    |
| Cost efficiency at low traffic | No (always running)      | Yes (pay per request) | No (multiple nodes)      |
| Large index support            | Limited by node memory   | Very limited          | Yes (horizontal scaling) |
| Operational simplicity         | High                     | Medium                | Lower (multiple nodes)   |
| Auto-scaling                   | Via orchestrator         | Built-in              | Manual shard management  |

Why Adler-32 for Sharding

The Cluster class uses Adler-32 (via zlib.adler32) to hash string document IDs to shard indices. Adler-32 was chosen because:

  • It is fast -- significantly faster than cryptographic hashes
  • It provides reasonable distribution for typical document ID patterns
  • It is deterministic -- the same ID always maps to the same shard
  • It is available in Python's standard library with no additional dependencies
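Whether the distribution is reasonable for a given deployment depends on the actual ID pattern, and it can be checked empirically. The sketch below counts how many IDs land on each shard, assuming the shard index is the Adler-32 value modulo the shard count; the `doc-N` ID pattern and the `shard_histogram` helper are hypothetical.

```python
import zlib
from collections import Counter
from typing import Iterable


def shard_histogram(uids: Iterable[str], shard_count: int) -> Counter:
    """Count how many IDs land on each shard under Adler-32 modulo sharding."""
    return Counter(zlib.adler32(uid.encode("utf-8")) % shard_count for uid in uids)


# Hypothetical ID pattern; substitute a sample of real document IDs
uids = [f"doc-{i}" for i in range(10_000)]
histogram = shard_histogram(uids, 3)
for shard in range(3):
    print(shard, histogram[shard])
```

Running this against a sample of real IDs before choosing a shard count shows whether any shard would receive a disproportionate share of documents.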
