
Principle:Marker Inc Korea AutoRAG REST API Deployment

From Leeroopedia
Knowledge Sources
Domains RAG Pipeline Deployment, REST API Design
Last Updated 2026-02-12 00:00 GMT

Overview

REST API deployment exposes an optimized RAG pipeline as an HTTP service with structured endpoints for full pipeline queries, retrieval-only queries, streaming generation, and version reporting.

Description

After an AutoRAG pipeline has been optimized and its best configuration extracted, the next step for production use is typically to serve it as an HTTP API. REST API deployment wraps the pipeline runner in an asynchronous web server, providing a standardized interface that any client application can consume regardless of programming language or platform.

The API design follows a resource-oriented pattern with four distinct endpoints, each addressing a different use case. The run endpoint executes the full pipeline and returns both the generated answer and the retrieved passages with metadata. The retrieve endpoint executes only the retrieval and reranking stages, skipping prompt construction and generation; this is useful for applications that handle generation separately or that display source documents. The stream endpoint delivers both retrieved passages and generated text tokens as server-sent events (SSE), enabling real-time UI updates. The version endpoint reports the AutoRAG library version for compatibility checks.

A key architectural decision is the use of pydantic models for request/response validation. The QueryRequest model validates incoming queries, while RunResponse, RetrievalResponse, and StreamResponse models ensure consistent, documented output formats. The RetrievedPassage model provides rich metadata including document content, ID, relevance score, file path, page number, and character-level start/end indices.
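The request/response contracts described above can be sketched with stdlib dataclasses standing in for the pydantic models. The field names here are illustrative assumptions drawn from the description, not the verified AutoRAG schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class QueryRequest:
    """Incoming query; pydantic would validate types at parse time."""
    query: str

@dataclass
class RetrievedPassage:
    """Provenance-rich passage metadata, as described above."""
    content: str
    doc_id: str
    score: float
    filepath: Optional[str] = None        # source file the passage came from
    file_page: Optional[int] = None       # page number within that file
    start_idx: Optional[int] = None       # character-level span start
    end_idx: Optional[int] = None         # character-level span end

@dataclass
class RunResponse:
    """Full-pipeline output: the answer plus its supporting passages."""
    result: str
    retrieved_passage: List[RetrievedPassage] = field(default_factory=list)

# Example: shape a /v1/run response object
resp = RunResponse(
    result="AutoRAG optimizes RAG pipelines.",
    retrieved_passage=[RetrievedPassage(content="...", doc_id="d1", score=0.87)],
)
```

With pydantic, the same models would additionally reject malformed input (wrong types, missing fields) at the API boundary rather than deep inside the pipeline.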

Usage

Use REST API deployment when you need to integrate the RAG pipeline into a larger system, serve it to multiple clients, or expose it over a network. It is the primary production deployment strategy. The optional ngrok tunnel support (via the remote parameter) enables quick prototyping by creating a public URL without infrastructure setup.

Theoretical Basis

The API follows standard REST conventions with JSON request/response bodies over HTTP POST (for queries) and HTTP GET (for metadata):

Endpoints:
  POST /v1/run       -> Full pipeline execution (query -> answer + passages)
  POST /v1/retrieve  -> Retrieval-only execution (query -> passages)
  POST /v1/stream    -> Streaming generation (query -> SSE of passages + tokens)
  GET  /version      -> Library version metadata
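A client interacts with the run endpoint by POSTing a JSON body and unpacking the answer and passages from the response. The payload keys below are assumptions consistent with the endpoint descriptions above, not a verified schema; the helpers are demonstrated offline against a sample response body:

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # assumed default host/port

def build_run_request(query: str) -> bytes:
    """Serialize a /v1/run request body."""
    return json.dumps({"query": query}).encode("utf-8")

def parse_run_response(body: bytes) -> tuple:
    """Extract the answer and passage contents from a /v1/run response."""
    data = json.loads(body)
    answer = data["result"]
    passages = [p["content"] for p in data.get("retrieved_passage", [])]
    return answer, passages

def run_query(query: str) -> tuple:
    """POST the query to the full-pipeline endpoint (requires a live server)."""
    req = urllib.request.Request(
        f"{API_BASE}/v1/run",
        data=build_run_request(query),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return parse_run_response(resp.read())

# Offline demonstration of the helpers against a sample response body:
sample = json.dumps({
    "result": "42",
    "retrieved_passage": [{"content": "ctx", "doc_id": "d1", "score": 0.9}],
}).encode("utf-8")
answer, passages = parse_run_response(sample)
```

The retrieve endpoint follows the same pattern with a passages-only response; the version endpoint is a plain GET with no body.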

Streaming architecture:

The stream endpoint uses an async generator pattern. For each module in the pipeline:

  • If the module is not a generator, it executes normally and accumulates results.
  • If the module is a generator, retrieved passages are yielded first as individual JSON objects, then the generator's astream method is called to yield text tokens incrementally.

This design ensures that clients receive retrieved context immediately, before waiting for the full generation to complete.
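The pattern can be sketched as an async generator that yields the retrieved passages first and text chunks afterward. The event shapes and the stubbed pipeline below are illustrative assumptions, not AutoRAG internals:

```python
import asyncio
import json
from typing import AsyncIterator

async def fake_astream(query: str) -> AsyncIterator[str]:
    """Stand-in for a generator module's astream(): yields tokens."""
    for token in ["Hello", " ", "world"]:
        await asyncio.sleep(0)  # yield control so other requests can proceed
        yield token

async def stream_events(query: str, passages: list) -> AsyncIterator[str]:
    """Yield retrieved passages as JSON objects, then generated tokens.

    Mirrors the described pattern: the client sees its retrieval context
    immediately, before generation has finished.
    """
    for p in passages:
        yield json.dumps({"type": "retrieved_passage", "passage": p})
    async for token in fake_astream(query):
        yield json.dumps({"type": "generated_text", "generated_text": token})

async def collect(query: str, passages: list) -> list:
    return [event async for event in stream_events(query, passages)]

events = asyncio.run(collect("hi", [{"content": "ctx", "score": 0.9}]))
```

In the real endpoint each yielded object would be framed as a server-sent event, so a browser client can render citations at once and append tokens as they arrive.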

Key design properties:

  • Async execution: The server is built on Quart (an async Flask-compatible framework), enabling concurrent request handling.
  • Validated contracts: Pydantic models enforce type safety on both input and output.
  • Passage metadata: Retrieved passages include provenance information (filepath, page, character indices) for transparency and citation.

Related Pages

Implemented By
