Principle:PacktPublishing LLM Engineers Handbook REST API Serving
| Field | Value |
|---|---|
| Concept | Exposing ML pipelines as REST APIs |
| Category | Infrastructure / API Serving |
| Workflow | RAG_Inference |
| Repository | PacktPublishing/LLM-Engineers-Handbook |
| Implemented by | Implementation:PacktPublishing_LLM_Engineers_Handbook_FastAPI_RAG_Endpoint |
Overview
API serving for ML is the practice of wrapping an inference pipeline, here the RAG inference pipeline, in a REST API endpoint; this repository uses FastAPI for that purpose. The result is a standard HTTP interface for client applications. The endpoint handles request parsing, pipeline orchestration (retrieval followed by generation), error handling, and response formatting. Observability integration (Opik) tracks prompts, latencies, and token counts.
Theory
REST API Design for ML
Serving ML models via REST APIs involves several design considerations:
- Request/Response schema - Using Pydantic models to define typed request and response structures ensures input validation and clear API contracts
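- Synchronous vs. asynchronous - FastAPI's async support lets a single worker handle concurrent requests efficiently, but the underlying ML computation is often synchronous; a blocking call made directly inside an async endpoint stalls the event loop, so such work is typically moved to a worker thread
- Error handling - ML pipelines can fail at multiple stages (retrieval, generation); returning appropriate HTTP status codes (4xx for invalid requests, 5xx for pipeline failures) with descriptive messages helps clients handle failures gracefully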
- Statelessness - Each request contains all information needed to process it, enabling horizontal scaling
Pipeline Orchestration
The API endpoint orchestrates the full RAG pipeline in a single request:
- Query reception - Parse the incoming JSON request
- Context retrieval - Run the retrieval pipeline (self-query, query expansion, vector search, reranking)
- LLM generation - Assemble context and generate the answer
- Response formatting - Return the answer in the defined response schema
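The steps above can be sketched as one orchestration function. Every stage here is a hypothetical stand-in (word overlap instead of vector search, a string template instead of an LLM call), and the self-query step is omitted for brevity; the point is only the order in which the stages are chained.

```python
from dataclasses import dataclass


@dataclass
class Document:
    content: str
    score: float


def expand_query(query: str) -> list[str]:
    # Stand-in for LLM-based query expansion: the original plus one trivial variant.
    return [query, query.lower()]


def vector_search(query: str, corpus: list[str]) -> list[Document]:
    # Stand-in for vector search: score each text by word overlap with the query.
    words = set(query.lower().split())
    hits = []
    for text in corpus:
        overlap = len(words & set(text.lower().split()))
        if overlap:
            hits.append(Document(text, float(overlap)))
    return hits


def rerank(docs: list[Document], k: int) -> list[Document]:
    # Deduplicate by content (keeping the best score), then keep the top k.
    best: dict[str, Document] = {}
    for doc in docs:
        if doc.content not in best or doc.score > best[doc.content].score:
            best[doc.content] = doc
    return sorted(best.values(), key=lambda d: d.score, reverse=True)[:k]


def generate(query: str, context: list[Document]) -> str:
    # Stand-in for the LLM call: report how much context was assembled.
    return f"[{len(context)} context docs] {query}"


def answer_query(query: str, corpus: list[str], k: int = 2) -> dict[str, str]:
    # Retrieval: expand the query, search once per variant, rerank the union.
    candidates: list[Document] = []
    for variant in expand_query(query):
        candidates.extend(vector_search(variant, corpus))
    context = rerank(candidates, k)
    # Generation and response formatting.
    return {"answer": generate(query, context)}
```

Keeping the orchestration in a plain function, separate from the HTTP handler, makes it possible to test the pipeline without starting a web server.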
Observability
Production ML APIs require observability beyond standard web metrics:
- Prompt tracking - Recording the exact prompts sent to the LLM
- Latency breakdown - Measuring time spent in retrieval vs. generation
- Token counts - Tracking input and output token usage for cost monitoring
- Tracing - End-to-end request tracing through all pipeline stages
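To make the signals above concrete, here is a small standard-library sketch that records a per-stage latency breakdown, the exact prompt, and rough token counts for one request. It does not use Opik's API; the `RequestTrace` class, the stubbed stages, and the whitespace-based token estimate are assumptions for illustration, and a real setup would delegate this bookkeeping to the tracing tool and the model's tokenizer.

```python
import time
from contextlib import contextmanager


class RequestTrace:
    """Hypothetical per-request record of timings, prompt, and token counts."""

    def __init__(self) -> None:
        self.spans: dict[str, float] = {}
        self.prompt: str | None = None
        self.tokens: dict[str, int] = {}

    @contextmanager
    def span(self, name: str):
        # Measure wall-clock time spent inside the with-block.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spans[name] = time.perf_counter() - start


def traced_request(query: str) -> RequestTrace:
    trace = RequestTrace()
    with trace.span("retrieval"):
        context = ["doc one", "doc two"]  # stubbed retrieval result
    with trace.span("generation"):
        # Record the exact prompt that would be sent to the LLM.
        trace.prompt = f"Context: {' '.join(context)}\nQuestion: {query}"
        answer = "stub answer"  # stubbed LLM output
    # Whitespace-separated words as a crude proxy for token counts.
    trace.tokens = {
        "input": len(trace.prompt.split()),
        "output": len(answer.split()),
    }
    return trace
```

Separating the retrieval and generation spans shows which stage dominates latency, and the token counts feed directly into cost estimates.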
When to Use
- When exposing the RAG pipeline as a web service for client applications
- When building microservice architectures where the ML pipeline is a standalone service
- When clients need a standard HTTP interface to interact with the inference system
- When observability and monitoring of the inference pipeline are required
Related Concepts
- FastAPI - modern Python web framework with automatic OpenAPI documentation
- Pydantic - data validation and settings management using Python type annotations
- Microservice architecture - decomposing applications into independently deployable services
- API gateway - centralized entry point for API traffic management