Implementation:PacktPublishing LLM Engineers Handbook FastAPI RAG Endpoint
| Field | Value |
|---|---|
| Type | API Doc |
| Workflow | RAG_Inference |
| Repository | PacktPublishing/LLM-Engineers-Handbook |
| Source | inference_pipeline_api.py:L37-66 |
| Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_REST_API_Serving |
API Signature
@app.post("/rag", response_model=QueryResponse)
async def rag_endpoint(request: QueryRequest) -> dict
Import
import opik
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
Key Code
app = FastAPI()
class QueryRequest(BaseModel):
    query: str
class QueryResponse(BaseModel):
    answer: str
@opik.track
def rag(query: str) -> str:
    retriever = ContextRetriever(mock=False)
    documents = retriever.search(query, k=3)
    context = EmbeddedChunk.to_context(documents)
    answer = call_llm_service(query, context)
    return answer
@app.post("/rag", response_model=QueryResponse)
async def rag_endpoint(request: QueryRequest):
    try:
        answer = rag(query=request.query)
        return {"answer": answer}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e)) from e
Request and Response Schemas
Request
| Field | Type | Description |
|---|---|---|
| query | str | The natural language question to answer |
Example request:
{
    "query": "What are the best practices for fine-tuning LLMs?"
}
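A request like the one above could be sent from Python using only the standard library. This is a minimal client sketch, not part of the repository: the base URL `http://localhost:8000` and the helper names `build_rag_request` and `ask` are assumptions made for illustration.

```python
import json
import urllib.request

# Assumption: the service is reachable locally on port 8000; adjust as needed.
BASE_URL = "http://localhost:8000"


def build_rag_request(query: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /rag endpoint."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/rag",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query: str, base_url: str = BASE_URL, timeout: float = 60.0) -> str:
    """Send the query and return the `answer` field of the JSON response."""
    request = build_rag_request(query, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["answer"]
```

With the API running, `ask("What are the best practices for fine-tuning LLMs?")` would return the generated answer as a string. Building the request separately from sending it keeps the payload easy to inspect without a live server.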
Response
| Field | Type | Description |
|---|---|---|
| answer | str | The generated answer grounded in retrieved context |
Example response:
{
    "answer": "The best practices for fine-tuning LLMs include..."
}
Inputs and Outputs
Inputs:
- HTTP POST request with JSON body {"query": "..."}
Outputs:
- HTTP response with JSON body {"answer": "..."}
- HTTP 500 with error detail if the pipeline fails
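On the client side, the two outcomes can be told apart by status code. The sketch below assumes the error body has the shape `{"detail": ...}`, which is how FastAPI serializes an `HTTPException`. The names `RagServiceError` and `parse_rag_response` are hypothetical and do not appear in the repository.

```python
import json


class RagServiceError(RuntimeError):
    """Raised when the /rag endpoint does not return a usable answer."""


def parse_rag_response(status_code: int, body: bytes) -> str:
    """Return the answer on HTTP 200; raise RagServiceError otherwise."""
    data = json.loads(body.decode("utf-8"))
    if status_code == 200:
        return data["answer"]
    # Error responses carry the exception message under the "detail" key.
    detail = data.get("detail", "unknown error")
    raise RagServiceError(f"HTTP {status_code}: {detail}")
```

A caller would pass in the status code and raw body from whichever HTTP client it uses, then either consume the returned string or handle `RagServiceError`.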
How It Works
- The FastAPI app receives a POST request at the /rag endpoint
- The request body is validated against the QueryRequest Pydantic model
- The rag() function is called, which:
  - Creates a ContextRetriever instance to handle the full retrieval pipeline (self-query, query expansion, vector search, reranking)
  - Calls retriever.search() with k=3 to retrieve the three most relevant document chunks
  - Converts the retrieved chunks to a context string via EmbeddedChunk.to_context()
  - Calls call_llm_service() to generate an answer using the context and query
- The @opik.track decorator records the entire RAG call for observability (prompts, latencies, token counts)
- The answer is returned in the QueryResponse schema
- Any exception is caught and re-raised as an HTTP 500 error whose detail is the exception message
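The control flow described above can be traced without any of the real services. The following is a framework-free sketch under stated assumptions: `Chunk`, `StubRetriever`, `to_context`, `stub_llm`, and `handle` are hypothetical stand-ins for `ContextRetriever`, `EmbeddedChunk.to_context()`, `call_llm_service()`, and the FastAPI route. Request validation and Opik tracking are left out.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved document chunk."""
    content: str


class StubRetriever:
    """Hypothetical stand-in for ContextRetriever; returns canned chunks."""

    def search(self, query: str, k: int = 3) -> list[Chunk]:
        corpus = ["chunk A", "chunk B", "chunk C", "chunk D"]
        return [Chunk(text) for text in corpus[:k]]


def to_context(chunks: list[Chunk]) -> str:
    """Join chunk contents into one context string (format is an assumption)."""
    return "\n".join(chunk.content for chunk in chunks)


def stub_llm(query: str, context: str) -> str:
    """Hypothetical stand-in for call_llm_service; echoes its inputs."""
    if not query.strip():
        raise ValueError("empty query")
    return f"Answer to {query!r} using {len(context.splitlines())} chunks"


def rag(query: str) -> str:
    """Retrieve, build the context, then generate: same order as the real rag()."""
    documents = StubRetriever().search(query, k=3)
    context = to_context(documents)
    return stub_llm(query, context)


def handle(query: str) -> tuple[int, dict]:
    """Mimic the route: 200 with the answer, or 500 with the error message."""
    try:
        return 200, {"answer": rag(query=query)}
    except Exception as e:
        return 500, {"detail": str(e)}
```

Calling `handle("What is RAG?")` takes the success path, while `handle("  ")` makes the stubbed LLM raise and shows how any failure inside `rag()` surfaces as a 500 with the exception text as its detail.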
External Dependencies
- fastapi - Web framework providing async HTTP endpoints and automatic OpenAPI documentation
- opik - Observability platform for tracking prompts, latencies, and token usage
- pydantic - Data validation for request and response schemas
Source File
llm_engineering/infrastructure/inference_pipeline_api.py (lines 37-66)
See Also
- Principle:PacktPublishing_LLM_Engineers_Handbook_REST_API_Serving
- Environment:PacktPublishing_LLM_Engineers_Handbook_Python_3_11_Poetry_Environment
- Environment:PacktPublishing_LLM_Engineers_Handbook_Docker_MongoDB_Qdrant_Infrastructure
- Environment:PacktPublishing_LLM_Engineers_Handbook_API_Credentials