Implementation:PacktPublishing LLM Engineers Handbook FastAPI RAG Endpoint

Field	Value
Type	API Doc
Workflow	RAG_Inference
Repository	PacktPublishing/LLM-Engineers-Handbook
Source	inference_pipeline_api.py:L37-66
Implements	Principle:PacktPublishing_LLM_Engineers_Handbook_REST_API_Serving

API Signature

@app.post("/rag", response_model=QueryResponse)
async def rag_endpoint(request: QueryRequest) -> dict

Import

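import opik
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# ContextRetriever, EmbeddedChunk, and call_llm_service come from the repository's own modules.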

Key Code

app = FastAPI()


class QueryRequest(BaseModel):
    query: str


class QueryResponse(BaseModel):
    answer: str


@opik.track
def rag(query: str) -> str:
    retriever = ContextRetriever(mock=False)
    documents = retriever.search(query, k=3)
    context = EmbeddedChunk.to_context(documents)
    answer = call_llm_service(query, context)
    return answer


@app.post("/rag", response_model=QueryResponse)
async def rag_endpoint(request: QueryRequest):
    try:
        answer = rag(query=request.query)
        return {"answer": answer}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e)) from e

Request and Response Schemas

Request

Field	Type	Description
query	str	The natural language question to answer

Example request:

{
    "query": "What are the best practices for fine-tuning LLMs?"
}

Response

Field	Type	Description
answer	str	The generated answer grounded in retrieved context

Example response:

{
    "answer": "The best practices for fine-tuning LLMs include..."
}
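
A minimal sketch of how the two schemas behave, assuming only that pydantic is installed. The models are copied from Key Code; parse_request is a hypothetical helper added for illustration and is not part of the repository. In the real endpoint FastAPI performs this validation itself, so no such helper is needed there.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class QueryRequest(BaseModel):
    query: str


class QueryResponse(BaseModel):
    answer: str


def parse_request(body: dict) -> Optional[QueryRequest]:
    """Hypothetical helper: return a validated request, or None on a schema mismatch."""
    try:
        return QueryRequest(**body)
    except ValidationError:
        # A missing "query" field ends up here.
        return None


valid = parse_request({"query": "What are the best practices for fine-tuning LLMs?"})
invalid = parse_request({"question": "wrong field name"})
response = QueryResponse(answer="The best practices for fine-tuning LLMs include...")
```

Here valid is a QueryRequest instance, invalid is None because the required query field is absent, and response carries the answer string under its answer attribute.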

Inputs and Outputs

Inputs:

HTTP POST request with JSON body {"query": "..."}

Outputs:

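HTTP 200 response with JSON body {"answer": "..."}
HTTP 422 with validation details if the request body does not match QueryRequest (generated by FastAPI before the handler runs)
HTTP 500 with the exception message as the error detail if the pipeline fails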
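
One way a client might call the endpoint, sketched with the Python standard library only. The function names build_rag_request and ask, the 60-second timeout, and any base URL passed in (for example http://localhost:8000) are assumptions for illustration; the source does not say where the API is served.

```python
import json
import urllib.error
import urllib.request


def build_rag_request(base_url: str, query: str) -> urllib.request.Request:
    """Build the POST /rag request without sending it."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/rag",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, query: str, timeout: float = 60.0) -> str:
    """Send the query and return the answer; raise RuntimeError on an HTTP error."""
    request = build_rag_request(base_url, query)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read())["answer"]
    except urllib.error.HTTPError as err:
        raw = err.read().decode("utf-8", errors="replace")
        try:
            # FastAPI serialises HTTPException messages under the "detail" key.
            detail = json.loads(raw).get("detail", raw)
        except (ValueError, AttributeError):
            # The body was not a JSON object; fall back to the raw text.
            detail = raw
        raise RuntimeError(f"/rag failed with HTTP {err.code}: {detail}") from err
```

With the server running at an assumed local address, ask("http://localhost:8000", "What are the best practices for fine-tuning LLMs?") would return the answer string, and a pipeline failure would surface as a RuntimeError carrying the HTTP 500 detail.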

How It Works

1. The FastAPI app receives a POST request at the /rag endpoint.
2. FastAPI validates the request body against the QueryRequest Pydantic model; a body that does not match is rejected with HTTP 422 before the handler runs.
3. The handler calls rag(), which:
- Creates a ContextRetriever instance to handle the full retrieval pipeline (self-query, query expansion, vector search, reranking)
- Calls retriever.search(query, k=3) to retrieve the 3 most relevant document chunks
- Converts the retrieved chunks to a context string via EmbeddedChunk.to_context()
- Calls call_llm_service() to generate an answer using the context and query
4. The @opik.track decorator on rag() records the entire RAG call for observability (prompts, latencies, token counts).
5. The handler returns {"answer": answer}, which FastAPI serializes according to the QueryResponse schema.
6. Any exception raised inside rag() is caught and re-raised as an HTTPException, so the client receives HTTP 500 with str(e) as the error detail.
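
The control flow above can be sketched without FastAPI, a vector store, or an LLM. Everything in this block is a stand-in: Chunk, KeywordRetriever, to_context, fake_llm, and handle are hypothetical names, and the keyword-overlap ranking is a toy substitute for the real ContextRetriever pipeline. It only illustrates the order of operations and how a failure is turned into a 500-style payload.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Stand-in for a retrieved document chunk (not the repository's EmbeddedChunk)."""
    content: str


class KeywordRetriever:
    """Toy retriever that ranks chunks by word overlap with the query."""

    def __init__(self, corpus: list[str]) -> None:
        self.corpus = [Chunk(text) for text in corpus]

    def search(self, query: str, k: int = 3) -> list[Chunk]:
        words = set(query.lower().split())
        ranked = sorted(
            self.corpus,
            key=lambda chunk: len(words & set(chunk.content.lower().split())),
            reverse=True,
        )
        return ranked[:k]


def to_context(chunks: list[Chunk]) -> str:
    """Join chunk texts into one numbered context string."""
    return "\n".join(f"[{i}] {c.content}" for i, c in enumerate(chunks, start=1))


def fake_llm(query: str, context: str) -> str:
    """Placeholder for call_llm_service(); it only reports what it was given."""
    return f"Answer to '{query}' using {len(context.splitlines())} chunks"


def rag(query: str, retriever: KeywordRetriever) -> str:
    documents = retriever.search(query, k=3)   # retrieve the top 3 chunks
    context = to_context(documents)            # build the context string
    return fake_llm(query, context)            # generate the answer


def handle(query: str, retriever: KeywordRetriever) -> tuple[int, dict]:
    """Mirror the endpoint's try/except: any failure becomes a 500 with str(e)."""
    try:
        return 200, {"answer": rag(query, retriever)}
    except Exception as e:
        return 500, {"detail": str(e)}
```

Passing the retriever in as an argument is a choice made for this sketch so that a failing retriever can be substituted; the source code constructs ContextRetriever inside rag() on every call.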

External Dependencies

fastapi - Web framework providing async HTTP endpoints and automatic OpenAPI documentation
opik - Observability platform for tracking prompts, latencies, and token usage
pydantic - Data validation for request and response schemas

Source File

llm_engineering/infrastructure/inference_pipeline_api.py (lines 37-66)
