Implementation:PacktPublishing LLM Engineers Handbook FastAPI RAG Endpoint

From Leeroopedia


Field         Value
Type          API Doc
Workflow      RAG_Inference
Repository    PacktPublishing/LLM-Engineers-Handbook
Source        inference_pipeline_api.py:L37-66
Implements    Principle:PacktPublishing_LLM_Engineers_Handbook_REST_API_Serving

API Signature

@app.post("/rag", response_model=QueryResponse)
async def rag_endpoint(request: QueryRequest) -> dict

Import

import opik
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

Key Code

app = FastAPI()


class QueryRequest(BaseModel):
    query: str


class QueryResponse(BaseModel):
    answer: str


@opik.track
def rag(query: str) -> str:
    retriever = ContextRetriever(mock=False)
    documents = retriever.search(query, k=3)
    context = EmbeddedChunk.to_context(documents)
    answer = call_llm_service(query, context)
    return answer


@app.post("/rag", response_model=QueryResponse)
async def rag_endpoint(request: QueryRequest):
    try:
        answer = rag(query=request.query)
        return {"answer": answer}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e)) from e

Request and Response Schemas

Request

Field    Type    Description
query    str     The natural language question to answer

Example request:

{
    "query": "What are the best practices for fine-tuning LLMs?"
}
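
A client might build this request as in the following sketch. It uses only the Python standard library and stops short of sending anything. The base URL http://localhost:8000 and the helper name build_rag_request are assumptions for illustration, not part of the repository.

```python
import json
import urllib.request


def build_rag_request(query: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST request for the /rag endpoint.

    base_url is an assumed local address; replace it with wherever the API is served.
    """
    # The body matches the QueryRequest schema: a single "query" string field.
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/rag",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_rag_request("What are the best practices for fine-tuning LLMs?")
# Sending is left to the caller, e.g. urllib.request.urlopen(request, timeout=60).
```

Keeping request construction separate from sending makes the payload easy to inspect or test without a running server.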

Response

Field     Type    Description
answer    str     The generated answer grounded in retrieved context

Example response:

{
    "answer": "The best practices for fine-tuning LLMs include..."
}

Inputs and Outputs

Inputs:

  • HTTP POST request with JSON body {"query": "..."}

Outputs:

  • HTTP response with JSON body {"answer": "..."}
  • HTTP 500 with JSON body {"detail": "..."} if the pipeline raises an exception
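
On the client side, the two outcomes above can be told apart by status code. The sketch below assumes the error body has the shape {"detail": "..."}, which is how FastAPI serializes an HTTPException. The helper parse_rag_response and the RagServiceError class are hypothetical names introduced for this example.

```python
import json


class RagServiceError(RuntimeError):
    """Raised when the /rag endpoint answers with a non-200 status (hypothetical helper type)."""

    def __init__(self, status: int, detail: str) -> None:
        super().__init__(f"HTTP {status}: {detail}")
        self.status = status
        self.detail = detail


def parse_rag_response(status: int, body: bytes) -> str:
    """Return the answer on HTTP 200, otherwise raise RagServiceError with the server's detail."""
    payload = json.loads(body)
    if status == 200:
        # Success bodies follow the QueryResponse schema.
        return payload["answer"]
    # FastAPI places the HTTPException message under the "detail" key.
    raise RagServiceError(status, str(payload.get("detail", "unknown error")))


answer = parse_rag_response(200, b'{"answer": "The best practices include..."}')
```

Raising a dedicated exception keeps the pipeline's error message available to the caller instead of silently returning an empty answer.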

How It Works

  1. The FastAPI app receives a POST request at the /rag endpoint
  2. The request body is validated against the QueryRequest Pydantic model; a body that fails validation is rejected with HTTP 422 before rag() is called
  3. The rag() function is called, which:
    • Creates a ContextRetriever instance to handle the full retrieval pipeline (self-query, query expansion, vector search, reranking)
    • Calls retriever.search(query, k=3) to retrieve the top 3 relevant document chunks
    • Converts the retrieved chunks to a context string via EmbeddedChunk.to_context()
    • Calls call_llm_service() to generate an answer using the context and query
  4. The @opik.track decorator records the entire RAG call for observability (prompts, latencies, token counts)
  5. The answer is returned in the QueryResponse schema
  6. Any exception raised in the pipeline is caught and re-raised as an HTTPException with status 500, using str(e) as the error detail
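
The control flow above can be sketched without FastAPI or the repository's classes. Everything in the block is a stand-in: StubRetriever, to_context, stub_llm, and handle_rag are hypothetical names that only mirror the order of operations (validate, retrieve, build context, generate, map failures to a status code). The real ContextRetriever, EmbeddedChunk.to_context(), and call_llm_service() are not reproduced here.

```python
from typing import Callable


class StubRetriever:
    """Stand-in for ContextRetriever: ranks chunks by naive word overlap with the query."""

    def __init__(self, chunks: list[str]) -> None:
        self.chunks = chunks

    def search(self, query: str, k: int = 3) -> list[str]:
        words = set(query.lower().split())
        ranked = sorted(
            self.chunks,
            key=lambda chunk: len(words & set(chunk.lower().split())),
            reverse=True,
        )
        return ranked[:k]


def to_context(documents: list[str]) -> str:
    """Stand-in for EmbeddedChunk.to_context(): joins chunks into one numbered string."""
    return "\n".join(f"Chunk {i}: {doc}" for i, doc in enumerate(documents, start=1))


def stub_llm(query: str, context: str) -> str:
    """Stand-in for call_llm_service(): reports how many chunks it was given."""
    return f"Answer to {query!r} using {len(context.splitlines())} chunks"


def handle_rag(
    payload: dict,
    retriever: StubRetriever,
    llm: Callable[[str, str], str] = stub_llm,
) -> tuple[int, dict]:
    """Mirror the endpoint: validate, run the pipeline, map failures to status codes."""
    query = payload.get("query")
    if not isinstance(query, str):
        # FastAPI rejects bodies that fail QueryRequest validation with 422.
        return 422, {"detail": "query must be a string"}
    try:
        documents = retriever.search(query, k=3)
        answer = llm(query, to_context(documents))
        return 200, {"answer": answer}
    except Exception as exc:
        # Same mapping as the endpoint: any pipeline failure becomes a 500 with str(exc).
        return 500, {"detail": str(exc)}
```

Passing the LLM call in as a parameter is a choice made for this sketch so that the failure path can be exercised without a live model service.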

External Dependencies

  • fastapi - Web framework providing async HTTP endpoints and automatic OpenAPI documentation
  • opik - Observability platform for tracking prompts, latencies, and token usage
  • pydantic - Data validation for request and response schemas

Source File

  • llm_engineering/infrastructure/inference_pipeline_api.py (lines 37-66)
