Workflow: PacktPublishing LLM Engineer's Handbook RAG Inference
| Knowledge Sources | |
|---|---|
| Domains | LLMs, RAG, Inference, LLM_Ops |
| Last Updated | 2026-02-08 07:45 GMT |
Overview
End-to-end process for serving the fine-tuned LLM Twin model through a RAG-augmented FastAPI endpoint that retrieves relevant context from Qdrant and generates personalized responses via a SageMaker inference endpoint.
Description
This workflow deploys the fine-tuned model to AWS SageMaker as an inference endpoint and serves it through a FastAPI REST API with Retrieval-Augmented Generation. When a user query arrives, the RAG pipeline performs self-query metadata extraction, query expansion, parallel vector search across content types in Qdrant, cross-encoder reranking, and finally sends the enriched context with the query to the SageMaker-hosted LLM for generation. All operations are traced using Opik for observability.
Usage
Execute this workflow after the fine-tuned model has been deployed to a SageMaker inference endpoint. You need the Qdrant vector database populated with embedded chunks (from the Feature Engineering pipeline) and the SageMaker endpoint active. This provides the production inference interface for the LLM Twin system.
Execution Steps
Step 1: Model Deployment to SageMaker
Deploy the fine-tuned HuggingFace model to an AWS SageMaker inference endpoint. This uses the SageMaker HuggingFace LLM image with a strategy pattern implementation that handles endpoint creation, configuration, and model registration. Auto-scaling can be optionally configured.
Key considerations:
- Uses the HuggingFace LLM inference image (version 2.2.0) for optimized serving
- Deployment follows a strategy pattern (SagemakerHuggingfaceStrategy) for clean separation
- ResourceManager checks for existing endpoints to avoid duplicates
- Endpoint name and config name are controlled via application settings
- GPU instance type is configurable (default specified in settings)
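The strategy pattern and duplicate check above can be sketched as follows. This is a minimal illustration, not the book's actual implementation: the stub `deploy` body marks where a real `SagemakerHuggingfaceStrategy` would call the SageMaker SDK, and the default instance type is an assumption standing in for the settings value.

```python
from abc import ABC, abstractmethod


class DeploymentStrategy(ABC):
    """Common interface so deployment targets are interchangeable."""

    @abstractmethod
    def deploy(self, model_id, endpoint_name):
        ...


class ResourceManager:
    """Tracks known endpoints so repeated deploys are idempotent."""

    def __init__(self, existing=None):
        self._endpoints = set(existing or ())

    def endpoint_exists(self, name):
        return name in self._endpoints

    def register(self, name):
        self._endpoints.add(name)


class SagemakerHuggingfaceStrategy(DeploymentStrategy):
    """Sketch: a real implementation would create the endpoint config and
    endpoint via the SageMaker SDK using the HuggingFace LLM image."""

    def __init__(self, resources, instance_type="ml.g5.2xlarge"):  # assumed default
        self.resources = resources
        self.instance_type = instance_type  # GPU instance type from settings

    def deploy(self, model_id, endpoint_name):
        if self.resources.endpoint_exists(endpoint_name):
            return endpoint_name  # skip: avoid duplicate endpoints
        # ... SageMaker API calls would go here ...
        self.resources.register(endpoint_name)
        return endpoint_name
```

Separating the strategy from the resource bookkeeping keeps endpoint creation logic testable without touching AWS.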
Step 2: Self-Query Metadata Extraction
When a user query arrives at the FastAPI endpoint, the first RAG step extracts structured metadata from the natural language query. Specifically, it identifies the author's full name mentioned in the query using an LLM prompt, enabling filtered vector search scoped to a specific author's content.
Key considerations:
- Uses GPT-4o-mini to parse author identity from the query
- The extracted author_full_name is resolved to an author_id for Qdrant filtering
- Queries without an identifiable author proceed without filtering
- This step is tracked by Opik for observability
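The extraction-and-resolution flow can be sketched as below. The `fake_llm` function and the `AUTHOR_IDS` mapping are hypothetical stand-ins: in the real pipeline the prompt goes to GPT-4o-mini and the author ID lookup is backed by stored author metadata.

```python
def fake_llm(prompt):
    # Stand-in for the GPT-4o-mini call that parses the author name.
    return "Paul Iusztin" if "Paul" in prompt else "none"


AUTHOR_IDS = {"Paul Iusztin": "1"}  # illustrative name -> author_id mapping


def self_query(query, llm=fake_llm):
    """Extract the author's full name from the query and resolve it to an
    author_id; return None so the search proceeds unfiltered otherwise."""
    prompt = (
        "Extract the author's full name from the question below. "
        "Answer 'none' if no author is mentioned.\n" + query
    )
    author_full_name = llm(prompt).strip()
    if author_full_name.lower() == "none":
        return None  # no identifiable author: search without a filter
    return AUTHOR_IDS.get(author_full_name)
```

Returning `None` rather than raising keeps author-less queries on the happy path.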
Step 3: Query Expansion
Generate multiple reformulations of the original query to improve search recall. The query expander uses an LLM to produce semantically similar but lexically diverse variations of the query, each of which will be searched independently.
Key considerations:
- Expands to a configurable number of queries (default: 3)
- Each expanded query captures different aspects or phrasings of the user's intent
- More queries increase recall but also increase search latency and cost
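A sketch of the expansion step, under two stated assumptions: the LLM is asked to separate its reformulations with a marker token (`#next-question#` here is illustrative), and the original query is kept alongside the expansions so it is also searched.

```python
EXPAND_TO = 3  # configurable default number of expansions
SEPARATOR = "#next-question#"  # assumed delimiter the prompt asks for


def fake_expander_llm(prompt):
    # Stand-in for the LLM that writes N reformulations of the query.
    return (
        "How does RAG improve LLM answers?"
        f"{SEPARATOR}In what ways does retrieval help generation?"
        f"{SEPARATOR}Why combine vector search with an LLM?"
    )


def expand_query(query, n=EXPAND_TO, llm=fake_expander_llm):
    prompt = (
        f"Generate {n} different versions of this question, "
        f"separated by '{SEPARATOR}':\n{query}"
    )
    raw = llm(prompt)
    expanded = [q.strip() for q in raw.split(SEPARATOR) if q.strip()]
    return [query] + expanded[:n]  # original query is searched too
```

Each returned string is then searched independently in the next step, which is where the recall/latency trade-off noted above comes from.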
Step 4: Parallel Vector Search
Execute vector similarity searches in Qdrant for each expanded query, searching across all three content type collections (articles, posts, repositories) in parallel. Each search returns the top-K most similar embedded chunks, optionally filtered by author ID.
Key considerations:
- Searches run in parallel using ThreadPoolExecutor for low latency
- Each query searches three collections: EmbeddedArticleChunk, EmbeddedPostChunk, EmbeddedRepositoryChunk
- Results are deduplicated across queries using set operations
- Author filtering is applied via Qdrant's FieldCondition when author_id is available
- k is divided equally across the three content types (k//3 per collection)
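The fan-out, k-split, and deduplication logic can be sketched with an in-memory stand-in for Qdrant. `FAKE_DB` and `search_collection` are toy substitutes; a real search would query Qdrant with a `FieldCondition` filter on `author_id` when one is available.

```python
from concurrent.futures import ThreadPoolExecutor

COLLECTIONS = ["EmbeddedArticleChunk", "EmbeddedPostChunk", "EmbeddedRepositoryChunk"]

# Toy in-memory "collections" standing in for Qdrant.
FAKE_DB = {
    "EmbeddedArticleChunk": ["a1", "a2", "a3"],
    "EmbeddedPostChunk": ["p1", "p2"],
    "EmbeddedRepositoryChunk": ["r1"],
}


def search_collection(collection, query, limit):
    # A real implementation issues a Qdrant vector search here.
    return FAKE_DB[collection][:limit]


def parallel_search(queries, k=9):
    per_collection = k // len(COLLECTIONS)  # k split across content types
    tasks = [(c, q) for q in queries for c in COLLECTIONS]
    with ThreadPoolExecutor() as pool:
        result_lists = list(
            pool.map(lambda t: search_collection(t[0], t[1], per_collection), tasks)
        )
    seen, merged = set(), []
    for results in result_lists:
        for chunk in results:
            if chunk not in seen:  # dedupe hits shared across expanded queries
                seen.add(chunk)
                merged.append(chunk)
    return merged
```

Because expanded queries are paraphrases of one intent, they often retrieve overlapping chunks, which is why the set-based deduplication matters before reranking.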
Step 5: Cross-Encoder Reranking
Rerank the combined search results using a cross-encoder model to select the most relevant chunks. The cross-encoder scores each query-chunk pair for semantic relevance, providing more accurate ranking than the initial vector similarity search.
Key considerations:
- The CrossEncoderModelSingleton ensures the reranking model is loaded once
- Reranking selects the top-K most relevant chunks from all retrieved results
- Cross-encoder scoring is more accurate than bi-encoder similarity but slower
- Only applied when retrieved results are non-empty
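The singleton and reranking shape can be sketched as below. The word-overlap scorer is a toy proxy: the real `CrossEncoderModelSingleton` wraps a cross-encoder model that scores each (query, chunk) pair jointly.

```python
class CrossEncoderModelSingleton:
    """Ensures the (stubbed) reranking model is constructed at most once."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __call__(self, pairs):
        # Toy relevance score: shared-word count per (query, chunk) pair.
        return [
            len(set(q.lower().split()) & set(c.lower().split()))
            for q, c in pairs
        ]


def rerank(query, chunks, keep_top_k=3):
    if not chunks:  # only rerank non-empty result sets
        return []
    model = CrossEncoderModelSingleton()
    scores = model([(query, c) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:keep_top_k]]
```

The singleton matters because loading a transformer per request would dominate latency; the model is paid for once per process.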
Step 6: Context Assembly and LLM Generation
Assemble the reranked chunks into a context string and send it along with the original query to the SageMaker-hosted LLM Twin model for answer generation. The InferenceExecutor handles prompt formatting and parameter configuration for the SageMaker endpoint.
Key considerations:
- Context is formatted from embedded chunks using the to_context class method
- The LLMInferenceSagemakerEndpoint handles the SageMaker API communication
- InferenceExecutor wraps the query, context, and generation parameters
- Opik traces capture metadata: model ID, embedding model, temperature, token counts
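A minimal sketch of the assembly and execution path. The prompt template, default generation parameters, and the `llm` callable (standing in for `LLMInferenceSagemakerEndpoint`) are illustrative assumptions; only the `to_context` classmethod shape comes from the source.

```python
from dataclasses import dataclass


@dataclass
class EmbeddedChunk:
    content: str

    @classmethod
    def to_context(cls, chunks):
        # Join chunk contents into one context string for the prompt.
        return "\n".join(f"- {c.content}" for c in chunks)


class InferenceExecutor:
    """Wraps query, context, and generation parameters for the endpoint.
    `llm` stands in for the SageMaker endpoint client."""

    def __init__(self, llm, temperature=0.7, max_new_tokens=512):  # assumed defaults
        self.llm = llm
        self.parameters = {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
        }

    def execute(self, query, context):
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        return self.llm(prompt, self.parameters)
```

In production, the Opik trace decorator would wrap `execute` to record the model ID, temperature, and token counts noted above.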
Step 7: API Response
Return the generated answer through the FastAPI REST API as a structured JSON response. The endpoint handles errors gracefully and returns appropriate HTTP status codes.
Key considerations:
- The /rag POST endpoint accepts a QueryRequest with a query string
- Responses are structured as QueryResponse with an answer field
- Errors return HTTP 500 with the exception detail
- The FastAPI server is launched via Uvicorn through the ml_service tool script
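The request/response contract and error handling can be sketched with stdlib dataclasses so the example stays self-contained; the real service defines `QueryRequest`/`QueryResponse` as Pydantic models on a FastAPI `/rag` route served by Uvicorn, and the `pipeline` callable here is a hypothetical stand-in for the full RAG chain.

```python
from dataclasses import dataclass


@dataclass
class QueryRequest:
    query: str


@dataclass
class QueryResponse:
    answer: str


def rag_endpoint(request, pipeline):
    """Mirrors the /rag POST handler: 200 with an answer on success,
    500 with the exception detail on failure."""
    try:
        answer = pipeline(request.query)
        return 200, QueryResponse(answer=answer).__dict__
    except Exception as exc:
        return 500, {"detail": str(exc)}
```

Catching at the handler boundary keeps pipeline failures from crashing the server while still surfacing the cause to the caller.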