Workflow: PacktPublishing LLM Engineer's Handbook RAG Inference
| Knowledge Sources | |
|---|---|
| Domains | LLMs, RAG, Inference, LLM_Ops |
| Last Updated | 2026-02-08 07:45 GMT |
Overview
End-to-end process for serving the fine-tuned LLM Twin model through a RAG-augmented FastAPI endpoint that retrieves relevant context from Qdrant and generates personalized responses via a SageMaker inference endpoint.
Description
This workflow deploys the fine-tuned model to AWS SageMaker as an inference endpoint and serves it through a FastAPI REST API with Retrieval-Augmented Generation. When a user query arrives, the RAG pipeline performs self-query metadata extraction, query expansion, parallel vector search across content types in Qdrant, cross-encoder reranking, and finally sends the enriched context with the query to the SageMaker-hosted LLM for generation. All operations are traced using Opik for observability.
Usage
Execute this workflow after the fine-tuned model has been deployed to a SageMaker inference endpoint. You need the Qdrant vector database populated with embedded chunks (from the Feature Engineering pipeline) and the SageMaker endpoint active. This provides the production inference interface for the LLM Twin system.
Execution Steps
Step 1: Model Deployment to SageMaker
Deploy the fine-tuned HuggingFace model to an AWS SageMaker inference endpoint. This uses the SageMaker HuggingFace LLM image with a strategy pattern implementation that handles endpoint creation, configuration, and model registration. Auto-scaling can be optionally configured.
Key considerations:
- Uses the HuggingFace LLM inference image (version 2.2.0) for optimized serving
- Deployment follows a strategy pattern (SagemakerHuggingfaceStrategy) for clean separation
- ResourceManager checks for existing endpoints to avoid duplicates
- Endpoint name and config name are controlled via application settings
- GPU instance type is configurable (default specified in settings)
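The strategy pattern and duplicate check above can be sketched as follows. This is a minimal illustration, not the book's actual implementation: the stub `deploy` body marks where a real `SagemakerHuggingfaceStrategy` would call the SageMaker SDK, and the default instance type is an assumption standing in for the settings value.

```python
from abc import ABC, abstractmethod


class DeploymentStrategy(ABC):
    """Common interface so deployment targets are interchangeable."""

    @abstractmethod
    def deploy(self, model_id, endpoint_name):
        ...


class ResourceManager:
    """Tracks known endpoints so repeated deploys are idempotent."""

    def __init__(self, existing=None):
        self._endpoints = set(existing or ())

    def endpoint_exists(self, name):
        return name in self._endpoints

    def register(self, name):
        self._endpoints.add(name)


class SagemakerHuggingfaceStrategy(DeploymentStrategy):
    """Sketch: a real implementation would create the endpoint config and
    endpoint via the SageMaker SDK using the HuggingFace LLM image."""

    def __init__(self, resources, instance_type="ml.g5.2xlarge"):  # assumed default
        self.resources = resources
        self.instance_type = instance_type  # GPU instance type from settings

    def deploy(self, model_id, endpoint_name):
        if self.resources.endpoint_exists(endpoint_name):
            return endpoint_name  # skip: avoid duplicate endpoints
        # ... SageMaker API calls would go here ...
        self.resources.register(endpoint_name)
        return endpoint_name
```

Separating the strategy from the resource bookkeeping keeps endpoint creation logic testable without touching AWS.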
Step 2: Self-Query Metadata Extraction
When a user query arrives at the FastAPI endpoint, the first RAG step extracts structured metadata from the natural language query. Specifically, it identifies the author's full name mentioned in the query using an LLM prompt, enabling filtered vector search scoped to a specific author's content.
Key considerations:
- Uses GPT-4o-mini to parse author identity from the query
- The extracted author_full_name is resolved to an author_id for Qdrant filtering
- Queries without an identifiable author proceed without filtering
- This step is tracked by Opik for observability
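The extraction-and-resolution flow can be sketched as below. The `fake_llm` function and the `AUTHOR_IDS` mapping are hypothetical stand-ins: in the real pipeline the prompt goes to GPT-4o-mini and the author ID lookup is backed by stored author metadata.

```python
def fake_llm(prompt):
    # Stand-in for the GPT-4o-mini call that parses the author name.
    return "Paul Iusztin" if "Paul" in prompt else "none"


AUTHOR_IDS = {"Paul Iusztin": "1"}  # illustrative name -> author_id mapping


def self_query(query, llm=fake_llm):
    """Extract the author's full name from the query and resolve it to an
    author_id; return None so the search proceeds unfiltered otherwise."""
    prompt = (
        "Extract the author's full name from the question below. "
        "Answer 'none' if no author is mentioned.\n" + query
    )
    author_full_name = llm(prompt).strip()
    if author_full_name.lower() == "none":
        return None  # no identifiable author: search without a filter
    return AUTHOR_IDS.get(author_full_name)
```

Returning `None` rather than raising keeps author-less queries on the happy path.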
Step 3: Query Expansion
Generate multiple reformulations of the original query to improve search recall. The query expander uses an LLM to produce semantically similar but lexically diverse variations of the query, each of which will be searched independently.
Key considerations:
- Expands to a configurable number of queries (default: 3)
- Each expanded query captures different aspects or phrasings of the user's intent
- More queries increase recall but also increase search latency and cost
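A sketch of the expansion step, under two stated assumptions: the LLM is asked to separate its reformulations with a marker token (`#next-question#` here is illustrative), and the original query is kept alongside the expansions so it is also searched.

```python
EXPAND_TO = 3  # configurable default number of expansions
SEPARATOR = "#next-question#"  # assumed delimiter the prompt asks for


def fake_expander_llm(prompt):
    # Stand-in for the LLM that writes N reformulations of the query.
    return (
        "How does RAG improve LLM answers?"
        f"{SEPARATOR}In what ways does retrieval help generation?"
        f"{SEPARATOR}Why combine vector search with an LLM?"
    )


def expand_query(query, n=EXPAND_TO, llm=fake_expander_llm):
    prompt = (
        f"Generate {n} different versions of this question, "
        f"separated by '{SEPARATOR}':\n{query}"
    )
    raw = llm(prompt)
    expanded = [q.strip() for q in raw.split(SEPARATOR) if q.strip()]
    return [query] + expanded[:n]  # original query is searched too
```

Each returned string is then searched independently in the next step, which is where the recall/latency trade-off noted above comes from.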
Step 4: Parallel Vector Search
Execute vector similarity searches in Qdrant for each expanded query, searching across all three content type collections (articles, posts, repositories) in parallel. Each search returns the top-K most similar embedded chunks, optionally filtered by author ID.
Key considerations:
- Searches run in parallel using ThreadPoolExecutor for low latency
- Each query searches three collections: EmbeddedArticleChunk, EmbeddedPostChunk, EmbeddedRepositoryChunk
- Results are deduplicated across queries using set operations
- Author filtering is applied via Qdrant's FieldCondition when author_id is available
- k is divided equally across the three content types (k//3 per collection)
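The fan-out, k-split, and deduplication logic can be sketched with an in-memory stand-in for Qdrant. `FAKE_DB` and `search_collection` are toy substitutes; a real search would query Qdrant with a `FieldCondition` filter on `author_id` when one is available.

```python
from concurrent.futures import ThreadPoolExecutor

COLLECTIONS = ["EmbeddedArticleChunk", "EmbeddedPostChunk", "EmbeddedRepositoryChunk"]

# Toy in-memory "collections" standing in for Qdrant.
FAKE_DB = {
    "EmbeddedArticleChunk": ["a1", "a2", "a3"],
    "EmbeddedPostChunk": ["p1", "p2"],
    "EmbeddedRepositoryChunk": ["r1"],
}


def search_collection(collection, query, limit):
    # A real implementation issues a Qdrant vector search here.
    return FAKE_DB[collection][:limit]


def parallel_search(queries, k=9):
    per_collection = k // len(COLLECTIONS)  # k split across content types
    tasks = [(c, q) for q in queries for c in COLLECTIONS]
    with ThreadPoolExecutor() as pool:
        result_lists = list(
            pool.map(lambda t: search_collection(t[0], t[1], per_collection), tasks)
        )
    seen, merged = set(), []
    for results in result_lists:
        for chunk in results:
            if chunk not in seen:  # dedupe hits shared across expanded queries
                seen.add(chunk)
                merged.append(chunk)
    return merged
```

Because expanded queries are paraphrases of one intent, they often retrieve overlapping chunks, which is why the set-based deduplication matters before reranking.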
Step 5: Cross-Encoder Reranking
Rerank the combined search results using a cross-encoder model to select the most relevant chunks. The cross-encoder scores each query-chunk pair for semantic relevance, providing more accurate ranking than the initial vector similarity search.
Key considerations:
- The CrossEncoderModelSingleton ensures the reranking model is loaded once
- Reranking selects the top-K most relevant chunks from all retrieved results
- Cross-encoder scoring is more accurate than bi-encoder similarity but slower
- Only applied when retrieved results are non-empty
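The singleton and reranking shape can be sketched as below. The word-overlap scorer is a toy proxy: the real `CrossEncoderModelSingleton` wraps a cross-encoder model that scores each (query, chunk) pair jointly.

```python
class CrossEncoderModelSingleton:
    """Ensures the (stubbed) reranking model is constructed at most once."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __call__(self, pairs):
        # Toy relevance score: shared-word count per (query, chunk) pair.
        return [
            len(set(q.lower().split()) & set(c.lower().split()))
            for q, c in pairs
        ]


def rerank(query, chunks, keep_top_k=3):
    if not chunks:  # only rerank non-empty result sets
        return []
    model = CrossEncoderModelSingleton()
    scores = model([(query, c) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:keep_top_k]]
```

The singleton matters because loading a transformer per request would dominate latency; the model is paid for once per process.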
Step 6: Context Assembly and LLM Generation
Assemble the reranked chunks into a context string and send it along with the original query to the SageMaker-hosted LLM Twin model for answer generation. The InferenceExecutor handles prompt formatting and parameter configuration for the SageMaker endpoint.
Key considerations:
- Context is formatted from embedded chunks using the to_context class method
- The LLMInferenceSagemakerEndpoint handles the SageMaker API communication
- InferenceExecutor wraps the query, context, and generation parameters
- Opik traces capture metadata: model ID, embedding model, temperature, token counts
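A minimal sketch of the assembly and execution path. The prompt template, default generation parameters, and the `llm` callable (standing in for `LLMInferenceSagemakerEndpoint`) are illustrative assumptions; only the `to_context` classmethod shape comes from the source.

```python
from dataclasses import dataclass


@dataclass
class EmbeddedChunk:
    content: str

    @classmethod
    def to_context(cls, chunks):
        # Join chunk contents into one context string for the prompt.
        return "\n".join(f"- {c.content}" for c in chunks)


class InferenceExecutor:
    """Wraps query, context, and generation parameters for the endpoint.
    `llm` stands in for the SageMaker endpoint client."""

    def __init__(self, llm, temperature=0.7, max_new_tokens=512):  # assumed defaults
        self.llm = llm
        self.parameters = {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
        }

    def execute(self, query, context):
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        return self.llm(prompt, self.parameters)
```

In production, the Opik trace decorator would wrap `execute` to record the model ID, temperature, and token counts noted above.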
Step 7: API Response
Return the generated answer through the FastAPI REST API as a structured JSON response. The endpoint handles errors gracefully and returns appropriate HTTP status codes.
Key considerations:
- The /rag POST endpoint accepts a QueryRequest with a query string
- Responses are structured as QueryResponse with an answer field
- Errors return HTTP 500 with the exception detail
- The FastAPI server is launched via Uvicorn through the ml_service tool script
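The request/response contract and error handling can be sketched with stdlib dataclasses so the example stays self-contained; the real service defines `QueryRequest`/`QueryResponse` as Pydantic models on a FastAPI `/rag` route served by Uvicorn, and the `pipeline` callable here is a hypothetical stand-in for the full RAG chain.

```python
from dataclasses import dataclass


@dataclass
class QueryRequest:
    query: str


@dataclass
class QueryResponse:
    answer: str


def rag_endpoint(request, pipeline):
    """Mirrors the /rag POST handler: 200 with an answer on success,
    500 with the exception detail on failure."""
    try:
        answer = pipeline(request.query)
        return 200, QueryResponse(answer=answer).__dict__
    except Exception as exc:
        return 500, {"detail": str(exc)}
```

Catching at the handler boundary keeps pipeline failures from crashing the server while still surfacing the cause to the caller.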