Implementation:PacktPublishing LLM Engineers Handbook InferenceExecutor Execute

Field	Value
Type	API Doc
Workflow	RAG_Inference
Repository	PacktPublishing/LLM-Engineers-Handbook
Source	run.py:L7-39, inference.py:L16-97
Implements	Principle:PacktPublishing_LLM_Engineers_Handbook_Context_Assembly_And_LLM_Generation

API Signature

InferenceExecutor(
    llm: Inference,
    query: str,
    context: str | None,
    prompt: str | None = None
).execute() -> str

Import

from llm_engineering.model.inference import InferenceExecutor, LLMInferenceSagemakerEndpoint

Key Code

From run.py (the InferenceExecutor class):

class InferenceExecutor:
    def __init__(self, llm, query, context, prompt=None):
        self.llm = llm
        self.query = query
        self.context = context
        # Any falsy prompt (None or "") falls back to the generated one.
        self.prompt = prompt or self._build_prompt()

    def _build_prompt(self):
        # A None context is rendered as the literal text "None" by str.format.
        template = """...Context: {context}\nQuestion: {query}\nAnswer:"""
        return template.format(context=self.context, query=self.query)

    def execute(self) -> str:
        # Attach the prompt and the fixed generation parameters to the backend.
        self.llm.set_payload(
            inputs=self.prompt,
            parameters={
                "max_new_tokens": 500,
                "repetition_penalty": 1.1,
                "temperature": 0.7,
                "top_p": 0.9,
                "top_k": 40,
                "do_sample": True,
            },
        )
        # The backend returns a list of results; the answer is in the first one.
        response = self.llm.inference()
        return response[0]["generated_text"]

From inference.py (the SageMaker endpoint client):

import json

import boto3


class LLMInferenceSagemakerEndpoint(Inference):
    def __init__(self, endpoint_name, inference_component_name=None):
        self.endpoint_name = endpoint_name
        self.client = boto3.client("sagemaker-runtime", ...)

    def set_payload(self, inputs, parameters):
        # Store the request body; it is serialized when inference() is called.
        self.payload = {"inputs": inputs, "parameters": parameters}

    def inference(self):
        # Send the JSON payload to the endpoint and decode the JSON response.
        response = self.client.invoke_endpoint(
            EndpointName=self.endpoint_name,
            ContentType="application/json",
            Body=json.dumps(self.payload),
        )
        return json.loads(response["Body"].read().decode("utf8"))
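
Because the executor only calls set_payload() and inference() on its backend, the flow can be exercised without AWS by substituting a stub. The sketch below is a minimal illustration under stated assumptions: FakeInference is a hypothetical stand-in that is not part of the repository, and the InferenceExecutor shown is a condensed copy of the excerpt with a placeholder template and a shortened parameter set.

```python
class FakeInference:
    """Hypothetical stand-in for an Inference backend; records its payload."""

    def __init__(self):
        self.payload = None

    def set_payload(self, inputs, parameters):
        self.payload = {"inputs": inputs, "parameters": parameters}

    def inference(self):
        # Mimic the response shape the executor expects: a list of dicts
        # that each carry a "generated_text" key.
        return [{"generated_text": f"echo: {self.payload['inputs']}"}]


class InferenceExecutor:
    """Condensed copy of the excerpt; the template here is a placeholder."""

    def __init__(self, llm, query, context, prompt=None):
        self.llm = llm
        self.query = query
        self.context = context
        self.prompt = prompt or self._build_prompt()

    def _build_prompt(self):
        template = "Context: {context}\nQuestion: {query}\nAnswer:"
        return template.format(context=self.context, query=self.query)

    def execute(self) -> str:
        self.llm.set_payload(
            inputs=self.prompt,
            parameters={"max_new_tokens": 500, "temperature": 0.7},
        )
        return self.llm.inference()[0]["generated_text"]


backend = FakeInference()
answer = InferenceExecutor(
    llm=backend,
    query="What is RAG?",
    context="RAG combines retrieval with generation.",
).execute()
print(answer)
```

Swapping the stub for LLMInferenceSagemakerEndpoint should leave the calling code unchanged, since both expose the same two methods.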

Parameters

Parameter	Type	Description
llm	Inference	The LLM inference backend (e.g., LLMInferenceSagemakerEndpoint)
query	str	The user's original query string
context	str or None	The assembled context from retrieved documents
prompt	str or None	Optional custom prompt (auto-generated if not provided)
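
The constructor excerpt resolves the prompt with `prompt or self._build_prompt()`. Two consequences follow: any falsy value, including an empty string, triggers the auto-generated prompt, and a custom prompt is sent as given, without the query or context being interpolated into it. A minimal sketch of that behaviour, using a hypothetical resolve_prompt helper and a placeholder template rather than the repository's own:

```python
def resolve_prompt(query, context, prompt=None):
    """Hypothetical helper mirroring `prompt or self._build_prompt()`."""
    # Placeholder template; the real template text is elided in the excerpt.
    default = f"Context: {context}\nQuestion: {query}\nAnswer:"
    return prompt or default


# No prompt supplied: the default is built from context and query.
auto = resolve_prompt("What is RAG?", "RAG combines retrieval with generation.")

# Custom prompt supplied: it is used verbatim and the context is not inserted.
custom = resolve_prompt("What is RAG?", "ignored", prompt="Summarize RAG in one line.")

# Empty string is falsy, so it also falls back to the default.
empty = resolve_prompt("What is RAG?", "some context", prompt="")

print(auto, custom, empty, sep="\n---\n")
```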

Generation Parameters

Parameter	Value	Description
max_new_tokens	500	Maximum tokens in the generated response
repetition_penalty	1.1	Penalty for repeating tokens (1.0 = no penalty)
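temperature	0.7	Softmax temperature; values below 1.0 sharpen the token distribution and make output less random
top_p	0.9	Nucleus sampling: sample only from the smallest set of tokens whose cumulative probability reaches this threshold
top_k	40	Number of highest-probability tokens considered at each step
do_sample	True	Samples from the distribution instead of decoding greedily; needed for temperature, top_p, and top_k to have an effect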
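
To make the interaction of temperature, top_k, and top_p concrete, the sketch below applies them to a made-up four-token distribution in plain Python. It is a simplified illustration, not the serving stack's implementation: the order and details of these filters can differ between inference servers, repetition_penalty is not modeled, and the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_candidates(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in ranked)
    kept, cumulative = {}, 0.0
    for i in ranked:
        p = probs[i] / mass  # renormalize over the top_k survivors
        kept[i] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize again so the surviving candidates sum to 1.
    total = sum(kept.values())
    return {i: p / total for i, p in kept.items()}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a four-token vocabulary
probs = softmax(toy_logits, temperature=0.7)
candidates = filter_candidates(probs, top_k=3, top_p=0.9)
print(candidates)
```

With do_sample enabled, the next token would be drawn at random from the surviving candidates according to their renormalized probabilities.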

Inputs and Outputs

Inputs:

query (str) - The user's natural language question
context (str or None) - Concatenated text from retrieved and reranked document chunks
SageMaker endpoint name - Supplied as endpoint_name when the Inference backend is constructed

Outputs:

str - The generated answer text from the LLM
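
The context input is a single string, so the retrieved chunks have to be joined before the executor is constructed. The page does not show how the repository does this; the sketch below is one plausible way, and build_context, its separator, and the max_chars budget are all hypothetical.

```python
def build_context(chunks, separator="\n\n", max_chars=None):
    """Hypothetical helper: join ranked chunks in order, within an optional budget."""
    parts, used = [], 0
    for chunk in chunks:
        text = chunk.strip()
        if not text:
            continue  # skip empty chunks
        cost = len(text) + (len(separator) if parts else 0)
        if max_chars is not None and used + cost > max_chars:
            break  # stop at the first chunk that would exceed the budget
        parts.append(text)
        used += cost
    return separator.join(parts)


# Chunks are assumed to arrive already sorted by reranker score.
ranked_chunks = [" RAG retrieves documents. ", "", "An LLM then generates the answer."]
context = build_context(ranked_chunks)
short_context = build_context(ranked_chunks, max_chars=30)
print(context)
```

Keeping the chunks in reranked order means that a character budget drops the least relevant material first.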

How It Works

1. The InferenceExecutor is initialized with an LLM backend, a query, and a context.
2. If no custom prompt is provided, _build_prompt() constructs one by inserting the context and query into a template.
3. execute() calls set_payload() on the backend, which packages the prompt and generation parameters into a payload dictionary.
4. execute() then calls inference(), which serializes the payload to JSON and sends it to the SageMaker endpoint via the invoke_endpoint API.
5. The response body is parsed from JSON, and the generated text is read from the first element (response[0]["generated_text"]).
6. The generated answer string is returned.
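
The wire format in the steps above can be sketched without calling AWS. The example below serializes a payload the way inference() does and parses a simulated response; the response is faked with io.BytesIO, which is assumed here only because it offers the same read() call as the body stream boto3 returns. The extract_generated_text helper and its shape check are hypothetical additions, not part of the repository.

```python
import io
import json

payload = {
    "inputs": "Context: RAG combines retrieval with generation.\nQuestion: What is RAG?\nAnswer:",
    "parameters": {"max_new_tokens": 500, "temperature": 0.7, "do_sample": True},
}
body = json.dumps(payload)  # the string passed as Body= to invoke_endpoint

# Simulated endpoint response with a readable byte stream under "Body".
raw = json.dumps([{"generated_text": "RAG pairs retrieval with generation."}])
fake_response = {"Body": io.BytesIO(raw.encode("utf8"))}


def extract_generated_text(response):
    """Hypothetical helper: decode the body and return the first generated text."""
    parsed = json.loads(response["Body"].read().decode("utf8"))
    if not isinstance(parsed, list) or not parsed or "generated_text" not in parsed[0]:
        raise ValueError(f"Unexpected response shape: {parsed!r}")
    return parsed[0]["generated_text"]


answer = extract_generated_text(fake_response)
print(answer)
```

The explicit shape check is a design choice for this sketch: the excerpt indexes response[0]["generated_text"] directly, which would raise a less descriptive IndexError or KeyError on an unexpected response.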

External Dependencies

boto3 - AWS SDK for invoking SageMaker runtime endpoints
json - JSON serialization and deserialization for payloads

Source Files

llm_engineering/model/inference/run.py (lines 7-39)
llm_engineering/model/inference/inference.py (lines 16-97)
