Implementation:PacktPublishing LLM Engineers Handbook InferenceExecutor Execute
| Field | Value |
|---|---|
| Type | API Doc |
| Workflow | RAG_Inference |
| Repository | PacktPublishing/LLM-Engineers-Handbook |
| Source | run.py:L7-39, inference.py:L16-97 |
| Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_Context_Assembly_And_LLM_Generation |
API Signature
InferenceExecutor(
    llm: Inference,
    query: str,
    context: str | None,
    prompt: str | None = None
).execute() -> str
Import
from llm_engineering.model.inference import InferenceExecutor, LLMInferenceSagemakerEndpoint
Key Code
From run.py (the InferenceExecutor class):
class InferenceExecutor:
    def __init__(self, llm, query, context, prompt=None):
        self.llm = llm
        self.query = query
        self.context = context
        self.prompt = prompt or self._build_prompt()

    def _build_prompt(self):
        template = """...Context: {context}\nQuestion: {query}\nAnswer:"""
        return template.format(context=self.context, query=self.query)

    def execute(self) -> str:
        self.llm.set_payload(
            inputs=self.prompt,
            parameters={
                "max_new_tokens": 500,
                "repetition_penalty": 1.1,
                "temperature": 0.7,
                "top_p": 0.9,
                "top_k": 40,
                "do_sample": True,
            },
        )
        response = self.llm.inference()
        return response[0]["generated_text"]
From inference.py (the SageMaker endpoint client):
class LLMInferenceSagemakerEndpoint(Inference):
    def __init__(self, endpoint_name, inference_component_name=None):
        self.endpoint_name = endpoint_name
        self.client = boto3.client("sagemaker-runtime", ...)

    def set_payload(self, inputs, parameters):
        self.payload = {"inputs": inputs, "parameters": parameters}

    def inference(self):
        response = self.client.invoke_endpoint(
            EndpointName=self.endpoint_name,
            ContentType="application/json",
            Body=json.dumps(self.payload),
        )
        return json.loads(response["Body"].read().decode("utf8"))
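The request and response handling in the client can be exercised without AWS. The following is a minimal sketch, assuming the endpoint replies with a JSON list whose first element carries a generated_text key (the shape that execute() indexes into). FakeRuntimeClient and the endpoint name are hypothetical stand-ins for illustration and are not part of the repository.

```python
import io
import json


class FakeRuntimeClient:
    """Hypothetical stand-in for the boto3 sagemaker-runtime client."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        # Decode the JSON request body, as a real endpoint would receive it.
        request = json.loads(Body)
        # Assumed response shape: a JSON list with one "generated_text" entry.
        reply = [{"generated_text": "echo: " + request["inputs"]}]
        # boto3 exposes the response body as a readable stream; mimic that.
        return {"Body": io.BytesIO(json.dumps(reply).encode("utf8"))}


payload = {"inputs": "Hello", "parameters": {"max_new_tokens": 500}}
client = FakeRuntimeClient()
response = client.invoke_endpoint(
    EndpointName="example-endpoint",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
# Same parsing step as LLMInferenceSagemakerEndpoint.inference().
parsed = json.loads(response["Body"].read().decode("utf8"))
print(parsed[0]["generated_text"])
```

A stub like this may be useful in unit tests, since it keeps the serialize, invoke, and parse steps identical while removing the network call.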
Parameters
| Parameter | Type | Description |
|---|---|---|
| llm | Inference | The LLM inference backend (e.g., LLMInferenceSagemakerEndpoint) |
| query | str | The user's original query string |
| context | str or None | The assembled context from retrieved documents |
| prompt | str or None | Optional custom prompt (auto-generated if not provided) |
Generation Parameters
| Parameter | Value | Description |
|---|---|---|
| max_new_tokens | 500 | Maximum tokens in the generated response |
| repetition_penalty | 1.1 | Penalty for repeating tokens (1.0 = no penalty) |
| temperature | 0.7 | Controls randomness in sampling |
| top_p | 0.9 | Nucleus sampling probability threshold |
| top_k | 40 | Number of top candidates considered at each step |
| do_sample | True | Enables stochastic sampling |
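To make the interaction between temperature, top_k, and top_p more concrete, here is a simplified sketch of how these three parameters are commonly applied to a vector of logits before a token is sampled. The function name filter_distribution and the example logits are hypothetical, and the serving stack behind the endpoint may order or implement these steps differently.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=40, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k, and top-p."""
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-k: keep only the k most probable token indices.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize the surviving tokens so their probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


dist = filter_distribution([2.0, 1.0, 0.5, -1.0])
print(dist)
```

With do_sample set to True, the next token would then be drawn at random from the surviving candidates in proportion to their renormalized probabilities.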
Inputs and Outputs
Inputs:
- query (str) - The user's natural language question
- context (str or None) - Concatenated text from retrieved and reranked document chunks
- SageMaker endpoint name - Configured via the Inference object
Outputs:
- str - The generated answer text from the LLM
How It Works
- The InferenceExecutor is initialized with an LLM backend, query, and context
- If no custom prompt is provided, _build_prompt() constructs one by inserting the context and query into a template
- The set_payload() method stores the prompt and generation parameters in a payload dict, which inference() later serializes to JSON
- The inference() method invokes the SageMaker endpoint via the invoke_endpoint API
- The response body is parsed from JSON and the generated text is extracted
- The generated answer string is returned
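The steps above can be traced end to end with an in-memory backend. This is a minimal sketch, not the repository code: StubLLM and SketchExecutor are hypothetical names, the template text is an assumed stand-in for the elided original, and only a subset of the generation parameters is passed.

```python
class StubLLM:
    """Hypothetical in-memory backend exposing set_payload() and inference()."""

    def __init__(self):
        self.payload = None

    def set_payload(self, inputs, parameters):
        # Same packaging step as the SageMaker client: prompt plus parameters.
        self.payload = {"inputs": inputs, "parameters": parameters}

    def inference(self):
        # Assumed response shape: a list holding one "generated_text" entry.
        return [{"generated_text": "stub answer for: " + self.payload["inputs"]}]


class SketchExecutor:
    """Condensed stand-in for InferenceExecutor, for illustration only."""

    # Assumed template; the real template text is elided in the source excerpt.
    TEMPLATE = "Context: {context}\nQuestion: {query}\nAnswer:"

    def __init__(self, llm, query, context, prompt=None):
        self.llm = llm
        self.query = query
        self.context = context
        # Any falsy prompt (None or "") falls back to the template.
        self.prompt = prompt or self.TEMPLATE.format(context=context, query=query)

    def execute(self) -> str:
        self.llm.set_payload(
            inputs=self.prompt,
            parameters={"max_new_tokens": 500, "temperature": 0.7, "do_sample": True},
        )
        return self.llm.inference()[0]["generated_text"]


llm = StubLLM()
answer = SketchExecutor(llm, query="What is RAG?", context="RAG adds retrieval.").execute()
print(answer)

# A custom prompt bypasses the template, so query and context are not inserted.
custom = SketchExecutor(llm, query="ignored", context=None, prompt="Say hi.").execute()
print(custom)
```

Two consequences of the `prompt or ...` fallback are worth noting: an empty-string prompt is treated the same as None, and when context is None with no custom prompt, str.format renders it as the literal text "None" inside the prompt.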
External Dependencies
- boto3 - AWS SDK for invoking SageMaker runtime endpoints
- json - JSON serialization and deserialization for payloads
Source Files
- llm_engineering/model/inference/run.py (lines 7-39)
- llm_engineering/model/inference/inference.py (lines 16-97)
See Also
- Principle:PacktPublishing_LLM_Engineers_Handbook_Context_Assembly_And_LLM_Generation
- Environment:PacktPublishing_LLM_Engineers_Handbook_AWS_SageMaker_GPU_Environment
- Heuristic:PacktPublishing_LLM_Engineers_Handbook_RAG_Retrieval_Parameters
- Heuristic:PacktPublishing_LLM_Engineers_Handbook_Temperature_Selection_By_Task