Implementation:PacktPublishing LLM Engineers Handbook InferenceExecutor Execute

Field        Value
Type         API Doc
Workflow     RAG_Inference
Repository   PacktPublishing/LLM-Engineers-Handbook
Source       run.py:L7-39, inference.py:L16-97
Implements   Principle:PacktPublishing_LLM_Engineers_Handbook_Context_Assembly_And_LLM_Generation

API Signature

InferenceExecutor(
    llm: Inference,
    query: str,
    context: str | None,
    prompt: str | None = None
).execute() -> str

Import

from llm_engineering.model.inference import InferenceExecutor, LLMInferenceSagemakerEndpoint

Key Code

From run.py (the InferenceExecutor class):

class InferenceExecutor:
    def __init__(self, llm, query, context, prompt=None):
        self.llm = llm
        self.query = query
        self.context = context
        # Use the caller's prompt if one is given; otherwise build it from the template.
        self.prompt = prompt or self._build_prompt()

    def _build_prompt(self):
        # The opening of the template is elided ("...") in this excerpt.
        template = """...Context: {context}\nQuestion: {query}\nAnswer:"""
        return template.format(context=self.context, query=self.query)

    def execute(self) -> str:
        # Hand the prompt and the fixed generation parameters to the backend.
        self.llm.set_payload(
            inputs=self.prompt,
            parameters={
                "max_new_tokens": 500,
                "repetition_penalty": 1.1,
                "temperature": 0.7,
                "top_p": 0.9,
                "top_k": 40,
                "do_sample": True,
            },
        )
        # The backend returns a list of generations; return the text of the first one.
        response = self.llm.inference()
        return response[0]["generated_text"]
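The prompt fallback in __init__ has a few edge cases worth seeing in isolation. The following is a minimal sketch, not the repository's code: resolve_prompt is a hypothetical helper that mirrors `prompt or self._build_prompt()`, and TEMPLATE keeps only the visible tail of the template, since its opening is elided in the excerpt above.

```python
from typing import Optional

# Hypothetical template: the original elides its leading text ("..."),
# so only the visible Context / Question / Answer tail is used here.
TEMPLATE = "Context: {context}\nQuestion: {query}\nAnswer:"


def resolve_prompt(query: str, context: Optional[str], prompt: Optional[str] = None) -> str:
    """Mirror `prompt or self._build_prompt()` from the class above."""
    return prompt or TEMPLATE.format(context=context, query=query)


# No custom prompt: the template is filled in with context and query.
auto = resolve_prompt("What is RAG?", "RAG combines retrieval with generation.")

# Custom prompt: returned unchanged, so query and context are not inserted.
custom = resolve_prompt("What is RAG?", "unused", prompt="Summarize RAG in one line.")

# context=None is formatted as the literal text "None".
no_context = resolve_prompt("What is RAG?", None)

# An empty string is falsy, so it also falls back to the template.
empty = resolve_prompt("What is RAG?", "ctx", prompt="")

print(auto)
```

Under these assumptions, a caller that passes a custom prompt is responsible for embedding the query and context in it, and a caller with no retrieved context may prefer to pass an empty string rather than None.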

From inference.py (the SageMaker endpoint client):

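import json

import boto3


# Inference is the backend base class; it is not shown in this excerpt.
class LLMInferenceSagemakerEndpoint(Inference):
    def __init__(self, endpoint_name, inference_component_name=None):
        self.endpoint_name = endpoint_name
        # The remaining client arguments are elided ("...") in this excerpt.
        self.client = boto3.client("sagemaker-runtime", ...)

    def set_payload(self, inputs, parameters):
        # Store the request body: the prompt plus the generation parameters.
        self.payload = {"inputs": inputs, "parameters": parameters}

    def inference(self):
        # Serialize the payload to JSON and send it to the endpoint.
        response = self.client.invoke_endpoint(
            EndpointName=self.endpoint_name,
            ContentType="application/json",
            Body=json.dumps(self.payload),
        )
        # The response body is a byte stream: read, decode, and parse it.
        return json.loads(response["Body"].read().decode("utf8"))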
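The serialize, invoke, and parse round trip can be exercised without AWS access. The sketch below is illustrative only: FakeRuntimeClient and invoke are hypothetical names, the fake accepts only the three keyword arguments used in the excerpt above, and "my-endpoint" is a placeholder. It assumes the endpoint replies with a JSON list of objects that carry a generated_text field, which is the shape execute() indexes into.

```python
import io
import json


class FakeRuntimeClient:
    """Hypothetical stand-in for the sagemaker-runtime client (no AWS calls)."""

    def __init__(self):
        self.last_request = None

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        # Record the request so the serialized payload can be inspected.
        self.last_request = {
            "EndpointName": EndpointName,
            "ContentType": ContentType,
            "Body": Body,
        }
        prompt = json.loads(Body)["inputs"]
        reply = [{"generated_text": f"echo: {prompt}"}]
        # The real client exposes the body as a stream; BytesIO offers the same read().
        return {"Body": io.BytesIO(json.dumps(reply).encode("utf8"))}


def invoke(client, endpoint_name, payload):
    """Same steps as inference() above: serialize, invoke, read, decode, parse."""
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read().decode("utf8"))


client = FakeRuntimeClient()
payload = {"inputs": "Hello", "parameters": {"max_new_tokens": 500}}
result = invoke(client, "my-endpoint", payload)
print(result[0]["generated_text"])  # echo: Hello
```

A fake like this can be handy in unit tests, because it lets the payload format be asserted on without deploying an endpoint.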

Parameters

Parameter   Type          Description
llm         Inference     The LLM inference backend (e.g., LLMInferenceSagemakerEndpoint)
query       str           The user's original query string
context     str or None   The assembled context from retrieved documents
prompt      str or None   Optional custom prompt (auto-generated if not provided)

Generation Parameters

Parameter            Value   Description
max_new_tokens       500     Maximum number of tokens in the generated response
repetition_penalty   1.1     Penalty for repeating tokens (1.0 = no penalty)
temperature          0.7     Scales randomness in sampling (lower values are more deterministic)
top_p                0.9     Nucleus sampling: sample from the smallest token set whose cumulative probability reaches this threshold
top_k                40      Number of highest-probability tokens considered at each step
do_sample            True    Enables stochastic sampling instead of greedy decoding
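To make the interplay of temperature, top_k, and top_p concrete, here is a toy sketch of how these filters are commonly applied to a next-token distribution. It is an illustration of the general technique under stated assumptions, not the code running on the endpoint: filter_distribution and the four-token vocabulary are hypothetical, the ordering (temperature, then top-k, then top-p) is one common convention, and repetition_penalty is left out for brevity.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=40, top_p=0.9):
    """Return {token: probability} after temperature, top-k, and top-p filtering."""
    # Temperature: divide logits before softmax; values below 1 sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}

    # Top-k: keep only the k most likely tokens.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they sum to 1 again.
    kept_total = sum(p for _, p in kept)
    return {tok: p / kept_total for tok, p in kept}


# Hypothetical logits for a four-token vocabulary.
toy_logits = {"cat": 2.0, "dog": 1.5, "fish": 0.2, "rock": -1.0}
filtered = filter_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.9)
print(filtered)
```

In this toy case top_k=3 drops "rock", and top_p=0.9 then drops "fish" because "cat" and "dog" together already exceed 90% of the probability mass. The next token would be sampled from the two survivors.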

Inputs and Outputs

Inputs:

  • query (str) - The user's natural language question
  • context (str or None) - Concatenated text from retrieved and reranked document chunks
  • SageMaker endpoint name - Configured via the Inference object

Outputs:

  • str - The generated answer text from the LLM

How It Works

  1. The InferenceExecutor is initialized with an LLM backend, a query, and a context.
  2. If no custom prompt is provided, _build_prompt() constructs one by inserting the context and query into a template.
  3. execute() calls set_payload() on the backend, which stores the prompt and generation parameters as a payload dictionary.
  4. inference() serializes the payload to JSON and invokes the SageMaker endpoint through the invoke_endpoint API.
  5. The response body is decoded and parsed from JSON, and the generated text is extracted from the first result.
  6. execute() returns the generated answer string.
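The steps above can be traced end to end with a stub backend. This is a sketch under assumptions rather than the repository's implementation: RecordingBackend and run are hypothetical names, the template text is a stand-in for the elided original, and the stub returns a canned reply instead of calling an endpoint.

```python
class RecordingBackend:
    """Hypothetical stand-in for an Inference backend; makes no network calls."""

    def __init__(self):
        self.payload = None

    def set_payload(self, inputs, parameters):
        # Step 3: store the prompt and generation parameters.
        self.payload = {"inputs": inputs, "parameters": parameters}

    def inference(self):
        # Steps 4-5: a real backend would call the endpoint and parse its JSON reply.
        return [{"generated_text": "stubbed answer"}]


def run(llm, query, context, prompt=None):
    # Steps 1-2: fall back to a template when no custom prompt is given.
    # The template text is an assumption; the original elides its opening.
    prompt = prompt or f"Context: {context}\nQuestion: {query}\nAnswer:"
    llm.set_payload(
        inputs=prompt,
        parameters={
            "max_new_tokens": 500,
            "repetition_penalty": 1.1,
            "temperature": 0.7,
            "top_p": 0.9,
            "top_k": 40,
            "do_sample": True,
        },
    )
    response = llm.inference()
    # Step 6: return only the generated text of the first result.
    return response[0]["generated_text"]


backend = RecordingBackend()
answer = run(backend, query="What is RAG?", context="RAG pairs retrieval with generation.")
print(answer)  # stubbed answer
```

Because the executor only relies on set_payload() and inference(), any object that provides those two methods can serve as the backend, which is what makes a stub like this workable.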

External Dependencies

  • boto3 - AWS SDK for invoking SageMaker runtime endpoints
  • json - Python standard-library module for serializing and deserializing payloads

Source Files

  • llm_engineering/model/inference/run.py (lines 7-39)
  • llm_engineering/model/inference/inference.py (lines 16-97)
