Principle:Microsoft Semantic kernel Execution Settings Configuration

Knowledge Sources	Semantic Kernel Documentation Semantic Kernel
Domains	AI_Orchestration, LLM_Parameter_Tuning
Last Updated	2026-02-11 19:00 GMT

Overview

Execution Settings Configuration is the principle of fine-tuning the behavior of AI model responses through a structured set of generation parameters that control randomness, length, format, and sampling strategy.

Description

Language models expose a set of generation parameters that profoundly affect the quality, style, and structure of their output. These parameters include temperature (controlling randomness), maximum token count (controlling length), top-p (nucleus sampling threshold), frequency and presence penalties (controlling repetition), and response format (controlling output structure). Execution Settings Configuration provides a typed, validated mechanism for specifying these parameters as part of the prompt invocation.

Rather than embedding generation parameters into raw API request bodies, the execution settings pattern encapsulates them in a strongly-typed object that is type-checked at compile time and associated with the prompt through the KernelArguments container. This approach ensures that parameter names are correct (no typos in string keys), parameter values have the expected types (a temperature cannot be assigned a string), and parameters are provider-appropriate (OpenAI-specific settings are used with OpenAI services, not with incompatible providers). Out-of-range values, such as a temperature outside the range a model accepts, are generally reported by the service at request time rather than by the compiler.

The execution settings architecture is hierarchical. A base PromptExecutionSettings class defines provider-agnostic properties (such as ModelId and ServiceId), while provider-specific subclasses (such as OpenAIPromptExecutionSettings) add parameters unique to that provider. This design allows application code to use the base type for portability or the derived type for full provider-specific control. The kernel's execution pipeline inspects the settings type at runtime and maps it to the appropriate API parameters for the resolved service.
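The base/derived split and the runtime type inspection can be illustrated with a small sketch. It uses plain Python stand-ins rather than the Semantic Kernel types: the class names BaseSettings and OpenAIStyleSettings, the function to_request_body, and the request-body keys are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class BaseSettings:
    """Provider-agnostic fields (hypothetical stand-in for the base settings class)."""
    model_id: Optional[str] = None
    service_id: Optional[str] = None


@dataclass
class OpenAIStyleSettings(BaseSettings):
    """Adds generation parameters that only one provider understands (hypothetical)."""
    temperature: Optional[float] = None
    max_tokens: Optional[int] = None
    top_p: Optional[float] = None


def to_request_body(settings: BaseSettings) -> dict:
    """Map a settings object to a request body, dispatching on its runtime type."""
    body = {}
    if settings.model_id is not None:
        body["model"] = settings.model_id
    if isinstance(settings, OpenAIStyleSettings):
        values = asdict(settings)
        # Send only the parameters the caller set; unset ones are left to
        # the service defaults.
        for name in ("temperature", "max_tokens", "top_p"):
            if values[name] is not None:
                body[name] = values[name]
    return body


portable = to_request_body(BaseSettings(model_id="example-model"))
specific = to_request_body(
    OpenAIStyleSettings(model_id="example-model", temperature=0.2, max_tokens=100)
)
```

Code written against BaseSettings stays portable, while code that constructs OpenAIStyleSettings gains the extra fields. A misspelled field name fails when the object is constructed instead of being silently ignored in a request body.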

Usage

Use execution settings whenever the default AI service parameters do not produce the desired output behavior. Common scenarios include: reducing temperature to zero for near-deterministic responses to factual queries, increasing max tokens for long-form content generation, setting response format to JSON for structured output, and adjusting penalties to reduce repetitive text. Execution settings can be specified per-invocation through KernelArguments or per-function through prompt configuration.

Theoretical Basis

The generation parameters controlled by execution settings correspond to well-defined concepts in language model theory:

Temperature (T):

Temperature scales the logits (raw output scores) before the softmax function, controlling the entropy of the output distribution:

P(token_i) = exp(logit_i / T) / sum_j(exp(logit_j / T))

T → 0: distribution collapses to argmax (deterministic, greedy decoding)
T = 1: standard softmax (model's natural distribution)
T → ∞: uniform distribution (maximum randomness)
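The three limiting cases can be checked numerically. The following is a minimal sketch of temperature-scaled softmax in plain Python; the function name and the example logits are made up for illustration and are not part of any library.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return P(token_i) = exp(logit_i / T) / sum_j exp(logit_j / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; T -> 0 is the greedy limit")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three tokens
cold = softmax_with_temperature(logits, 0.1)     # nearly all mass on the top token
neutral = softmax_with_temperature(logits, 1.0)  # the unscaled softmax
hot = softmax_with_temperature(logits, 100.0)    # close to uniform
```

A temperature of exactly zero would divide by zero in this formula, so implementations typically treat it as a special case that selects the argmax directly.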

Top-P (nucleus sampling):

Top-P selects the smallest set of tokens whose cumulative probability reaches or exceeds the threshold p:

V_p = smallest set V such that: sum_{token in V} P(token) >= p

Sampling is restricted to tokens in V_p.

p = 1.0: consider all tokens (equivalent to no filtering)
p = 0.1: consider only the most probable tokens, taken in descending order until their cumulative probability reaches 10%
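The construction of V_p can be sketched as follows. This is an illustrative implementation under the definition above, not the code any particular provider runs; the function names and the example distribution are assumptions.

```python
def nucleus(probs, p):
    """Return indices of the smallest token set with cumulative probability >= p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Visit tokens from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept


def renormalize(probs, kept):
    """Rescale the kept probabilities so they sum to 1 before sampling."""
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


probs = [0.5, 0.3, 0.15, 0.05]  # made-up distribution over four tokens
narrow = nucleus(probs, 0.1)    # only the single most probable token
medium = nucleus(probs, 0.7)    # the top two tokens
full = nucleus(probs, 1.0)      # every token: no filtering
sampling_distribution = renormalize(probs, medium)
```

After filtering, the surviving probabilities are rescaled so that sampling draws from a proper distribution over V_p.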

Max Tokens:

A hard upper bound on the number of tokens in the generated response:

|output| <= MaxTokens

The model stops generation when either:
  1. A stop sequence or end-of-sequence token is generated
  2. The token count reaches MaxTokens
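The two stopping conditions can be sketched as a simple generation loop. The model is replaced here by a scripted stand-in; next_token, scripted_model, the "<eos>" marker, and the "stop" and "length" labels are all hypothetical names used only for this illustration.

```python
def generate(next_token, max_tokens, stop_tokens=("<eos>",)):
    """Collect tokens until a stop token appears or max_tokens is reached.

    next_token is a hypothetical callable that receives the tokens produced
    so far and returns the next one; it stands in for the model.
    """
    output = []
    while len(output) < max_tokens:
        token = next_token(output)
        if token in stop_tokens:
            return output, "stop"    # condition 1: stop or end-of-sequence token
        output.append(token)
    return output, "length"          # condition 2: token budget exhausted


def scripted_model(script):
    """Build a fake model that replays a fixed token script, then emits <eos>."""
    def next_token(so_far):
        return script[len(so_far)] if len(so_far) < len(script) else "<eos>"
    return next_token


model = scripted_model(["The", "sky", "is", "blue"])
finished = generate(model, max_tokens=10)   # ends on the end-of-sequence token
truncated = generate(model, max_tokens=2)   # cut off by the token limit
```

A response cut off by the token limit may end mid-sentence, so callers that need complete output usually check which of the two conditions ended the generation.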

Frequency Penalty and Presence Penalty:

These parameters reduce repetition by penalizing tokens that have already appeared:

adjusted_logit(token) = logit(token)
  - frequency_penalty * count(token in output so far)
  - presence_penalty * (1 if token in output so far else 0)

Frequency penalty: penalizes proportionally to occurrence count (reduces highly repeated tokens)
Presence penalty: applies a flat, one-time penalty to any token that has appeared at least once, regardless of how often (encourages topic diversity)
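The adjusted-logit formula can be worked through on a small example. The sketch below applies both penalties to a made-up vocabulary of three tokens; the function name and the values are assumptions for illustration.

```python
from collections import Counter


def apply_penalties(logits, generated, frequency_penalty=0.0, presence_penalty=0.0):
    """Return a new token -> logit mapping with both penalties applied."""
    counts = Counter(generated)
    adjusted = {}
    for token, logit in logits.items():
        count = counts[token]
        adjusted[token] = (
            logit
            - frequency_penalty * count                    # grows with each repeat
            - presence_penalty * (1 if count > 0 else 0)   # flat, applied once
        )
    return adjusted


logits = {"cat": 2.0, "dog": 2.0, "bird": 2.0}  # equal scores before penalties
generated = ["cat", "cat", "cat", "dog"]        # output so far
adjusted = apply_penalties(
    logits, generated, frequency_penalty=0.5, presence_penalty=0.25
)
```

In this example "cat" (seen three times) drops to 0.25, "dog" (seen once) drops to 1.25, and "bird" (never seen) keeps its original score of 2.0, so the unseen token becomes the most likely next choice.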

Response Format:

Controls the structural format of the output:

ResponseFormat = "text"        → free-form text output
ResponseFormat = "json_object" → model is constrained to produce valid JSON
ResponseFormat = ChatResponseFormat.CreateJsonSchemaFormat(name, schema) → output conforms to a specific JSON schema
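From the consumer's side, the three modes differ in what can be assumed about the returned string. The sketch below shows one way a caller might check an output against the requested mode. It is a simplified illustration: check_output, the "json_schema" label, and the required mapping (key to expected Python type) are hypothetical and cover far less than a full JSON Schema validator.

```python
import json


def check_output(raw, response_format="text", required=None):
    """Check a model output string against the requested format.

    required is a hypothetical, simplified stand-in for a JSON schema:
    a mapping from required key to expected Python type.
    """
    if response_format == "text":
        return raw  # free-form text: nothing to validate
    if response_format not in ("json_object", "json_schema"):
        raise ValueError(f"unknown response format: {response_format}")
    value = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(value, dict):
        raise ValueError("expected a JSON object")
    if response_format == "json_schema":
        for key, expected_type in (required or {}).items():
            if not isinstance(value.get(key), expected_type):
                raise ValueError(
                    f"field {key!r} is missing or not {expected_type.__name__}"
                )
    return value


plain = check_output("The sky is blue.")
loose = check_output('{"answer": "blue"}', "json_object")
strict = check_output(
    '{"answer": "blue", "confidence": 0.9}',
    "json_schema",
    required={"answer": str, "confidence": float},
)
```

With a schema-constrained format the service is expected to enforce the structure itself, so a check like this acts as a defensive guard, not a replacement for that constraint.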

These parameters interact with each other and with the prompt to produce the final output distribution. Effective configuration requires understanding both the theoretical effect of each parameter and its practical impact on the specific model being used.

Related Pages

Implemented By

Implementation:Microsoft_Semantic_kernel_OpenAIPromptExecutionSettings
