Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Openai Evals Few Shot Prompting

From Leeroopedia
Revision as of 18:01, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Openai_Evals_Few_Shot_Prompting.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Knowledge Sources
Domains Evaluation, Prompting Strategy, In-Context Learning
Last Updated 2026-02-14 10:00 GMT

Overview

A principle that improves model performance by prepending example input-output pairs to the prompt, enabling the model to learn task patterns through in-context demonstration without any parameter updates.

Description

Few-shot prompting leverages the in-context learning capability of large language models by providing a small number of demonstration examples (shots) before the actual query. The model observes the pattern established by these examples and applies it to the new input, effectively learning the task format, expected output style, and domain-specific conventions on the fly.

The Openai Evals framework implements few-shot prompting through the FewShotSolver class, which manages the critical aspects of example selection and prompt construction:

  • Shot count (n_shots): The number of demonstration examples prepended to the prompt. Common configurations include 0-shot (no examples, baseline), 1-shot, 5-shot, and 25-shot. The optimal number depends on the task complexity, model capacity, and available context window.
  • Sampling strategy: Examples can be selected through different approaches:
    • Random sampling -- uniformly drawn from the training set for each query.
    • Fixed sampling -- the same set of examples used for every query, ensuring reproducibility.
    • Stratified sampling -- ensuring balanced representation across categories or difficulty levels.
  • Contamination checking: A critical concern in few-shot evaluation is ensuring that the demonstration examples do not leak information about the test query. The framework must verify that few-shot examples are drawn from a separate pool and do not include the answer to the current question or near-duplicate questions.
  • Prompt construction: The few-shot examples are formatted and inserted into the conversation history as alternating user/assistant message pairs, establishing the expected interaction pattern before the actual query is presented.

Usage

Apply few-shot prompting when:

  • The task has a specific format or convention that the model might not follow with zero-shot prompting alone (e.g., answering with just a letter, providing structured JSON output).
  • You want to improve accuracy on domain-specific tasks where the model benefits from seeing worked examples.
  • You need to establish a baseline comparison between zero-shot and few-shot performance to understand the value of demonstrations.
  • The task involves specialized vocabulary or notation that examples can clarify.
  • You want to reduce output variance by anchoring the model's behavior with consistent demonstrations.

Avoid few-shot prompting when:

  • The context window is too limited to accommodate both examples and the query.
  • The task is simple enough that zero-shot performance is already near-ceiling.
  • There is a risk of contamination between the example pool and the test set.

Theoretical Basis

Few-shot prompting is grounded in the theory of in-context learning (ICL), first systematically studied by Brown et al. (2020). The key insight is that large language models, trained on diverse text corpora, develop the ability to identify and apply patterns from demonstrations without gradient updates.

Formal setup:

Given:
  D = {(x_1, y_1), (x_2, y_2), ..., (x_k, y_k)}  -- k demonstration examples
  x_q                                                -- query input

Prompt construction:
  prompt = format(x_1) + format(y_1) + ... + format(x_k) + format(y_k) + format(x_q)

Prediction:
  y_q = model.generate(prompt)

Scaling behavior with shot count:

Performance typically follows a logarithmic curve:

  accuracy(k) ~ a * log(k) + b    for k >= 1

where:
  k = number of shots
  a = task-dependent learning rate from examples
  b = zero-shot baseline performance

This means early shots provide the largest marginal improvement, with diminishing returns as more examples are added.

In-context learning as implicit Bayesian inference:

One theoretical interpretation is that the model performs implicit Bayesian inference over a latent task variable:

P(y_q | x_q, D) = integral over T of P(y_q | x_q, T) * P(T | D) dT

where:
  T = latent task concept
  P(T | D) = posterior over tasks given demonstrations
  P(y_q | x_q, T) = prediction given the inferred task

The demonstrations narrow the model's uncertainty about which task is being performed, leading to more accurate predictions.

Prompt format for the FewShotSolver:

[system] Task description
[user]   Example question 1
[assistant] Example answer 1
[user]   Example question 2
[assistant] Example answer 2
...
[user]   Example question k
[assistant] Example answer k
[user]   Actual query

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment