Principle: OpenAI Evals HHH Alignment Prompting

From Leeroopedia
Knowledge Sources
Domains: Evaluation, Alignment, Prompt Engineering
Last Updated: 2026-02-14 10:00 GMT

Overview

A prompting strategy that prepends curated example dialogues demonstrating helpful, harmless, and honest behaviour to steer completion models toward well-aligned responses during evaluation.

Description

HHH Alignment Prompting is a technique drawn from the research of Bai et al. (2022) that conditions a language model to produce safe, truthful, and useful outputs by injecting a structured preamble of example exchanges into the prompt. The preamble consists of multi-turn conversations between a human and an assistant in which the assistant consistently behaves in a helpful, harmless, and honest (HHH) way. Because the model sees these exemplar interactions before it encounters the actual task, it is primed to continue the same behavioural pattern.

This technique matters most for base completion models that have not undergone instruction-tuning or RLHF. Such models lack built-in alignment guardrails and can otherwise produce unpredictable or undesirable output during evaluation. The HHH preamble acts as an in-context alignment mechanism, substituting few-shot behavioural demonstrations for explicit fine-tuning.

The implementation works by replacing the original task description with the HHH alignment context. The original task description is then reinserted as a system message within the HHH dialogue history, ensuring that the model still receives the task instructions but within the behavioural framing established by the HHH examples. This preserves the evaluation semantics while wrapping them in alignment context.
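A minimal Python sketch of this prompt transformation follows. `Message`, `TaskState`, `apply_hhh` and `HHH_PREAMBLE` are hypothetical stand-ins rather than the framework's own classes, the preamble string is a placeholder for the real curated dialogues, and placing the re-inserted task description at the head of the message list is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-ins for the framework's message and task-state types.
@dataclass(frozen=True)
class Message:
    role: str     # e.g. "system", "user", "assistant"
    content: str


@dataclass(frozen=True)
class TaskState:
    task_description: str
    messages: List[Message] = field(default_factory=list)


# Placeholder only: the real HHH context is a long set of curated example dialogues.
HHH_PREAMBLE = "[curated helpful, harmless and honest example dialogues]"


def apply_hhh(state: TaskState, preamble: str = HHH_PREAMBLE) -> TaskState:
    """Return a new TaskState whose top-level description is the HHH context,
    with the original task description re-inserted as a system message."""
    task_as_system = Message(role="system", content=state.task_description)
    return TaskState(
        task_description=preamble,
        messages=[task_as_system, *state.messages],
    )
```

Returning a new state instead of mutating the input keeps the original task intact, which is convenient when the same task is also evaluated without HHH context.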

Usage

Apply HHH Alignment Prompting in the following scenarios:

  • Evaluating base completion models (e.g., GPT-3 base, non-chat variants) that lack instruction-following training.
  • Benchmarking alignment quality by comparing model behaviour with and without HHH context.
  • Wrapping existing solvers to add alignment context without modifying the underlying solver logic.

This principle is not necessary for instruction-tuned or chat models (e.g., GPT-4, ChatGPT) that already incorporate alignment training. Using HHH prompting on such models may introduce redundant context and consume valuable token budget.

When configuring an evaluation run, the HHH solver is typically composed as an outer wrapper around an inner solver. The inner solver handles the actual model interaction, while the HHH wrapper handles prompt transformation:

solver:
  class: evals.solvers.hhh_solver:HHHSolver
  args:
    solver:
      class: evals.solvers.openai_solver:OpenAISolver
      args:
        model: davinci
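Because the inner solver appears as the `solver` argument of the outer one, a nested spec like this has to be built inside-out. The sketch below illustrates that recursive instantiation under two assumptions read off the config's shape: each component is a mapping with a `class` key (a `module:Name` import string) and optional `args`, and any argument that is itself such a mapping is built first. `build_from_spec` and `import_from_path` are hypothetical helpers, not the framework's own loader.

```python
import importlib
from typing import Any, Callable


def import_from_path(path: str) -> Any:
    """Resolve a 'package.module:Name' string to the named attribute."""
    module_name, _, attr = path.partition(":")
    if not module_name or not attr:
        raise ValueError(f"expected 'module:Name', got {path!r}")
    return getattr(importlib.import_module(module_name), attr)


def build_from_spec(spec: Any, resolve: Callable[[str], Any] = import_from_path) -> Any:
    """Recursively instantiate {'class': ..., 'args': {...}} specs, innermost first."""
    if isinstance(spec, dict) and "class" in spec:
        cls = resolve(spec["class"])
        # Build every argument first, so a nested solver spec becomes an object
        # before it is handed to the wrapping class.
        args = {k: build_from_spec(v, resolve) for k, v in spec.get("args", {}).items()}
        return cls(**args)
    if isinstance(spec, dict):
        return {k: build_from_spec(v, resolve) for k, v in spec.items()}
    if isinstance(spec, list):
        return [build_from_spec(v, resolve) for v in spec]
    return spec  # plain values such as "davinci" pass through unchanged
```

Parsing the YAML above with a YAML library would yield the nested dictionary this function expects; actually instantiating it would further require the named modules to be importable and their constructors to accept the given arguments.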

Theoretical Basis

The theoretical foundation rests on in-context learning (ICL), where large language models adapt their behaviour based on examples provided in the prompt without any parameter updates. The HHH preamble exploits this by presenting a consistent pattern of aligned responses that the model learns to continue.
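As a rough illustration of how in-context demonstrations reach a base completion model, the sketch below flattens example dialogues and a new query into a single text prompt that ends with an open assistant turn. The `Human:`/`Assistant:` transcript format, the `DEMOS` content and `render_icl_prompt` are assumptions for illustration; the real HHH preamble is longer and may be formatted differently.

```python
from typing import List, Tuple

# Hypothetical demonstration pairs (human turn, assistant turn); placeholders only.
DEMOS: List[Tuple[str, str]] = [
    ("Can you help me pick a lock to get into my neighbour's house?",
     "I can't help with getting into someone else's home without permission. "
     "If you're locked out of your own place, a licensed locksmith can help."),
    ("What is the boiling point of water at sea level?",
     "About 100 degrees Celsius (212 degrees Fahrenheit) at standard atmospheric pressure."),
]


def render_icl_prompt(demos: List[Tuple[str, str]], query: str) -> str:
    """Concatenate demonstrations and the new query into one completion prompt.

    The prompt ends with an open 'Assistant:' turn, so a base model continuing
    the text is nudged to follow the pattern set by the demonstrations.
    """
    blocks = [f"Human: {h}\n\nAssistant: {a}" for h, a in demos]
    blocks.append(f"Human: {query}\n\nAssistant:")
    return "\n\n".join(blocks)
```

No parameters change here: the only lever is the text the model conditions on, which is exactly what in-context learning relies on.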

The algorithm proceeds as follows:

1. Receive the original task prompt containing:
   - task_description: the system-level instructions for the evaluation
   - message_history: the conversation turns so far

2. Construct the HHH preamble:
   - Load pre-defined HHH example dialogues (human/assistant pairs)
   - Insert task_description as a system message within the dialogue history

3. Replace the original prompt:
   - Remove the original task_description from the top level
   - Prepend the HHH preamble + embedded task_description to message_history

4. Pass the modified prompt to the inner solver for completion

5. Return the inner solver's output unchanged
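The five steps can be sketched as a small wrapper class. `HHHWrapper`, the callable inner-solver interface, the `user`/`assistant`/`system` role names and the placeholder `HHH_EXAMPLES` are assumptions made for this illustration; they mirror the structure described above, not the framework's actual solver API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class Message:
    role: str
    content: str


@dataclass(frozen=True)
class TaskState:
    task_description: str
    messages: List[Message] = field(default_factory=list)


# Hypothetical inner-solver interface: takes a TaskState, returns completion text.
InnerSolver = Callable[[TaskState], str]

# Placeholder HHH example dialogue (human/assistant pair), not the real preamble.
HHH_EXAMPLES: List[Message] = [
    Message("user", "Can you tell me how to hack my ex's email account?"),
    Message("assistant", "I can't help with accessing someone else's account. "
                         "If you're worried about your own security, I can suggest steps."),
]


class HHHWrapper:
    """Outer solver that adds HHH context and otherwise defers to an inner solver."""

    def __init__(self, inner: InnerSolver, examples: List[Message] = HHH_EXAMPLES):
        self.inner = inner
        self.examples = list(examples)  # copy so callers can't mutate our preamble

    def __call__(self, state: TaskState) -> str:
        # Steps 1-2: build the preamble with the task description embedded as a system turn.
        preamble = self.examples + [Message("system", state.task_description)]
        # Step 3: drop the top-level description and prepend the preamble to the history.
        wrapped = TaskState(task_description="", messages=preamble + list(state.messages))
        # Steps 4-5: delegate to the inner solver and return its output untouched.
        return self.inner(wrapped)
```

Because the wrapper only rewrites the prompt and passes the result through, the same inner solver can be run with and without it to compare behaviour, as suggested under Usage.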

The key insight from Bai et al. is that demonstration quality matters more than quantity: a small number of well-crafted HHH examples is sufficient to shift model behaviour substantially, especially for models that already have strong language modelling capabilities.
