Implementation:Openai Evals Lambada

Knowledge Sources	Openai_Evals
Domains	Evaluation, Language Modeling
Last Updated	2026-02-14 10:00 GMT

Overview

Concrete eval for measuring next-word prediction accuracy on the LAMBADA benchmark, provided by the evals library.

Description

The Lambada class implements a language-model evaluation based on the LAMBADA benchmark (loaded from the EleutherAI/lambada_openai HuggingFace dataset). For each sample, it splits the text into a context (all words except the last) and an expected answer (the final word). It then constructs a prompt asking the model to predict the most likely next word, calls the completion function with temperature=0.0 and max_tokens=8, and uses evals.record_and_check_match to record whether the sampled output matches the expected word. The run method loads the specified subset's test split, evaluates all samples, and returns an accuracy metric computed from recorded match events.

Usage

Import Lambada when configuring an eval that tests a language model's ability to predict the final word of naturally occurring passages. The class is typically registered in an eval YAML spec with a subset parameter (e.g., "en" for English) selecting the LAMBADA language variant.

Code Reference

Source Location

Repository: Openai_Evals
File: evals/elsuite/lambada.py
Lines: 1-47

Signature

class Lambada(evals.Eval):
    def __init__(
        self,
        completion_fns: list[CompletionFn],
        subset: str,
        *args,
        **kwargs,
    ):
        ...

    def eval_sample(self, sample, rng):
        ...

    def run(self, recorder: RecorderBase) -> dict:
        ...

Import

from evals.elsuite.lambada import Lambada

I/O Contract

Inputs

init

Name	Type	Required	Description
completion_fns	list[CompletionFn]	Yes	List containing exactly one completion function to evaluate
subset	str	Yes	Language subset of the LAMBADA dataset to use (e.g., "en", "de", "fr")
*args	Any	No	Positional arguments forwarded to the parent evals.Eval constructor
**kwargs	Any	No	Keyword arguments forwarded to the parent evals.Eval constructor

eval_sample

Name	Type	Required	Description
sample	dict	Yes	A single dataset row with a "text" field containing a complete sentence
rng	Random	Yes	Random number generator (unused in this eval)

run

Name	Type	Required	Description
recorder	RecorderBase	Yes	Recorder instance that collects match events during evaluation

Outputs

run

Name	Type	Description
accuracy	float	Fraction of samples where the model's prediction exactly matched the expected final word

Usage Examples

from evals.elsuite.lambada import Lambada
from evals.api import CompletionFn
from evals.record import RecorderBase

# Assuming `my_completion_fn` is a configured CompletionFn instance
# and `recorder` is a RecorderBase instance:
lambada_eval = Lambada(
    completion_fns=[my_completion_fn],
    subset="en",
)
results = lambada_eval.run(recorder)
print(f"Accuracy: {results['accuracy']:.2%}")

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment