Implementation:Openai Openai node Evals Resource

From Leeroopedia
Knowledge Sources
Domains SDK, Evals
Last Updated 2026-02-15 12:00 GMT

Overview

The Evals class implements the Evals resource of the openai-node SDK. It provides methods to create, retrieve, update, list, and delete evaluations, each of which defines the testing criteria used to assess model performance.

Description

The Evals class extends APIResource and wraps the /evals REST endpoints. It is accessed via client.evals and provides full CRUD operations for managing evaluation definitions. An evaluation (Eval) represents a task to be tested against your LLM integration, such as improving chatbot quality, testing customer support scenarios, or comparing model performance.

Each evaluation is defined by a data_source_config and a set of testing_criteria. The data source config can be custom (with a user-defined JSON schema), logs (querying stored logs by metadata), or stored_completions (deprecated). Testing criteria are defined as graders, which can be of several types: LabelModelGrader (uses a model to assign labels), StringCheckGrader (string matching), TextSimilarityGrader (similarity metrics with a pass threshold), PythonGrader (runs a Python script), or ScoreModelGrader (uses a model to assign numeric scores).
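The data source and grader variants described above can be sketched as discriminated unions keyed on the type field. This is a minimal illustration: the DataSourceConfig and Grader type names and the describeGrader helper are hypothetical local definitions, not SDK exports, and only two of the five grader kinds are modelled. The field names follow the request shapes as I understand them from the API reference.

```typescript
// Simplified local mirrors of the request shapes. The SDK ships its own,
// richer types; the names below are illustrative assumptions.
type DataSourceConfig =
  | { type: 'custom'; item_schema: Record<string, unknown>; include_sample_schema?: boolean }
  | { type: 'logs'; metadata?: Record<string, unknown> }
  | { type: 'stored_completions'; metadata?: Record<string, unknown> }; // deprecated

type Grader =
  | {
      type: 'string_check';
      name: string;
      input: string;
      reference: string;
      operation: 'eq' | 'ne' | 'like' | 'ilike';
    }
  | {
      type: 'text_similarity';
      name: string;
      input: string;
      reference: string;
      evaluation_metric: string;
      pass_threshold: number;
    };

// A custom data source: the caller supplies the JSON schema of each item.
const dataSource: DataSourceConfig = {
  type: 'custom',
  item_schema: {
    type: 'object',
    properties: { question: { type: 'string' }, expected_answer: { type: 'string' } },
    required: ['question', 'expected_answer'],
  },
  include_sample_schema: true,
};

// Exact string comparison between the model output and the reference.
const exact: Grader = {
  type: 'string_check',
  name: 'exact_match',
  input: '{{sample.output_text}}',
  reference: '{{item.expected_answer}}',
  operation: 'eq',
};

// Similarity metric with a pass threshold.
const fuzzy: Grader = {
  type: 'text_similarity',
  name: 'close_enough',
  input: '{{sample.output_text}}',
  reference: '{{item.expected_answer}}',
  evaluation_metric: 'fuzzy_match',
  pass_threshold: 0.8,
};

// Narrowing on `type` gives access to the fields specific to each grader kind.
function describeGrader(g: Grader): string {
  switch (g.type) {
    case 'string_check':
      return `${g.name}: ${g.input} ${g.operation} ${g.reference}`;
    case 'text_similarity':
      return `${g.name}: ${g.evaluation_metric} with pass_threshold ${g.pass_threshold}`;
  }
}
```

Objects shaped like dataSource, exact, and fuzzy are what would be passed as data_source_config and testing_criteria to client.evals.create.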

The resource also exposes a runs sub-resource (of type Runs) for creating and managing evaluation runs that execute the evaluation against specific data and models. Key response types (EvalCreateResponse, EvalRetrieveResponse, EvalUpdateResponse, EvalListResponse) all share the same structure: id, created_at, data_source_config, metadata, name, object ('eval'), and testing_criteria.
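Because the four response types share one structure, a single helper can handle any of them. The sketch below assumes a hypothetical structural interface, EvalLike, that lists only the shared fields, and a hypothetical summarizeEval helper. It also assumes created_at is a Unix timestamp in seconds, as I understand the API reference to state.

```typescript
// Hypothetical structural type covering the fields shared by
// EvalCreateResponse, EvalRetrieveResponse, EvalUpdateResponse and
// EvalListResponse. Nested shapes are reduced to what the helper reads.
interface EvalLike {
  id: string;
  created_at: number; // assumed: Unix timestamp in seconds
  name: string;
  object: 'eval';
  metadata: Record<string, string> | null;
  data_source_config: { type: string };
  testing_criteria: Array<{ type: string; name: string }>;
}

// Build a one-line, human-readable summary of an eval.
function summarizeEval(e: EvalLike): string {
  const created = new Date(e.created_at * 1000).toISOString();
  const graders = e.testing_criteria.map((g) => `${g.name}(${g.type})`).join(', ');
  return `${e.id} "${e.name}" [${e.data_source_config.type}] created ${created}; graders: ${graders}`;
}

// Hand-written sample for illustration only, not real API output.
const sampleEval: EvalLike = {
  id: 'eval_123',
  created_at: 1700000000,
  name: 'Demo',
  object: 'eval',
  metadata: null,
  data_source_config: { type: 'custom' },
  testing_criteria: [{ type: 'string_check', name: 'exact_match' }],
};

console.log(summarizeEval(sampleEval));
```

Typing the parameter structurally lets the same function accept the result of create, retrieve, update, or an item yielded by list, provided the SDK's response types stay compatible with these fields.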

Usage

Use this resource to define and manage evaluation configurations. After creating an evaluation with appropriate data source config and graders, use the runs sub-resource to execute the evaluation against different models and parameters.
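The hand-off from an evaluation to a run can be sketched as follows. The helper builds a run body with an inline jsonl data source, a shape I believe the run-creation endpoint accepts. The names JsonlRunBody, RunsLike, buildJsonlRunBody, and startRun are hypothetical, and RunsLike is a deliberately narrow stand-in for client.evals.runs so the sketch stays self-contained.

```typescript
type Item = Record<string, unknown>;

// Assumed request shape for a run that supplies its rows inline.
interface JsonlRunBody {
  name: string;
  data_source: {
    type: 'jsonl';
    source: { type: 'file_content'; content: Array<{ item: Item }> };
  };
}

// Wrap each row as { item: row }, matching the eval's item_schema.
function buildJsonlRunBody(name: string, items: Item[]): JsonlRunBody {
  return {
    name,
    data_source: {
      type: 'jsonl',
      source: { type: 'file_content', content: items.map((item) => ({ item })) },
    },
  };
}

// Hypothetical minimal view of the runs sub-resource. With the real SDK you
// would pass client.evals.runs, assuming its create method accepts this body.
interface RunsLike {
  create(evalID: string, body: JsonlRunBody): Promise<{ id: string }>;
}

// Start a run of an existing eval against the given rows.
async function startRun(
  runs: RunsLike,
  evalID: string,
  name: string,
  items: Item[],
): Promise<string> {
  const run = await runs.create(evalID, buildJsonlRunBody(name, items));
  return run.id;
}
```

Each row here supplies only an item. A run that also has to generate model output, which graders referencing the sample namespace rely on, would use a different data source type, not shown in this sketch.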

Code Reference

Source Location

Signature

export class Evals extends APIResource {
  runs: RunsAPI.Runs;

  create(body: EvalCreateParams, options?: RequestOptions): APIPromise<EvalCreateResponse>;

  retrieve(evalID: string, options?: RequestOptions): APIPromise<EvalRetrieveResponse>;

  update(evalID: string, body: EvalUpdateParams, options?: RequestOptions): APIPromise<EvalUpdateResponse>;

  list(
    query?: EvalListParams | null,
    options?: RequestOptions,
  ): PagePromise<EvalListResponsesPage, EvalListResponse>;

  delete(evalID: string, options?: RequestOptions): APIPromise<EvalDeleteResponse>;
}

Import

import OpenAI from 'openai';
// Access via client.evals

I/O Contract

Inputs

Name Type Required Description
data_source_config (create) Custom | Logs | StoredCompletions Yes Configuration for the data source; Custom requires an item_schema JSON schema
testing_criteria (create) Array<LabelModel | StringCheckGrader | TextSimilarity | Python | ScoreModel> Yes List of graders defining how to evaluate results
name (create) string No The name of the evaluation
metadata (create/update) Metadata | null No Up to 16 key-value pairs for structured storage
evalID (retrieve/update/delete) string Yes The ID of the evaluation
order (list) 'asc' | 'desc' No Sort order for evals by timestamp
order_by (list) 'created_at' | 'updated_at' No Field to sort by
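The metadata limit in the table can be checked on the client before sending a request. The sketch below uses a hypothetical metadataProblems helper. The 16-pair limit comes from the table above. The 64-character key limit and the 512-character value limit are taken from my reading of the OpenAI API reference and should be treated as assumptions here.

```typescript
const MAX_PAIRS = 16; // from the Inputs table
const MAX_KEY_LENGTH = 64; // assumption, per the API reference
const MAX_VALUE_LENGTH = 512; // assumption, per the API reference

// Return a list of human-readable problems; an empty list means the
// metadata satisfies the limits checked here.
function metadataProblems(metadata: Record<string, string> | null | undefined): string[] {
  if (metadata == null) return [];
  const problems: string[] = [];
  const keys = Object.keys(metadata);
  if (keys.length > MAX_PAIRS) {
    problems.push(`too many pairs: ${keys.length} > ${MAX_PAIRS}`);
  }
  for (const key of keys) {
    const value = metadata[key] ?? '';
    if (key.length > MAX_KEY_LENGTH) {
      problems.push(`key too long: ${key.slice(0, 16)}...`);
    }
    if (value.length > MAX_VALUE_LENGTH) {
      problems.push(`value too long for key: ${key}`);
    }
  }
  return problems;
}
```

A caller might run this before client.evals.create or client.evals.update and refuse to send the request when the returned list is non-empty. The server remains the authority on what it accepts.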

Outputs

Name Type Description
EvalCreateResponse EvalCreateResponse Created eval with id, created_at, data_source_config, metadata, name, object ('eval'), testing_criteria
EvalRetrieveResponse EvalRetrieveResponse Retrieved eval with same structure
EvalUpdateResponse EvalUpdateResponse Updated eval with same structure
EvalListResponse EvalListResponse Paginated list item with same structure
EvalDeleteResponse EvalDeleteResponse Object with deleted (boolean), eval_id, and object fields

Usage Examples

import OpenAI from 'openai';

const client = new OpenAI();

// Create an evaluation with a custom data source and string check grader
const eval_ = await client.evals.create({
  name: 'My Chatbot Quality Eval',
  data_source_config: {
    type: 'custom',
    item_schema: {
      type: 'object',
      properties: {
        question: { type: 'string' },
        expected_answer: { type: 'string' },
      },
      required: ['question', 'expected_answer'],
    },
    // Needed because the grader below references the sample namespace
    include_sample_schema: true,
  },
  testing_criteria: [
    {
      type: 'string_check',
      name: 'exact_match',
      input: '{{sample.output_text}}',
      reference: '{{item.expected_answer}}',
      operation: 'eq',
    },
  ],
});
console.log(eval_.id);

// Retrieve an evaluation
const retrieved = await client.evals.retrieve(eval_.id);

// Update an evaluation
const updated = await client.evals.update(eval_.id, {
  name: 'Renamed Eval',
  metadata: { version: 'v2' },
});

// List evaluations
for await (const e of client.evals.list({ order: 'desc' })) {
  console.log(e.id, e.name);
}

// Delete an evaluation
const deleted = await client.evals.delete(eval_.id);

Related Pages
