Implementation:PacktPublishing LLM Engineers Handbook OpenAI Chat Completions

From Leeroopedia


Overview

OpenAI Chat Completions implements the Principle:PacktPublishing_LLM_Engineers_Handbook_LLM_As_Judge_Evaluation principle. It calls the OpenAI Chat Completions API with GPT-4o-mini as the judge model to score generated answers on accuracy and style, and it spreads those calls across multiple threads so that evaluations run in parallel.

Aspect | Detail
Implementation Name | OpenAI Chat Completions
Workflow | Model_Evaluation
Type | Wrapper Doc (OpenAI)
Source File | llm_engineering/model/evaluation/evaluate.py (Lines 55–165)
Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_LLM_As_Judge_Evaluation

API Signatures

Single Answer Evaluation

def evaluate_answer(instruction: str, answer: str, client: OpenAI) -> dict

Batch Evaluation

def evaluate_answers(model_id: str, num_threads: int = 10, batch_size: int = 5) -> Dataset

Internally calls:

client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    response_format={"type": "json_object"},
    max_tokens=1000,
    temperature=0.9,
)

Key Code

Single Answer Evaluation

def evaluate_answer(instruction: str, answer: str, client: OpenAI) -> dict:
    message = f"""Score the following answer on accuracy (1-3) and style (1-3).

    Instruction: {instruction}
    Answer: {answer}

    Return JSON: {{"accuracy": int, "style": int, "evaluation": "explanation"}}"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        max_tokens=1000,
        temperature=0.9,
        messages=[{"role": "user", "content": message}],
    )

    return json.loads(response.choices[0].message.content)

Batch Evaluation with Threading

def evaluate_answers(model_id: str, num_threads: int = 10, batch_size: int = 5) -> Dataset:
    # Load the results dataset with generated answers
    dataset = load_dataset(
        f"{MODEL_HUGGINGFACE_WORKSPACE}/{model_id.split('/')[-1]}-results",
        split="all"
    )

    client = OpenAI()

    # Use ThreadPoolExecutor for parallel evaluation
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = []
        for row in dataset:
            future = executor.submit(
                evaluate_answer,
                instruction=row["instruction"],
                answer=row["answers"],
                client=client,
            )
            futures.append(future)

        results = [f.result() for f in tqdm(futures, desc="Evaluating")]

    # Extract scores and add to dataset
    accuracies = [r["accuracy"] for r in results]
    styles = [r["style"] for r in results]
    evaluations = [r["evaluation"] for r in results]

    dataset = dataset.add_column("accuracy", accuracies)
    dataset = dataset.add_column("style", styles)
    dataset = dataset.add_column("evaluation", evaluations)

    dataset.push_to_hub(
        f"{MODEL_HUGGINGFACE_WORKSPACE}/{model_id.split('/')[-1]}-results"
    )

    return dataset

Imports

from openai import OpenAI
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm
from datasets import Dataset, load_dataset  # Dataset is needed for the return annotation
import json

Inputs

evaluate_answer

Parameter | Type | Description
instruction | str | The original prompt/instruction that was given to the model
answer | str | The generated answer from the fine-tuned model to be evaluated
client | OpenAI | An initialized OpenAI client instance

evaluate_answers

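Parameter | Type | Description
model_id | str | HuggingFace model identifier; its last path segment, suffixed with "-results", locates the results dataset on the Hub
num_threads | int | Number of concurrent threads for parallel evaluation. Default: 10.
batch_size | int | Batch size for processing. Default: 5. The excerpt under Key Code does not reference this parameter.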
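
The excerpt on this page does not show how batch_size is consumed. As a minimal sketch of one plausible use, assuming the rows are grouped into fixed-size chunks before being handed to worker threads, a hypothetical helper named chunked (not part of the source file) could look like this:

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(rows: Sequence[T], batch_size: int = 5) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items.

    Hypothetical helper for illustration only; the source excerpt does
    not define it.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        # The final slice may be shorter than batch_size.
        yield list(rows[start:start + batch_size])


batches = list(chunked(list(range(12)), batch_size=5))
```

With 12 rows and a batch size of 5, this produces two full batches and one final batch of two rows.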

Outputs

evaluate_answer

Key | Type | Description
accuracy | int | Score from 1 (poor) to 3 (excellent) measuring factual correctness and relevance
style | int | Score from 1 (poor) to 3 (excellent) measuring clarity, formatting, and writing quality
evaluation | str | Free-text explanation of the scores from the judge model
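
The three keys above are requested in the prompt but are not checked by the excerpted code, so a malformed judge response would surface later as a KeyError. A minimal sketch of a check that could run on the parsed dictionary follows; validate_judgement is a hypothetical helper, not a function from the source file.

```python
from typing import Any, Dict

VALID_SCORES = {1, 2, 3}  # the 1-3 scale described in the table above


def validate_judgement(result: Dict[str, Any]) -> Dict[str, Any]:
    """Return result unchanged if it matches the expected shape.

    Hypothetical helper: raises ValueError on a missing key, a
    non-integer score, or a score outside 1-3.
    """
    for key in ("accuracy", "style"):
        score = result.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or score not in VALID_SCORES:
            raise ValueError(f"{key!r} must be an integer from 1 to 3, got {score!r}")
    if not isinstance(result.get("evaluation"), str):
        raise ValueError("'evaluation' must be a string")
    return result
```

Calling it directly on the return value of evaluate_answer would turn a silent schema drift into an explicit error at the point where it happens.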

evaluate_answers

Returns a Dataset augmented with accuracy, style, and evaluation columns. The dataset is also pushed to HuggingFace Hub.
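
Once the accuracy and style columns exist, a per-model summary is a simple average over rows. The sketch below assumes rows are available as plain dictionaries with those two keys; summarize_scores is a hypothetical name and the sample rows are made up for illustration.

```python
from typing import Dict, Iterable, Mapping


def summarize_scores(rows: Iterable[Mapping[str, int]]) -> Dict[str, float]:
    """Average the accuracy and style scores over all rows.

    Hypothetical helper; expects each row to carry integer
    'accuracy' and 'style' values.
    """
    rows = list(rows)
    if not rows:
        raise ValueError("cannot summarize an empty result set")
    return {
        "accuracy": sum(row["accuracy"] for row in rows) / len(rows),
        "style": sum(row["style"] for row in rows) / len(rows),
    }


# Illustrative rows, not real evaluation results.
sample_rows = [
    {"accuracy": 3, "style": 2},
    {"accuracy": 2, "style": 2},
    {"accuracy": 1, "style": 2},
]
summary = summarize_scores(sample_rows)
```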

OpenAI API Configuration

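Parameter | Value | Purpose
model | "gpt-4o-mini" | Cost-effective judge model with strong instruction-following and JSON output capabilities
response_format | {"type": "json_object"} | Enables JSON mode so that the response content parses as JSON. It does not enforce the three-key schema, and a response cut off at max_tokens can be incomplete.
max_tokens | 1000 | Sufficient for the scoring JSON plus explanation text
temperature | 0.9 | A relatively high sampling temperature, so scores and explanations can differ between runs on the same input. Useful for measuring robustness; lower values give more repeatable scores.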
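
Because a response that stops at the max_tokens limit can contain truncated JSON, it may help to guard the json.loads call. The sketch below is one way to do that, assuming the caller passes in the message content and the choice's finish_reason; parse_judge_content is a hypothetical helper and returning None on failure is a design choice for illustration, not the behavior of the source file.

```python
import json
from typing import Any, Dict, Optional


def parse_judge_content(
    content: Optional[str], finish_reason: str = "stop"
) -> Optional[Dict[str, Any]]:
    """Parse the judge's message content, or return None if unusable.

    Hypothetical helper. A finish_reason of "length" means generation
    stopped at the token limit, so the JSON may be cut off.
    """
    if content is None or finish_reason == "length":
        return None
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError:
        return None
    # The prompt asks for a JSON object; reject arrays or bare scalars.
    return parsed if isinstance(parsed, dict) else None
```

A caller could then skip or retry rows for which the helper returns None instead of letting one bad response abort the whole run.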

Threading Architecture

The evaluate_answers function uses Python's ThreadPoolExecutor for parallel evaluation:

  • Why threads, not processes: The workload is I/O-bound (waiting for OpenAI API responses), not CPU-bound. Threads blocked on network I/O release the GIL, so the waits overlap without the startup and serialization overhead of separate processes.
  • Default 10 threads: Balances throughput against OpenAI API rate limits. Too many threads may trigger rate limiting; too few leave the client idle while it waits on responses.
  • Progress tracking: tqdm provides a progress bar over the futures, giving visibility into evaluation progress.
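
The pattern described above can be tried without network access. In this minimal sketch, fake_judge_call is a hypothetical stand-in that sleeps instead of calling the API, and run_parallel mirrors the submit-then-collect structure of evaluate_answers; the timing shows that the waits overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def fake_judge_call(item: int) -> int:
    """Stand-in for a network-bound API call (hypothetical; it only sleeps)."""
    time.sleep(0.05)
    return item * 2


def run_parallel(items: Sequence[int], num_threads: int = 10) -> List[int]:
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [executor.submit(fake_judge_call, item) for item in items]
        # Reading futures in submission order keeps results aligned with inputs.
        return [future.result() for future in futures]


start = time.perf_counter()
results = run_parallel(list(range(20)))
elapsed = time.perf_counter() - start
# Run one after another, 20 calls of 0.05 s each would take about 1 s;
# with 10 threads they complete in roughly two rounds.
```

Collecting results in submission order rather than completion order is what lets the excerpted code append the score columns back onto the dataset row by row.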

External Dependencies

Dependency | Purpose
openai | OpenAI Python client for Chat Completions API calls
concurrent.futures | Python standard library for thread pool execution
tqdm | Progress bar for monitoring evaluation progress
datasets | HuggingFace Datasets library for loading and pushing evaluation results
