Implementation:PacktPublishing LLM Engineers Handbook OpenAI Chat Completions

Overview

OpenAI Chat Completions implements Principle:PacktPublishing_LLM_Engineers_Handbook_LLM_As_Judge_Evaluation. It calls the OpenAI Chat Completions API with GPT-4o-mini as the judge model to score generated answers on accuracy and style, and runs the evaluations in parallel across multiple threads.

Aspect	Detail
Implementation Name	OpenAI Chat Completions
Workflow	Model_Evaluation
Type	Wrapper Doc (OpenAI)
Source File	llm_engineering/model/evaluation/evaluate.py (Lines 55–165)
Implements	Principle:PacktPublishing_LLM_Engineers_Handbook_LLM_As_Judge_Evaluation

API Signatures

Single Answer Evaluation

def evaluate_answer(instruction: str, answer: str, client: OpenAI) -> dict

Batch Evaluation

def evaluate_answers(model_id: str, num_threads: int = 10, batch_size: int = 5) -> Dataset

Each evaluation is carried out by evaluate_answer, which internally calls:

client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    response_format={"type": "json_object"},
    max_tokens=1000,
    temperature=0.9,
)

Key Code

Single Answer Evaluation

def evaluate_answer(instruction: str, answer: str, client: OpenAI) -> dict:
    message = f"""Score the following answer on accuracy (1-3) and style (1-3).

    Instruction: {instruction}
    Answer: {answer}

    Return JSON: {{"accuracy": int, "style": int, "evaluation": "explanation"}}"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        max_tokens=1000,
        temperature=0.9,
        messages=[{"role": "user", "content": message}],
    )

    return json.loads(response.choices[0].message.content)

Batch Evaluation with Threading

def evaluate_answers(model_id: str, num_threads: int = 10, batch_size: int = 5) -> Dataset:
    # Load the results dataset with generated answers
    dataset = load_dataset(
        f"{MODEL_HUGGINGFACE_WORKSPACE}/{model_id.split('/')[-1]}-results",
        split="all"
    )

    client = OpenAI()

    # Use ThreadPoolExecutor for parallel evaluation
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = []
        for row in dataset:
            future = executor.submit(
                evaluate_answer,
                instruction=row["instruction"],
                answer=row["answers"],
                client=client,
            )
            futures.append(future)

        results = [f.result() for f in tqdm(futures, desc="Evaluating")]

    # Extract scores and add to dataset
    accuracies = [r["accuracy"] for r in results]
    styles = [r["style"] for r in results]
    evaluations = [r["evaluation"] for r in results]

    dataset = dataset.add_column("accuracy", accuracies)
    dataset = dataset.add_column("style", styles)
    dataset = dataset.add_column("evaluation", evaluations)

    dataset.push_to_hub(
        f"{MODEL_HUGGINGFACE_WORKSPACE}/{model_id.split('/')[-1]}-results"
    )

    return dataset

Imports

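import json
from concurrent.futures import ThreadPoolExecutor

from datasets import Dataset, load_dataset  # Dataset is needed for the return annotation
from openai import OpenAI
from tqdm import tqdm

# MODEL_HUGGINGFACE_WORKSPACE is assumed to come from the project's configuration;
# its definition is not shown in this excerpt.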

Inputs

evaluate_answer

Parameter	Type	Description
`instruction`	`str`	The original prompt/instruction that was given to the model
`answer`	`str`	The generated answer from the fine-tuned model to be evaluated
`client`	`OpenAI`	An initialized OpenAI client instance

evaluate_answers

Parameter	Type	Description
`model_id`	`str`	HuggingFace model identifier, used to locate the results dataset on Hub
`num_threads`	`int`	Number of concurrent threads for parallel evaluation. Default: 10.
`batch_size`	`int`	Batch size for processing. Default: 5. Not referenced in the excerpt shown under Key Code.
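The way `model_id` is mapped to a results dataset can be sketched as follows. This is a minimal illustration based on the f-string in the Key Code excerpt; the helper name `results_repo_id` and the workspace value `example-workspace` are hypothetical and do not come from the source file.

```python
# Hypothetical placeholder; the real value comes from the project's configuration.
MODEL_HUGGINGFACE_WORKSPACE = "example-workspace"


def results_repo_id(model_id: str, workspace: str = MODEL_HUGGINGFACE_WORKSPACE) -> str:
    """Build the Hub repo ID of the results dataset for a given model ID."""
    # Keep only the model name, dropping any "namespace/" prefix.
    model_name = model_id.split("/")[-1]
    return f"{workspace}/{model_name}-results"


# Both forms resolve to the same results dataset.
print(results_repo_id("some-user/MyModel-DPO"))
print(results_repo_id("MyModel-DPO"))
```

Because only the last path segment is kept, the namespace in `model_id` has no effect on which dataset is loaded or pushed.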

Outputs

evaluate_answer

Key	Type	Description
`accuracy`	`int`	Score from 1 (poor) to 3 (excellent) measuring factual correctness and relevance
`style`	`int`	Score from 1 (poor) to 3 (excellent) measuring clarity, formatting, and writing quality
`evaluation`	`str`	Free-text explanation of the scores from the judge model
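The excerpt returns the parsed JSON without checking it, so a caller may want to verify the keys and score ranges listed above before using the result. The following is a minimal sketch of such a check; `validate_judgement` is a hypothetical helper and is not part of the source file.

```python
def validate_judgement(result: dict) -> dict:
    """Check that a parsed judge response has the expected keys and score ranges."""
    for key in ("accuracy", "style"):
        score = result.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(score, int) or isinstance(score, bool) or not 1 <= score <= 3:
            raise ValueError(f"{key!r} must be an integer from 1 to 3, got {score!r}")
    if not isinstance(result.get("evaluation"), str):
        raise ValueError("'evaluation' must be a string")
    return result


# A well-formed result passes through unchanged.
ok = validate_judgement({"accuracy": 3, "style": 2, "evaluation": "Clear and correct."})

# An out-of-range score is rejected.
try:
    validate_judgement({"accuracy": 5, "style": 2, "evaluation": "Too generous."})
except ValueError as err:
    print(err)
```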

evaluate_answers

Returns a Dataset augmented with accuracy, style, and evaluation columns. The augmented dataset is also pushed back to the same `-results` repository on the Hugging Face Hub from which it was loaded.

OpenAI API Configuration

Parameter	Value	Purpose
`model`	`"gpt-4o-mini"`	Cost-effective judge model with strong instruction-following and JSON output capabilities
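`response_format`	`{"type": "json_object"}`	Enables JSON mode, which constrains the model to emit syntactically valid JSON. It does not enforce the requested keys, and a response truncated by `max_tokens` can still fail to parse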
`max_tokens`	`1000`	Sufficient for the scoring JSON plus explanation text
`temperature`	`0.9`	Relatively high sampling temperature that allows run-to-run variation in scores and explanations; useful for robustness measurement, at the cost of reproducibility
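Since `max_tokens` can cut a response short, a defensive parser might check `finish_reason` before decoding. The sketch below is an assumption about how this could be handled, not what the source file does; `parse_judge_response` and `fake_response` are hypothetical names, and `fake_response` builds a stand-in object that only mimics the `choices[0].message.content` and `choices[0].finish_reason` attributes of a Chat Completions response.

```python
import json
from types import SimpleNamespace
from typing import Optional


def parse_judge_response(response) -> Optional[dict]:
    """Return the parsed JSON payload, or None if it is truncated or invalid."""
    choice = response.choices[0]
    # "length" means generation stopped at max_tokens, so the JSON may be incomplete.
    if choice.finish_reason == "length":
        return None
    try:
        return json.loads(choice.message.content)
    except (json.JSONDecodeError, TypeError):
        # TypeError covers the case where content is None.
        return None


def fake_response(content, finish_reason="stop"):
    """Stand-in for an API response, used here only for illustration."""
    message = SimpleNamespace(content=content)
    choice = SimpleNamespace(message=message, finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])


complete = fake_response('{"accuracy": 3, "style": 2, "evaluation": "ok"}')
truncated = fake_response('{"accuracy": 3, "style": 2, "evalua', finish_reason="length")

print(parse_judge_response(complete))
print(parse_judge_response(truncated))
```

Returning `None` lets the caller decide whether to retry, skip the row, or record a missing score.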

Threading Architecture

The evaluate_answers function uses Python's ThreadPoolExecutor for parallel evaluation:

Why threads, not processes: The workload is I/O-bound (waiting for OpenAI API responses), not CPU-bound. Python releases the GIL while a thread waits on the network, so other threads can proceed, and threads avoid the startup and serialization overhead of separate processes.
Default 10 threads: Balances throughput against OpenAI API rate limits. Too many threads may trigger rate limiting; too few underutilize available bandwidth.
Progress tracking: tqdm wraps the list of futures, so the progress bar advances as results are collected in submission order, giving visibility into evaluation progress.
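The submit-then-collect pattern described above can be tried without an API key by swapping the judge call for a stub. In this sketch `fake_judge` is a hypothetical stand-in that sleeps briefly to simulate network latency, and the `rows` list imitates dataset rows with the same `instruction` and `answers` keys used in the excerpt.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_judge(instruction: str, answer: str) -> dict:
    """Stand-in for evaluate_answer: waits briefly, then returns fixed scores."""
    time.sleep(0.05)  # simulate waiting on a network response
    return {"accuracy": 3, "style": 2, "evaluation": f"stub for {instruction!r}"}


rows = [{"instruction": f"q{i}", "answers": f"a{i}"} for i in range(20)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [
        executor.submit(fake_judge, instruction=row["instruction"], answer=row["answers"])
        for row in rows
    ]
    # Iterating the futures in submission order keeps results aligned with rows.
    results = [future.result() for future in futures]
elapsed = time.perf_counter() - start

print(f"{len(results)} results in {elapsed:.2f}s")
```

Collecting results by iterating the futures list, instead of using `as_completed`, is what keeps each result aligned with its dataset row, which matters when the scores are later added back as columns.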

External Dependencies

Dependency	Purpose
`openai`	OpenAI Python client for Chat Completions API calls
`concurrent.futures`	Python standard library for thread pool execution
`tqdm`	Progress bar for monitoring evaluation progress
`datasets`	HuggingFace Datasets library for loading and pushing evaluation results
