Implementation:PacktPublishing LLM Engineers Handbook OpenAI Chat Completions
Overview
OpenAI Chat Completions implements Principle:PacktPublishing_LLM_Engineers_Handbook_LLM_As_Judge_Evaluation. It calls the OpenAI Chat Completions API with GPT-4o-mini as the judge model to score generated answers on accuracy and style, and it uses a thread pool to evaluate answers in parallel.
| Aspect | Detail |
|---|---|
| Implementation Name | OpenAI Chat Completions |
| Workflow | Model_Evaluation |
| Type | Wrapper Doc (OpenAI) |
| Source File | llm_engineering/model/evaluation/evaluate.py (Lines 55–165) |
| Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_LLM_As_Judge_Evaluation |
API Signatures
Single Answer Evaluation
def evaluate_answer(instruction: str, answer: str, client: OpenAI) -> dict
Batch Evaluation
def evaluate_answers(model_id: str, num_threads: int = 10, batch_size: int = 5) -> Dataset
Internally calls:
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    response_format={"type": "json_object"},
    max_tokens=1000,
    temperature=0.9,
)
Key Code
Single Answer Evaluation
def evaluate_answer(instruction: str, answer: str, client: OpenAI) -> dict:
    message = f"""Score the following answer on accuracy (1-3) and style (1-3).
Instruction: {instruction}
Answer: {answer}
Return JSON: {{"accuracy": int, "style": int, "evaluation": "explanation"}}"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        max_tokens=1000,
        temperature=0.9,
        messages=[{"role": "user", "content": message}],
    )
    return json.loads(response.choices[0].message.content)
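The prompt asks the judge for integer scores from 1 to 3, but JSON mode only constrains syntax, so a caller may want to check the parsed payload before using it. The following is a minimal sketch and not part of evaluate.py; `validate_judgment` and `REQUIRED_SCORES` are hypothetical names introduced for illustration.

```python
import json

# Hypothetical helper (not in the original source): checks that a judge
# response carries the keys and score range requested in the prompt.
REQUIRED_SCORES = ("accuracy", "style")


def validate_judgment(raw: str) -> dict:
    """Parse the judge's JSON string and verify scores are ints in 1..3."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("judge response is not a JSON object")
    for key in REQUIRED_SCORES:
        value = data.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly
        if not isinstance(value, int) or isinstance(value, bool):
            raise ValueError(f"{key!r} must be an integer, got {value!r}")
        if not 1 <= value <= 3:
            raise ValueError(f"{key!r} must be between 1 and 3, got {value}")
    if not isinstance(data.get("evaluation"), str):
        raise ValueError("'evaluation' must be a string")
    return data


if __name__ == "__main__":
    sample = '{"accuracy": 3, "style": 2, "evaluation": "Correct but verbose."}'
    print(validate_judgment(sample))
```

A check like this could wrap the `json.loads` call so that out-of-range or missing scores fail at the row that produced them instead of surfacing later as a column-building error.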
Batch Evaluation with Threading
def evaluate_answers(model_id: str, num_threads: int = 10, batch_size: int = 5) -> Dataset:
    # Load the results dataset with generated answers
    dataset = load_dataset(
        f"{MODEL_HUGGINGFACE_WORKSPACE}/{model_id.split('/')[-1]}-results",
        split="all"
    )
    client = OpenAI()
    # Use ThreadPoolExecutor for parallel evaluation
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = []
        for row in dataset:
            future = executor.submit(
                evaluate_answer,
                instruction=row["instruction"],
                answer=row["answers"],
                client=client,
            )
            futures.append(future)
        results = [f.result() for f in tqdm(futures, desc="Evaluating")]
    # Extract scores and add to dataset
    accuracies = [r["accuracy"] for r in results]
    styles = [r["style"] for r in results]
    evaluations = [r["evaluation"] for r in results]
    dataset = dataset.add_column("accuracy", accuracies)
    dataset = dataset.add_column("style", styles)
    dataset = dataset.add_column("evaluation", evaluations)
    dataset.push_to_hub(
        f"{MODEL_HUGGINGFACE_WORKSPACE}/{model_id.split('/')[-1]}-results"
    )
    return dataset
Imports
from openai import OpenAI
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm
from datasets import Dataset, load_dataset
import json
Inputs
evaluate_answer
| Parameter | Type | Description |
|---|---|---|
| instruction | str | The original prompt/instruction that was given to the model |
| answer | str | The generated answer from the fine-tuned model to be evaluated |
| client | OpenAI | An initialized OpenAI client instance |
evaluate_answers
| Parameter | Type | Description |
|---|---|---|
| model_id | str | HuggingFace model identifier, used to locate the results dataset on Hub |
| num_threads | int | Number of concurrent threads for parallel evaluation. Default: 10. |
| batch_size | int | Batch size for processing. Default: 5. Not referenced in the code excerpt shown on this page. |
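The way model_id maps to the results dataset can be sketched in isolation. This mirrors the f-string used in evaluate_answers; the function name `results_repo_id` and the workspace and model values in the usage lines are hypothetical, chosen only for illustration.

```python
def results_repo_id(workspace: str, model_id: str) -> str:
    """Build '<workspace>/<model-name>-results' from a model identifier."""
    model_name = model_id.split("/")[-1]  # drop any 'org/' prefix
    return f"{workspace}/{model_name}-results"


if __name__ == "__main__":
    # Hypothetical workspace and model names
    print(results_repo_id("my-workspace", "some-org/MyModel"))
    # → my-workspace/MyModel-results
```

Because only the last path segment of model_id is kept, two models with the same name under different organizations would resolve to the same results dataset.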
Outputs
evaluate_answer
| Key | Type | Description |
|---|---|---|
| accuracy | int | Score from 1 (poor) to 3 (excellent) measuring factual correctness and relevance |
| style | int | Score from 1 (poor) to 3 (excellent) measuring clarity, formatting, and writing quality |
| evaluation | str | Free-text explanation of the scores from the judge model |
evaluate_answers
Returns a Dataset augmented with accuracy, style, and evaluation columns. The dataset is also pushed to HuggingFace Hub.
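Once the accuracy and style columns exist, a per-model summary is a simple average. The sketch below is an assumption about how the scores might be consumed, not code from evaluate.py; `summarize_scores` is a hypothetical name, and the rows are plain dicts so the example runs without the datasets library.

```python
from collections.abc import Iterable
from statistics import mean


def summarize_scores(rows: Iterable[dict]) -> dict:
    """Average the judge's accuracy and style scores across rows."""
    rows = list(rows)  # materialize so the rows can be traversed twice
    return {
        "accuracy": mean(row["accuracy"] for row in rows),
        "style": mean(row["style"] for row in rows),
    }


if __name__ == "__main__":
    # Hypothetical judged rows
    judged = [
        {"accuracy": 3, "style": 2},
        {"accuracy": 2, "style": 2},
        {"accuracy": 1, "style": 3},
    ]
    print(summarize_scores(judged))
```

Note that `statistics.mean` raises `StatisticsError` on an empty input, so an empty results set fails loudly instead of reporting a score of zero.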
OpenAI API Configuration
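| Parameter | Value | Purpose |
|---|---|---|
| model | "gpt-4o-mini" | Cost-effective judge model with strong instruction-following and JSON output capabilities |
| response_format | {"type": "json_object"} | Enables JSON mode, which constrains the output to syntactically valid JSON. It does not enforce the requested keys or score ranges, and the output can still be cut off if max_tokens is reached |
| max_tokens | 1000 | Sufficient for the scoring JSON plus explanation text |
| temperature | 0.9 | A relatively high sampling temperature, so repeated judgments of the same answer can differ. This is useful for measuring robustness, but scores are not deterministic |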
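Since a completion that stops at max_tokens can leave the JSON text incomplete, one option is to inspect finish_reason before parsing. This is a sketch under that assumption and not part of evaluate.py; `parse_judge_response` is a hypothetical name, and the `SimpleNamespace` object only imitates the attribute layout of a chat completion so the example runs without an API key.

```python
import json
from types import SimpleNamespace


def parse_judge_response(response) -> dict:
    """Return the parsed JSON payload, refusing truncated completions."""
    choice = response.choices[0]
    # finish_reason == "length" means generation stopped at max_tokens,
    # so the JSON text may be cut off mid-object.
    if choice.finish_reason == "length":
        raise ValueError("completion hit max_tokens; JSON may be incomplete")
    return json.loads(choice.message.content)


if __name__ == "__main__":
    # Stand-in object with the same attribute layout as a chat completion
    fake = SimpleNamespace(
        choices=[
            SimpleNamespace(
                finish_reason="stop",
                message=SimpleNamespace(
                    content='{"accuracy": 2, "style": 3, "evaluation": "Clear."}'
                ),
            )
        ]
    )
    print(parse_judge_response(fake))
```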
Threading Architecture
The evaluate_answers function uses Python's ThreadPoolExecutor for parallel evaluation:
- Why threads, not processes: The workload is I/O-bound (waiting for OpenAI API responses), not CPU-bound. A thread blocked on network I/O releases the GIL, so other requests proceed in the meantime, and threads avoid the startup and serialization overhead of separate processes.
- Default 10 threads: Balances throughput against OpenAI API rate limits. Too many threads may trigger rate limiting; too few underutilize available bandwidth.
- Progress tracking: tqdm provides a progress bar over the futures, giving visibility into evaluation progress.
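The submit-then-collect pattern described above can be sketched without any network access. In this illustration `fake_judge` is a hypothetical stand-in that sleeps instead of calling the OpenAI API, and `evaluate_in_parallel` is a hypothetical name; the point is that iterating the futures list in submission order keeps results aligned with the input rows.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_judge(instruction: str, answer: str) -> dict:
    """Stand-in for a network call: sleeps instead of contacting an API."""
    time.sleep(0.05)  # simulated I/O wait; the GIL is released while sleeping
    return {"instruction": instruction, "accuracy": 3, "style": 3}


def evaluate_in_parallel(rows: list[dict], num_threads: int = 10) -> list[dict]:
    """Submit one task per row and collect results in submission order."""
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [
            executor.submit(fake_judge, row["instruction"], row["answers"])
            for row in rows
        ]
        # Iterating the futures list (not as_completed) keeps results
        # aligned with the input rows, which matters when they are later
        # attached to the dataset as columns.
        return [f.result() for f in futures]


if __name__ == "__main__":
    rows = [{"instruction": f"q{i}", "answers": f"a{i}"} for i in range(20)]
    start = time.perf_counter()
    results = evaluate_in_parallel(rows, num_threads=10)
    elapsed = time.perf_counter() - start
    print(f"{len(results)} rows in {elapsed:.2f}s")
```

One consequence of collecting in submission order is that `f.result()` re-raises any exception from the worker, so a single failed request aborts the whole collection unless the worker handles its own errors.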
External Dependencies
| Dependency | Purpose |
|---|---|
| openai | OpenAI Python client for Chat Completions API calls |
| concurrent.futures | Python standard library for thread pool execution |
| tqdm | Progress bar for monitoring evaluation progress |
| datasets | HuggingFace Datasets library for loading and pushing evaluation results |
See Also
- Principle:PacktPublishing_LLM_Engineers_Handbook_LLM_As_Judge_Evaluation — the principle this implements
- Implementation:PacktPublishing_LLM_Engineers_Handbook_VLLM_LLM_Generate — the upstream step that produces the answers being judged
- Implementation:PacktPublishing_LLM_Engineers_Handbook_Dataset_Push_To_Hub — the downstream step that aggregates the judge scores
- Environment:PacktPublishing_LLM_Engineers_Handbook_VLLM_Evaluation_Environment
- Environment:PacktPublishing_LLM_Engineers_Handbook_API_Credentials
- Heuristic:PacktPublishing_LLM_Engineers_Handbook_Temperature_Selection_By_Task