Workflow:PacktPublishing LLM Engineers Handbook Model Evaluation

Knowledge Sources	LLM Engineers Handbook vLLM Docs OpenAI API Docs AWS SageMaker Docs
Domains	LLMs, Model_Evaluation, LLM_Ops
Last Updated	2026-02-08 07:45 GMT

Overview

End-to-end process for evaluating fine-tuned LLM Twin models using vLLM for inference and GPT-4o-mini as an automated judge, scoring accuracy and style on the test dataset.

Description

This workflow evaluates the quality of fine-tuned models by comparing them against a baseline. It generates answers from three models (SFT model, DPO model, and Llama-3.1-8B-Instruct as baseline) using vLLM for efficient batch inference on GPU. Each answer is then scored by GPT-4o-mini acting as an LLM judge, evaluating both factual accuracy and writing style on a 1-3 scale. Results are aggregated and pushed to HuggingFace Hub as results datasets. The entire evaluation runs as a SageMaker processing job.

Usage

Execute this workflow after the LLM Finetuning pipeline has produced trained models on HuggingFace Hub. You need AWS SageMaker configured and the OpenAI API key set for GPT-4o-mini judge scoring. This workflow validates whether fine-tuning improved model quality before deploying to production.

Execution Steps

Step 1: SageMaker Processing Job Setup

Configure and launch a SageMaker processing job through ZenML. The job uses a HuggingFaceProcessor with GPU instance, injecting environment variables for HuggingFace, OpenAI API access, and the workspace identifiers for locating models and datasets.

Key considerations:

Instance type is ml.g5.2xlarge for GPU-accelerated inference
Environment variables specify dataset and model HuggingFace workspaces
A dummy mode is available that limits evaluation to 10 samples for testing
The processing job runs the evaluate.py script on SageMaker

Step 2: Model Validation

Verify that the fine-tuned models exist on HuggingFace Hub before attempting evaluation. For each model ID, check accessibility via the HuggingFace API and fall back to public default models if the user's models are not found.

Key considerations:

Three models are evaluated: TwinLlama-3.1-8B (SFT), TwinLlama-3.1-8B-DPO, and Llama-3.1-8B-Instruct (baseline)
Missing models gracefully fall back to mlabonne's public versions
Dataset existence is also validated with fallback behavior

Step 3: Answer Generation

For each model, load it using vLLM and generate answers for all test set instructions using batch inference. Answers are formatted using the Alpaca instruction template and generated with configurable sampling parameters (temperature, top-p, min-p). Results are pushed to HuggingFace Hub as per-model results datasets.

Key considerations:

vLLM provides high-throughput batch inference with efficient GPU memory management
Sampling parameters: temperature=0.8, top_p=0.95, min_p=0.05, max_tokens=2048
Each model's results are uploaded as a separate HuggingFace dataset (model-name-results)
GPU memory is explicitly freed between model evaluations using garbage collection

Step 4: LLM Judge Scoring

Evaluate each generated answer using GPT-4o-mini as an automated judge. The judge scores each answer on two dimensions: accuracy (factual correctness, 1-3 scale) and style (appropriate tone for blog/social media content, 1-3 scale). Evaluation is parallelized using thread pools with configurable concurrency.

Key considerations:

GPT-4o-mini receives structured evaluation prompts with scoring rubrics
Responses are requested in JSON format for reliable parsing
Multi-threaded evaluation with configurable batch size (default: 5) and thread count (default: 10)
Failed evaluations are recorded as None to maintain dataset alignment

Step 5: Results Aggregation and Publishing

Compute aggregate accuracy and style scores across all evaluated samples for each model. Update the results datasets on HuggingFace Hub with evaluation scores, and print a summary comparison of all models.

Key considerations:

Per-model average accuracy and style scores are computed and displayed
Results datasets are updated in-place on HuggingFace Hub with new columns
The summary enables direct comparison between SFT, DPO, and baseline models

Execution Diagram

GitHub URL

Workflow Repository