Workflow:PacktPublishing LLM Engineers Handbook Model Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Evaluation, LLM_Ops |
| Last Updated | 2026-02-08 07:45 GMT |
Overview
End-to-end process for evaluating fine-tuned LLM Twin models using vLLM for inference and GPT-4o-mini as an automated judge, scoring accuracy and style on the test dataset.
Description
This workflow evaluates the quality of fine-tuned models by comparing them against a baseline. It generates answers from three models (SFT model, DPO model, and Llama-3.1-8B-Instruct as baseline) using vLLM for efficient batch inference on GPU. Each answer is then scored by GPT-4o-mini acting as an LLM judge, evaluating both factual accuracy and writing style on a 1-3 scale. Results are aggregated and pushed to HuggingFace Hub as results datasets. The entire evaluation runs as a SageMaker processing job.
Usage
Execute this workflow after the LLM Finetuning pipeline has published the trained models to HuggingFace Hub. You need AWS SageMaker configured and an OpenAI API key set so that GPT-4o-mini can act as the judge. Use it to check whether fine-tuning improved model quality before deploying to production.
Execution Steps
Step 1: SageMaker Processing Job Setup
Configure and launch a SageMaker processing job through ZenML. The job uses a HuggingFaceProcessor on a GPU instance and injects environment variables for HuggingFace access, OpenAI API access, and the workspace identifiers used to locate models and datasets.
Key considerations:
- Instance type is ml.g5.2xlarge for GPU-accelerated inference
- Environment variables specify dataset and model HuggingFace workspaces
- A dummy mode is available that limits evaluation to 10 samples for testing
- The processing job runs the evaluate.py script on SageMaker
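The environment injection and the dummy mode can be sketched in plain Python. This is an illustration under stated assumptions, not the workflow's actual code: the variable names (`HF_TOKEN`, `DATASET_WORKSPACE`, `MODEL_WORKSPACE`, `IS_DUMMY`) and both helper functions are hypothetical; only the 10-sample limit comes from the description above.

```python
from typing import Dict, List

DUMMY_SAMPLE_LIMIT = 10  # from the text: dummy mode evaluates only 10 samples


def build_job_env(
    hf_token: str,
    openai_api_key: str,
    dataset_workspace: str,
    model_workspace: str,
    is_dummy: bool = False,
) -> Dict[str, str]:
    """Collect the settings the job needs. All variable names are hypothetical."""
    return {
        "HF_TOKEN": hf_token,
        "OPENAI_API_KEY": openai_api_key,
        "DATASET_WORKSPACE": dataset_workspace,
        "MODEL_WORKSPACE": model_workspace,
        # Environment variables are strings, so the flag is serialized.
        "IS_DUMMY": str(is_dummy),
    }


def select_samples(rows: List[dict], env: Dict[str, str]) -> List[dict]:
    """Inside the job: truncate the test set when dummy mode is on."""
    if env.get("IS_DUMMY") == "True":
        return rows[:DUMMY_SAMPLE_LIMIT]
    return rows
```

A dictionary like this would typically be handed to the processor as its environment, and the evaluation script would read the same keys back when it starts.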
Step 2: Model Validation
Verify that the fine-tuned models exist on HuggingFace Hub before attempting evaluation. For each model ID, check accessibility via the HuggingFace API and fall back to public default models if the user's models are not found.
Key considerations:
- Three models are evaluated: TwinLlama-3.1-8B (SFT), TwinLlama-3.1-8B-DPO, and Llama-3.1-8B-Instruct (baseline)
- Missing models gracefully fall back to mlabonne's public versions
- Dataset existence is also validated with fallback behavior
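The fallback logic can be sketched as follows. The function name `resolve_model_id` and the injected `exists` callable are assumptions made for illustration; in a real run, `exists` could plausibly wrap `huggingface_hub.HfApi().model_info(repo_id)` and treat a raised exception as "not found". Injecting the check keeps the sketch runnable without network access.

```python
from typing import Callable

FALLBACK_WORKSPACE = "mlabonne"  # public versions named in the text


def resolve_model_id(
    workspace: str,
    model_name: str,
    exists: Callable[[str], bool],
) -> str:
    """Return the user's model ID if reachable, else the public fallback ID."""
    candidate = f"{workspace}/{model_name}"
    if exists(candidate):
        return candidate
    # The user's model is missing: fall back instead of failing the job.
    return f"{FALLBACK_WORKSPACE}/{model_name}"


# Usage with a stand-in for the Hub lookup ("my-user" is a hypothetical workspace).
published = {"my-user/TwinLlama-3.1-8B"}
sft_id = resolve_model_id("my-user", "TwinLlama-3.1-8B", published.__contains__)
dpo_id = resolve_model_id("my-user", "TwinLlama-3.1-8B-DPO", published.__contains__)
```

Here the SFT model resolves to the user's own repository, while the unpublished DPO model resolves to the public fallback.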
Step 3: Answer Generation
For each model, load it using vLLM and generate answers for all test set instructions using batch inference. Answers are formatted using the Alpaca instruction template and generated with configurable sampling parameters (temperature, top-p, min-p). Results are pushed to HuggingFace Hub as per-model results datasets.
Key considerations:
- vLLM provides high-throughput batch inference with efficient GPU memory management
- Sampling parameters: temperature=0.8, top_p=0.95, min_p=0.05, max_tokens=2048
- Each model's results are uploaded as a separate HuggingFace dataset (model-name-results)
- GPU memory is explicitly freed between model evaluations using garbage collection
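A minimal sketch of the generation step is shown below. The prompt text is the commonly used Alpaca template and may differ in wording from the one in the workflow's script; `format_prompt` and `generate_answers` are hypothetical names. The vLLM import is deferred into the function so the formatting code can be read and run on a machine without a GPU.

```python
import gc
from typing import List

# Common Alpaca-style template (assumed wording, not copied from the workflow).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

# Sampling values listed in the key considerations above.
SAMPLING_KWARGS = {"temperature": 0.8, "top_p": 0.95, "min_p": 0.05, "max_tokens": 2048}


def format_prompt(instruction: str) -> str:
    """Wrap a raw instruction in the Alpaca template."""
    return ALPACA_TEMPLATE.format(instruction=instruction)


def generate_answers(model_id: str, instructions: List[str]) -> List[str]:
    """Batch-generate one answer per instruction (requires vLLM and a GPU)."""
    from vllm import LLM, SamplingParams  # imported lazily on purpose

    llm = LLM(model=model_id)
    params = SamplingParams(**SAMPLING_KWARGS)
    outputs = llm.generate([format_prompt(i) for i in instructions], params)
    answers = [out.outputs[0].text for out in outputs]

    # Drop the model reference and collect garbage before the next model loads.
    del llm
    gc.collect()
    return answers
```

Passing the whole prompt list to a single `generate` call is what lets vLLM schedule the batch itself rather than processing one prompt at a time.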
Step 4: LLM Judge Scoring
Evaluate each generated answer using GPT-4o-mini as an automated judge. The judge scores each answer on two dimensions: accuracy (factual correctness, 1-3 scale) and style (appropriate tone for blog/social media content, 1-3 scale). Evaluation is parallelized using thread pools with configurable concurrency.
Key considerations:
- GPT-4o-mini receives structured evaluation prompts with scoring rubrics
- Responses are requested in JSON format for reliable parsing
- Multi-threaded evaluation with configurable batch size (default: 5) and thread count (default: 10)
- Failed evaluations are recorded as None to maintain dataset alignment
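The parallel scoring with `None` placeholders can be sketched with the standard library alone. This is an illustration, not the workflow's implementation: the `judge` callable stands in for the GPT-4o-mini request, and the flat JSON shape `{"accuracy": n, "style": n}` is an assumed schema.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

Score = Optional[Dict[str, int]]


def parse_judgement(raw: str) -> Score:
    """Parse the judge's JSON reply; return None if malformed or out of range."""
    try:
        data = json.loads(raw)
        scores = {"accuracy": int(data["accuracy"]), "style": int(data["style"])}
    except (ValueError, KeyError, TypeError):
        return None
    return scores if all(1 <= v <= 3 for v in scores.values()) else None


def evaluate_all(
    pairs: List[Tuple[str, str]],
    judge: Callable[[str, str], str],
    batch_size: int = 5,
    num_threads: int = 10,
) -> List[Score]:
    """Score (instruction, answer) pairs in threaded batches, preserving order."""

    def run_batch(batch: List[Tuple[str, str]]) -> List[Score]:
        results: List[Score] = []
        for instruction, answer in batch:
            try:
                results.append(parse_judgement(judge(instruction, answer)))
            except Exception:
                results.append(None)  # keep row alignment on failure
        return results

    batches = [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() yields results in submission order, so rows stay aligned.
        return [score for batch in pool.map(run_batch, batches) for score in batch]
```

Because every input pair yields exactly one output entry, the scores can be attached to the results dataset as new columns without any re-indexing.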
Step 5: Results Aggregation and Publishing
Compute aggregate accuracy and style scores across all evaluated samples for each model. Update the results datasets on HuggingFace Hub with evaluation scores, and print a summary comparison of all models.
Key considerations:
- Per-model average accuracy and style scores are computed and displayed
- Results datasets are updated in-place on HuggingFace Hub with new columns
- The summary enables direct comparison between SFT, DPO, and baseline models
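The aggregation can be sketched as below, assuming each model's scores arrive as a list of `{"accuracy": n, "style": n}` dictionaries with `None` for failed evaluations. The function names and the input shape are assumptions for illustration; skipping `None` rows when averaging is one reasonable choice, not necessarily the workflow's.

```python
from typing import Dict, List, Optional


def mean_score(rows: List[Optional[Dict[str, int]]], key: str) -> Optional[float]:
    """Average one score dimension, skipping failed (None) evaluations."""
    valid = [row[key] for row in rows if row is not None]
    return sum(valid) / len(valid) if valid else None


def summarize(
    results: Dict[str, List[Optional[Dict[str, int]]]]
) -> Dict[str, Dict[str, Optional[float]]]:
    """Compute per-model average accuracy and style."""
    return {
        model: {
            "accuracy": mean_score(rows, "accuracy"),
            "style": mean_score(rows, "style"),
        }
        for model, rows in results.items()
    }


def print_summary(summary: Dict[str, Dict[str, Optional[float]]]) -> None:
    """Print one comparison line per model."""
    for model, scores in summary.items():
        print(f"{model}: accuracy={scores['accuracy']}, style={scores['style']}")
```

Returning `None` for a model with no valid scores avoids a division by zero and makes a fully failed evaluation visible in the summary.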