Environment:PacktPublishing LLM Engineers Handbook VLLM Evaluation Environment
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, LLMs, Evaluation |
| Last Updated | 2026-02-08 08:00 GMT |
Overview
GPU-accelerated evaluation environment with vLLM for batch inference and OpenAI API for LLM-as-judge scoring, running inside SageMaker processing jobs.
Description
This environment provides the model evaluation stack running inside a SageMaker processing container. It uses vLLM for high-throughput batch inference generation across multiple model variants, and the OpenAI API (GPT-4o-mini) for LLM-as-judge evaluation scoring. The evaluation pipeline generates answers from fine-tuned models and a baseline (Llama 3.1 8B Instruct), then scores each answer on accuracy and style dimensions using a structured JSON response format.
Usage
Use this environment exclusively for the Model Evaluation workflow. It runs inside a SageMaker `ml.g5.2xlarge` processing job and is installed via a separate `requirements.txt`. The environment handles loading models with vLLM, generating batch predictions, calling the OpenAI API for judge evaluations, and pushing results to the HuggingFace Hub.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA GPU with CUDA support | Minimum 24GB VRAM for 8B model inference |
| Runtime | SageMaker Processing Container | PyTorch 2.1 base, Python 3.10 |
| Network | Internet access | Required for OpenAI API calls and HuggingFace Hub |
| Disk | ~20GB | Model weights for multiple model variants |
Dependencies
Python Packages (requirements.txt)
- `transformers` = 4.43.3
- `datasets` = 2.20.0
- `vllm` = 0.6.1.post2
- `tqdm` = 4.66.4
- `openai` = 1.55.3
Credentials
The following environment variables are injected into the SageMaker processing container:
- `OPENAI_API_KEY`: OpenAI API key for GPT-4o-mini judge evaluations
- `DATASET_HUGGINGFACE_WORKSPACE`: HuggingFace workspace for dataset access
- `MODEL_HUGGINGFACE_WORKSPACE`: HuggingFace workspace for model access
- `HUGGING_FACE_HUB_TOKEN`: HuggingFace token for model/dataset downloads and uploads
- `IS_DUMMY`: Optional flag for testing with reduced samples (default: `False`)
Quick Install
# These packages are installed automatically inside the SageMaker container.
# For local testing (requires CUDA GPU):
pip install transformers==4.43.3 datasets==2.20.0 vllm==0.6.1.post2 \
tqdm==4.66.4 openai==1.55.3
Code Evidence
Credential assertions from `llm_engineering/model/evaluation/sagemaker.py:18-20`:
assert settings.HUGGINGFACE_ACCESS_TOKEN, "Hugging Face access token is required."
assert settings.OPENAI_API_KEY, "OpenAI API key is required."
assert settings.AWS_ARN_ROLE, "AWS ARN role is required."
Environment variable access from `llm_engineering/model/evaluation/evaluate.py:13-16`:
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
DATASET_HUGGINGFACE_WORKSPACE = os.environ["DATASET_HUGGINGFACE_WORKSPACE"]
MODEL_HUGGINGFACE_WORKSPACE = os.environ["MODEL_HUGGINGFACE_WORKSPACE"]
IS_DUMMY = os.environ.get("IS_DUMMY", False)
vLLM model loading from `llm_engineering/model/evaluation/evaluate.py:42`:
llm = LLM(model=model_id, max_model_len=2048)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, min_p=0.05, max_tokens=2048)
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `CUDA out of memory` loading vLLM model | Insufficient GPU VRAM | Reduce `max_model_len` or use larger instance |
| `KeyError: 'OPENAI_API_KEY'` | OpenAI key not injected into container | Verify `env` dict in `sagemaker.py` processing job config |
| `openai.RateLimitError` | Too many concurrent judge API calls | Reduce `num_threads` parameter (default: 10) |
Compatibility Notes
- vLLM: Requires CUDA-capable NVIDIA GPU. Not compatible with AMD or Intel GPUs.
- Model Size: The default evaluation loads 8B parameter models. Larger models require larger SageMaker instances.
- Dummy Mode: Set `IS_DUMMY=True` to run evaluation on only 10 samples for testing.