Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Environment:PacktPublishing LLM Engineers Handbook VLLM Evaluation Environment

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, LLMs, Evaluation
Last Updated 2026-02-08 08:00 GMT

Overview

GPU-accelerated evaluation environment with vLLM for batch inference and OpenAI API for LLM-as-judge scoring, running inside SageMaker processing jobs.

Description

This environment provides the model evaluation stack running inside a SageMaker processing container. It uses vLLM for high-throughput batch inference generation across multiple model variants, and the OpenAI API (GPT-4o-mini) for LLM-as-judge evaluation scoring. The evaluation pipeline generates answers from fine-tuned models and a baseline (Llama 3.1 8B Instruct), then scores each answer on accuracy and style dimensions using a structured JSON response format.

Usage

Use this environment exclusively for the Model Evaluation workflow. It runs inside a SageMaker `ml.g5.2xlarge` processing job and is installed via a separate `requirements.txt`. The environment handles loading models with vLLM, generating batch predictions, calling the OpenAI API for judge evaluations, and pushing results to the HuggingFace Hub.

System Requirements

Category Requirement Notes
Hardware NVIDIA GPU with CUDA support Minimum 24GB VRAM for 8B model inference
Runtime SageMaker Processing Container PyTorch 2.1 base, Python 3.10
Network Internet access Required for OpenAI API calls and HuggingFace Hub
Disk ~20GB Model weights for multiple model variants

Dependencies

Python Packages (requirements.txt)

  • `transformers` = 4.43.3
  • `datasets` = 2.20.0
  • `vllm` = 0.6.1.post2
  • `tqdm` = 4.66.4
  • `openai` = 1.55.3

Credentials

The following environment variables are injected into the SageMaker processing container:

  • `OPENAI_API_KEY`: OpenAI API key for GPT-4o-mini judge evaluations
  • `DATASET_HUGGINGFACE_WORKSPACE`: HuggingFace workspace for dataset access
  • `MODEL_HUGGINGFACE_WORKSPACE`: HuggingFace workspace for model access
  • `HUGGING_FACE_HUB_TOKEN`: HuggingFace token for model/dataset downloads and uploads
  • `IS_DUMMY`: Optional flag for testing with reduced samples (default: `False`)

Quick Install

# These packages are installed automatically inside the SageMaker container.
# For local testing (requires CUDA GPU):
pip install transformers==4.43.3 datasets==2.20.0 vllm==0.6.1.post2 \
    tqdm==4.66.4 openai==1.55.3

Code Evidence

Credential assertions from `llm_engineering/model/evaluation/sagemaker.py:18-20`:

assert settings.HUGGINGFACE_ACCESS_TOKEN, "Hugging Face access token is required."
assert settings.OPENAI_API_KEY, "OpenAI API key is required."
assert settings.AWS_ARN_ROLE, "AWS ARN role is required."

Environment variable access from `llm_engineering/model/evaluation/evaluate.py:13-16`:

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
DATASET_HUGGINGFACE_WORKSPACE = os.environ["DATASET_HUGGINGFACE_WORKSPACE"]
MODEL_HUGGINGFACE_WORKSPACE = os.environ["MODEL_HUGGINGFACE_WORKSPACE"]
IS_DUMMY = os.environ.get("IS_DUMMY", False)

vLLM model loading from `llm_engineering/model/evaluation/evaluate.py:42`:

llm = LLM(model=model_id, max_model_len=2048)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, min_p=0.05, max_tokens=2048)

Common Errors

Error Message Cause Solution
`CUDA out of memory` loading vLLM model Insufficient GPU VRAM Reduce `max_model_len` or use larger instance
`KeyError: 'OPENAI_API_KEY'` OpenAI key not injected into container Verify `env` dict in `sagemaker.py` processing job config
`openai.RateLimitError` Too many concurrent judge API calls Reduce `num_threads` parameter (default: 10)

Compatibility Notes

  • vLLM: Requires CUDA-capable NVIDIA GPU. Not compatible with AMD or Intel GPUs.
  • Model Size: The default evaluation loads 8B parameter models. Larger models require larger SageMaker instances.
  • Dummy Mode: Set `IS_DUMMY=True` to run evaluation on only 10 samples for testing.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment