Principle: Microsoft DeepSpeedExamples Model Evaluation And Export
Sources
- Blog: DeepSpeed-Chat
- GitHub: DeepSpeed-Chat
Domains
- NLP
- Evaluation
- Model_Serving
Overview
A methodology for evaluating fine-tuned language models by comparing baseline and fine-tuned responses on standardized prompts.
Description
After any stage of the RLHF training pipeline (supervised fine-tuning, reward model training, or PPO-based alignment), the resulting model must be evaluated qualitatively to determine whether the fine-tuning improved response quality. The evaluation methodology follows a side-by-side comparison approach:
- Load two models — a baseline model (pre-training or earlier checkpoint) and the fine-tuned model.
- Define standardized evaluation prompts — a fixed set of diverse prompts covering factual knowledge, creative writing, explanation, and reasoning.
- Generate responses from both models using identical decoding parameters (e.g., greedy decoding).
- Print side-by-side comparisons — for each prompt, display the baseline response and the fine-tuned response for human inspection.
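The four steps above can be sketched as a small harness. This is a minimal illustration, not DeepSpeed-Chat's actual evaluation script: the function name `side_by_side` and the callable-based interface are assumptions, chosen so the two models' generation calls are interchangeable.

```python
from typing import Callable, Iterable, List, Tuple

def side_by_side(
    prompts: Iterable[str],
    generate_baseline: Callable[[str], str],
    generate_finetuned: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Generate responses from both models on a fixed prompt set and
    print them side by side for human inspection.

    Both generate_* callables should wrap model.generate() with identical
    decoding parameters, so any difference reflects the models rather
    than the sampling configuration.
    """
    rows = []
    for prompt in prompts:
        rows.append((prompt, generate_baseline(prompt), generate_finetuned(prompt)))
    for prompt, base, tuned in rows:
        print(f"Prompt:    {prompt}")
        print(f"Baseline:  {base}")
        print(f"Finetuned: {tuned}")
        print("-" * 60)
    return rows
```

In practice each callable would tokenize the prompt, call `model.generate()` on the corresponding checkpoint, and decode the result; injecting them as functions keeps the comparison logic independent of the model-loading code.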
This approach reveals several important quality dimensions:
- Instruction following — Does the fine-tuned model respond to the prompt as instructed?
- Helpfulness — Are the fine-tuned responses more informative and useful?
- Safety — Does the fine-tuned model avoid harmful or inappropriate content?
- Fluency — Are responses grammatically correct and coherent?
- Factual accuracy — Does fine-tuning introduce or correct factual errors?
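A lightweight record for scoring these five dimensions can make the human inspection repeatable across training runs. The field names and the 1–5 scale below are assumptions for illustration, not part of DeepSpeed-Chat:

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    """One human judgment for a single prompt's fine-tuned response.

    Scores are illustrative 1-5 ratings on the quality dimensions
    listed above; this rubric is an assumption, not DeepSpeed-Chat's.
    """
    prompt: str
    instruction_following: int
    helpfulness: int
    safety: int
    fluency: int
    factual_accuracy: int
    notes: str = ""

    def mean_score(self) -> float:
        scores = (self.instruction_following, self.helpfulness,
                  self.safety, self.fluency, self.factual_accuracy)
        return sum(scores) / len(scores)
```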
Decoding Strategies
Multiple decoding strategies can be employed during evaluation to test different aspects of model behavior:
| Strategy | Description | Use Case |
|---|---|---|
| Greedy decoding | Selects the highest-probability token at each step | Deterministic baseline; tests the model's most confident outputs |
| Beam search | Explores multiple candidate sequences in parallel | Balances quality with diversity |
| Multinomial sampling | Samples from the probability distribution | Tests generation diversity and creativity |
| Contrastive search | Penalizes repetitive tokens using a degeneration penalty | Reduces repetitive outputs while maintaining coherence |
| Diverse beam search | Groups beams and penalizes within-group similarity | Generates diverse high-quality candidates |
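The strategies in the table map directly onto keyword arguments of Hugging Face transformers' `model.generate()`. The sketch below shows one plausible configuration per strategy; the specific hyperparameter values (beam counts, `penalty_alpha`, `top_p`, the token budget) are illustrative choices, not values prescribed by DeepSpeed-Chat:

```python
# Each entry maps a strategy name to kwargs accepted by
# Hugging Face transformers' model.generate().
DECODING_CONFIGS = {
    "greedy": dict(do_sample=False, num_beams=1),
    "beam_search": dict(do_sample=False, num_beams=4),
    "multinomial_sampling": dict(do_sample=True, top_p=0.9, temperature=0.8),
    # Contrastive search is triggered by penalty_alpha together with top_k.
    "contrastive_search": dict(penalty_alpha=0.6, top_k=4),
    # Diverse beam search partitions beams into groups and penalizes
    # within-group similarity via diversity_penalty.
    "diverse_beam_search": dict(
        num_beams=4, num_beam_groups=4, diversity_penalty=1.0, do_sample=False
    ),
}

def generation_kwargs(strategy: str, max_new_tokens: int = 128) -> dict:
    """Return generate() kwargs for one strategy plus a shared token budget."""
    return {**DECODING_CONFIGS[strategy], "max_new_tokens": max_new_tokens}
```

Keeping the configurations in one table-like dict makes it easy to sweep all strategies over the same prompt set with everything but the decoding held fixed.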
Multilingual Support
Evaluation prompts should be available in multiple languages to test the model's cross-lingual alignment. The DeepSpeed-Chat evaluation supports English, Chinese, and Japanese prompt sets.
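One simple way to organize cross-lingual prompt sets is a per-language mapping. The strings below are placeholders illustrating the pattern; they are not the actual DeepSpeed-Chat prompt files:

```python
# Minimal illustrative prompt sets keyed by language code; the real
# DeepSpeed-Chat prompts differ -- these are placeholders only.
PROMPT_SETS = {
    "en": ["Explain what artificial intelligence is in simple terms:"],
    "zh": ["用简单的语言解释什么是人工智能："],
    "ja": ["人工知能とは何かを簡単に説明してください："],
}

def prompts_for(languages):
    """Flatten the prompt sets for the requested languages, in order."""
    return [p for lang in languages for p in PROMPT_SETS[lang]]
```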
Usage
Use this evaluation methodology after completing any training step in the RLHF pipeline:
- After Step 1 (SFT): Compare the base pre-trained model against the supervised fine-tuned model to verify instruction-following capability.
- After Step 2 (Reward Model): The reward model is not evaluated via generation but via ranking accuracy. However, the SFT model used as the reward model backbone can be evaluated.
- After Step 3 (RLHF/PPO): Compare the SFT model against the RLHF-trained model to verify that alignment training improved response quality beyond supervised fine-tuning alone.
Best Practices
- Use a fixed prompt set across evaluations to enable consistent comparison over time.
- Include diverse prompt types: factual questions, creative tasks, reasoning problems, and potentially sensitive topics.
- Evaluate with greedy decoding first as a deterministic baseline, then test with sampling-based strategies.
- Mind prompt formatting: end prompts with a colon or another clear delimiter so the model has an unambiguous continuation point. Prompts ending with a trailing space can cause pre-trained (non-fine-tuned) models to produce no output.
- Document observations for each prompt pair to track improvement trends across training runs.
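The prompt-formatting practice above can be enforced with a pre-flight check before running an evaluation. This helper is a sketch; the accepted delimiter set is an assumption and can be adjusted per prompt language:

```python
# Acceptable prompt endings (an illustrative assumption; extend as needed
# for other languages or formats).
DELIMITERS = (":", "：", "?", "？")

def check_prompt_formatting(prompts):
    """Return prompts that end in whitespace or lack a terminal delimiter.

    Prompts ending with a trailing space can make a non-fine-tuned base
    model produce empty output, so flag them before evaluation.
    """
    flagged = []
    for p in prompts:
        if p != p.rstrip() or not p.rstrip().endswith(DELIMITERS):
            flagged.append(p)
    return flagged
```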