Principle: Microsoft DeepSpeedExamples Model Evaluation And Export
Sources
- Blog: DeepSpeed-Chat
- GitHub: DeepSpeed-Chat
Domains
- NLP
- Evaluation
- Model_Serving
Overview
A methodology for evaluating fine-tuned language models by comparing baseline and fine-tuned responses on standardized prompts.
Description
After any stage of the RLHF training pipeline (supervised fine-tuning, reward model training, or PPO-based alignment), the resulting model must be evaluated qualitatively to determine whether the fine-tuning improved response quality. The evaluation methodology follows a side-by-side comparison approach:
- Load two models — a baseline model (pre-training or earlier checkpoint) and the fine-tuned model.
- Define standardized evaluation prompts — a fixed set of diverse prompts covering factual knowledge, creative writing, explanation, and reasoning.
- Generate responses from both models using identical decoding parameters (e.g., greedy decoding).
- Print side-by-side comparisons — for each prompt, display the baseline response and the fine-tuned response for human inspection.
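The four steps above can be sketched as a small harness. This is a minimal illustration, not DeepSpeed-Chat's actual evaluation script: the function name `side_by_side` and the callable-based interface are assumptions, chosen so the two models' generation calls are interchangeable.

```python
from typing import Callable, Iterable, List, Tuple

def side_by_side(
    prompts: Iterable[str],
    generate_baseline: Callable[[str], str],
    generate_finetuned: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Generate responses from both models on a fixed prompt set and
    print them side by side for human inspection.

    Both generate_* callables should wrap model.generate() with identical
    decoding parameters, so any difference reflects the models rather
    than the sampling configuration.
    """
    rows = []
    for prompt in prompts:
        rows.append((prompt, generate_baseline(prompt), generate_finetuned(prompt)))
    for prompt, base, tuned in rows:
        print(f"Prompt:    {prompt}")
        print(f"Baseline:  {base}")
        print(f"Finetuned: {tuned}")
        print("-" * 60)
    return rows
```

In practice each callable would tokenize the prompt, call `model.generate()` on the corresponding checkpoint, and decode the result; injecting them as functions keeps the comparison logic independent of the model-loading code.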
This approach reveals several important quality dimensions:
- Instruction following — Does the fine-tuned model respond to the prompt as instructed?
- Helpfulness — Are the fine-tuned responses more informative and useful?
- Safety — Does the fine-tuned model avoid harmful or inappropriate content?
- Fluency — Are responses grammatically correct and coherent?
- Factual accuracy — Does fine-tuning introduce or correct factual errors?
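A lightweight record for scoring these five dimensions can make the human inspection repeatable across training runs. The field names and the 1–5 scale below are assumptions for illustration, not part of DeepSpeed-Chat:

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    """One human judgment for a single prompt's fine-tuned response.

    Scores are illustrative 1-5 ratings on the quality dimensions
    listed above; this rubric is an assumption, not DeepSpeed-Chat's.
    """
    prompt: str
    instruction_following: int
    helpfulness: int
    safety: int
    fluency: int
    factual_accuracy: int
    notes: str = ""

    def mean_score(self) -> float:
        scores = (self.instruction_following, self.helpfulness,
                  self.safety, self.fluency, self.factual_accuracy)
        return sum(scores) / len(scores)
```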
Decoding Strategies
Multiple decoding strategies can be employed during evaluation to test different aspects of model behavior:
| Strategy | Description | Use Case |
|---|---|---|
| Greedy decoding | Selects the highest-probability token at each step | Deterministic baseline; tests the model's most confident outputs |
| Beam search | Explores multiple candidate sequences in parallel | Balances quality with diversity |
| Multinomial sampling | Samples from the probability distribution | Tests generation diversity and creativity |
| Contrastive search | Penalizes repetitive tokens using a degeneration penalty | Reduces repetitive outputs while maintaining coherence |
| Diverse beam search | Groups beams and penalizes within-group similarity | Generates diverse high-quality candidates |
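The strategies in the table map directly onto keyword arguments of Hugging Face transformers' `model.generate()`. The sketch below shows one plausible configuration per strategy; the specific hyperparameter values (beam counts, `penalty_alpha`, `top_p`, the token budget) are illustrative choices, not values prescribed by DeepSpeed-Chat:

```python
# Each entry maps a strategy name to kwargs accepted by
# Hugging Face transformers' model.generate().
DECODING_CONFIGS = {
    "greedy": dict(do_sample=False, num_beams=1),
    "beam_search": dict(do_sample=False, num_beams=4),
    "multinomial_sampling": dict(do_sample=True, top_p=0.9, temperature=0.8),
    # Contrastive search is triggered by penalty_alpha together with top_k.
    "contrastive_search": dict(penalty_alpha=0.6, top_k=4),
    # Diverse beam search partitions beams into groups and penalizes
    # within-group similarity via diversity_penalty.
    "diverse_beam_search": dict(
        num_beams=4, num_beam_groups=4, diversity_penalty=1.0, do_sample=False
    ),
}

def generation_kwargs(strategy: str, max_new_tokens: int = 128) -> dict:
    """Return generate() kwargs for one strategy plus a shared token budget."""
    return {**DECODING_CONFIGS[strategy], "max_new_tokens": max_new_tokens}
```

Keeping the configurations in one table-like dict makes it easy to sweep all strategies over the same prompt set with everything but the decoding held fixed.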
Multilingual Support
Evaluation prompts should be available in multiple languages to test the model's cross-lingual alignment. The DeepSpeed-Chat evaluation supports English, Chinese, and Japanese prompt sets.
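One simple way to organize cross-lingual prompt sets is a per-language mapping. The strings below are placeholders illustrating the pattern; they are not the actual DeepSpeed-Chat prompt files:

```python
# Minimal illustrative prompt sets keyed by language code; the real
# DeepSpeed-Chat prompts differ -- these are placeholders only.
PROMPT_SETS = {
    "en": ["Explain what artificial intelligence is in simple terms:"],
    "zh": ["用简单的语言解释什么是人工智能："],
    "ja": ["人工知能とは何かを簡単に説明してください："],
}

def prompts_for(languages):
    """Flatten the prompt sets for the requested languages, in order."""
    return [p for lang in languages for p in PROMPT_SETS[lang]]
```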
Usage
Use this evaluation methodology after completing any training step in the RLHF pipeline:
- After Step 1 (SFT): Compare the base pre-trained model against the supervised fine-tuned model to verify instruction-following capability.
- After Step 2 (Reward Model): The reward model is not evaluated via generation but via ranking accuracy. However, the SFT model used as the reward model backbone can be evaluated.
- After Step 3 (RLHF/PPO): Compare the SFT model against the RLHF-trained model to verify that alignment training improved response quality beyond supervised fine-tuning alone.
Best Practices
- Use a fixed prompt set across evaluations to enable consistent comparison over time.
- Include diverse prompt types: factual questions, creative tasks, reasoning problems, and potentially sensitive topics.
- Evaluate with greedy decoding first as a deterministic baseline, then test with sampling-based strategies.
- Mind prompt formatting: end prompts with a colon or another clear delimiter so the model has an unambiguous continuation point. Prompts ending with a trailing space can cause pre-trained (non-fine-tuned) models to produce no output.
- Document observations for each prompt pair to track improvement trends across training runs.
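The prompt-formatting practice above can be enforced with a pre-flight check before running an evaluation. This helper is a sketch; the accepted delimiter set is an assumption and can be adjusted per prompt language:

```python
# Acceptable prompt endings (an illustrative assumption; extend as needed
# for other languages or formats).
DELIMITERS = (":", "：", "?", "？")

def check_prompt_formatting(prompts):
    """Return prompts that end in whitespace or lack a terminal delimiter.

    Prompts ending with a trailing space can make a non-fine-tuned base
    model produce empty output, so flag them before evaluation.
    """
    flagged = []
    for p in prompts:
        if p != p.rstrip() or not p.rstrip().endswith(DELIMITERS):
            flagged.append(p)
    return flagged
```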