
Principle:Microsoft DeepSpeedExamples Model Evaluation And Export

From Leeroopedia


Domains

  • NLP
  • Evaluation
  • Model_Serving

Overview

A methodology for evaluating fine-tuned language models by comparing baseline and fine-tuned responses on standardized prompts.

Description

After any stage of the RLHF training pipeline (supervised fine-tuning, reward model training, or PPO-based alignment), the resulting model must be evaluated qualitatively to determine whether the fine-tuning improved response quality. The evaluation methodology follows a side-by-side comparison approach:

  1. Load two models — a baseline model (the pre-trained model or an earlier checkpoint) and the fine-tuned model.
  2. Define standardized evaluation prompts — a fixed set of diverse prompts covering factual knowledge, creative writing, explanation, and reasoning.
  3. Generate responses from both models using identical decoding parameters (e.g., greedy decoding).
  4. Print side-by-side comparisons — for each prompt, display the baseline response and the fine-tuned response for human inspection.
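The four steps above can be sketched as a minimal harness. The prompt strings and the stub generator functions below are hypothetical stand-ins; in practice each generator would wrap a real model's decoding call with identical parameters (e.g. greedy decoding) for both models.

```python
# Minimal side-by-side evaluation harness (sketch).
# Each generate function stands in for a real model's decoding call.

def evaluate_side_by_side(prompts, baseline_generate, finetuned_generate):
    """Generate responses from both models on the same prompts and pair them."""
    results = []
    for prompt in prompts:
        results.append({
            "prompt": prompt,
            "baseline": baseline_generate(prompt),
            "finetuned": finetuned_generate(prompt),
        })
    return results

def print_comparisons(results):
    """Display each prompt with both responses for human inspection."""
    for r in results:
        print(f"Prompt:    {r['prompt']}")
        print(f"Baseline:  {r['baseline']}")
        print(f"Finetuned: {r['finetuned']}")
        print("-" * 40)

# A fixed, diverse prompt set (illustrative examples).
EVAL_PROMPTS = [
    "Explain why the sky is blue:",
    "Write a short poem about autumn:",
    "If a train travels 60 km in 45 minutes, what is its average speed in km/h?",
]

if __name__ == "__main__":
    # Stub generators standing in for real model calls.
    baseline = lambda p: f"[baseline response to: {p}]"
    finetuned = lambda p: f"[finetuned response to: {p}]"
    print_comparisons(evaluate_side_by_side(EVAL_PROMPTS, baseline, finetuned))
```

Because both models see the same prompts and decoding parameters, any difference in the printed pairs can be attributed to fine-tuning rather than to generation settings.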

This approach reveals several important quality dimensions:

  • Instruction following — Does the fine-tuned model respond to the prompt as instructed?
  • Helpfulness — Are the fine-tuned responses more informative and useful?
  • Safety — Does the fine-tuned model avoid harmful or inappropriate content?
  • Fluency — Are responses grammatically correct and coherent?
  • Factual accuracy — Does fine-tuning introduce or correct factual errors?

Decoding Strategies

Multiple decoding strategies can be employed during evaluation to test different aspects of model behavior:

  • Greedy decoding: selects the highest-probability token at each step. Use case: deterministic baseline; tests the model's most confident outputs.
  • Beam search: explores multiple candidate sequences in parallel. Use case: balances quality with diversity.
  • Multinomial sampling: samples from the probability distribution. Use case: tests generation diversity and creativity.
  • Contrastive search: penalizes repetitive tokens using a degeneration penalty. Use case: reduces repetitive outputs while maintaining coherence.
  • Diverse beam search: groups beams and penalizes within-group similarity. Use case: generates diverse high-quality candidates.
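These strategies map onto Hugging Face transformers `generate()` keyword arguments roughly as sketched below. The parameter names are real `generate()` options, but the specific values (beam counts, penalties, temperatures) are illustrative choices, not DeepSpeed-Chat defaults.

```python
# Illustrative generate() kwargs for each decoding strategy
# (Hugging Face transformers parameter names; values are examples only).
DECODING_CONFIGS = {
    "greedy": {"do_sample": False, "num_beams": 1},
    "beam_search": {"do_sample": False, "num_beams": 4},
    "multinomial_sampling": {"do_sample": True, "top_p": 0.9, "temperature": 0.8},
    "contrastive_search": {"penalty_alpha": 0.6, "top_k": 4},
    "diverse_beam_search": {
        "do_sample": False,
        "num_beams": 4,
        "num_beam_groups": 4,
        "diversity_penalty": 1.0,
    },
}

# Usage (assuming `model` and tokenized `inputs` already exist):
# outputs = model.generate(**inputs, max_new_tokens=128,
#                          **DECODING_CONFIGS["contrastive_search"])
```

Running the same prompt set under each configuration exposes different failure modes: greedy decoding surfaces the model's most confident behavior, while the sampling-based configurations probe diversity.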

Multilingual Support

Evaluation prompts should be available in multiple languages to test the model's cross-lingual alignment. The DeepSpeed-Chat evaluation supports English, Chinese, and Japanese prompt sets.
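Parallel prompt sets can be organized per language as sketched below. The example prompts are hypothetical, not the actual DeepSpeed-Chat prompt files.

```python
# Parallel prompt sets for cross-lingual evaluation (illustrative examples;
# each language asks the same question so responses are comparable).
MULTILINGUAL_PROMPTS = {
    "english": ["Explain the theory of relativity:"],
    "chinese": ["请解释相对论:"],
    "japanese": ["相対性理論を説明してください:"],
}

def prompts_for(language):
    """Return the evaluation prompts for a given language."""
    return MULTILINGUAL_PROMPTS[language]
```

Keeping the sets semantically parallel lets the same side-by-side comparison reveal whether alignment quality transfers across languages or degrades outside the dominant training language.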

Usage

Use this evaluation methodology after completing any training step in the RLHF pipeline:

  • After Step 1 (SFT): Compare the base pre-trained model against the supervised fine-tuned model to verify instruction-following capability.
  • After Step 2 (Reward Model): The reward model is not evaluated via generation but via ranking accuracy. However, the SFT model used as the reward model backbone can be evaluated.
  • After Step 3 (RLHF/PPO): Compare the SFT model against the RLHF-trained model to verify that alignment training improved response quality beyond supervised fine-tuning alone.

Best Practices

  • Use a fixed prompt set across evaluations to enable consistent comparison over time.
  • Include diverse prompt types: factual questions, creative tasks, reasoning problems, and potentially sensitive topics.
  • Evaluate with greedy decoding first as a deterministic baseline, then test with sampling-based strategies.
  • Note prompt formatting: prompts should end with a colon or other delimiter to prevent the model from getting stuck. Prompts ending with a space can cause pre-trained (non-fine-tuned) models to produce no output.
  • Document observations for each prompt pair to track improvement trends across training runs.
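The prompt-formatting heuristic above can be enforced with a small validation check before an evaluation run. This is a sketch; the accepted delimiter set is an assumption, not a rule from the source.

```python
# Validate evaluation prompts: reject trailing whitespace (which can leave
# non-fine-tuned models producing no output) and require a terminal delimiter.
DELIMITERS = (":", "?", ".")  # assumed set of acceptable prompt terminators

def check_prompt(prompt):
    """Return (ok, reason) for a single evaluation prompt."""
    if prompt != prompt.rstrip():
        return False, "trailing whitespace"
    if not prompt.endswith(DELIMITERS):
        return False, "missing terminal delimiter"
    return True, "ok"

def check_prompt_set(prompts):
    """Return the prompts that fail the formatting check, with reasons."""
    return [(p, reason) for p in prompts
            for ok, reason in [check_prompt(p)] if not ok]
```

Running `check_prompt_set` over the fixed prompt set before each evaluation catches formatting regressions before they skew a baseline comparison.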
