Implementation:Microsoft DeepSpeedExamples Prompt Eval

From Leeroopedia


Type

Pattern Doc (CLI script)

Overview

Concrete command-line tool, provided by the DeepSpeed-Chat library, for side-by-side qualitative evaluation of a baseline language model against its fine-tuned counterpart.

Description

prompt_eval.py is a command-line evaluation script that performs qualitative side-by-side comparison of two language models. It:

  1. Loads a baseline model and a fine-tuned model using create_hf_model() from the DeepSpeed-Chat utilities.
  2. Loads a shared tokenizer from the baseline model path.
  3. Selects a set of built-in evaluation prompts based on the specified language (English, Chinese, or Japanese).
  4. For each prompt, generates responses from both models using greedy decoding.
  5. Prints the responses in a clearly labeled side-by-side format for human inspection.

The script uses deepspeed.get_accelerator() for device placement, placing both models on GPU 0. It supports additional decoding strategies (beam search, multinomial sampling, diverse beam search, contrastive search) through commented-out code blocks that users can enable as needed.
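The strategies differ only in the keyword arguments passed to model.generate(). A minimal sketch of the presets each strategy would use (argument names follow the HuggingFace transformers generate() API; the numeric defaults mirror the script's CLI defaults and are illustrative):

```python
# Keyword-argument presets for the decoding strategies the script supports.
# Names follow the HuggingFace transformers `model.generate()` API; values
# mirror the script's CLI defaults.
def decoding_kwargs(strategy, num_beams=1, num_beam_groups=1,
                    top_k=4, penalty_alpha=0.6):
    presets = {
        # Greedy: the default used by prompt_eval for both models.
        "greedy": dict(do_sample=False, num_beams=1),
        # Beam search: keep `num_beams` hypotheses per step.
        "beam": dict(do_sample=False, num_beams=num_beams),
        # Multinomial sampling: sample from the token distribution.
        "sampling": dict(do_sample=True, num_beams=1),
        # Diverse beam search: split beams into dissimilar groups.
        "diverse_beam": dict(do_sample=False, num_beams=num_beams,
                             num_beam_groups=num_beam_groups),
        # Contrastive search: top-k candidates plus a degeneration penalty.
        "contrastive": dict(top_k=top_k, penalty_alpha=penalty_alpha),
    }
    return presets[strategy]
```

Enabling one of the commented-out blocks in the script amounts to calling model.generate() with the corresponding preset.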

Code Reference

  • File: applications/DeepSpeed-Chat/training/step1_supervised_finetuning/prompt_eval.py
  • Lines: 1-135

Signature (CLI Arguments)

python prompt_eval.py \
    --model_name_or_path_baseline <path_to_baseline> \
    --model_name_or_path_finetune <path_to_finetuned> \
    --max_new_tokens 100 \
    --language English

Full Argument Reference

Argument | Type | Default | Required | Description
---------|------|---------|----------|------------
--model_name_or_path_baseline | str | — | Yes | Path to the baseline model (pre-trained or earlier checkpoint)
--model_name_or_path_finetune | str | — | Yes | Path to the fine-tuned model to evaluate
--num_beams | int | 1 | No | Number of beams for beam search decoding
--num_beam_groups | int | 1 | No | Number of beam groups for diverse beam search
--top_k | int | 4 | No | Top-k parameter for contrastive search
--penalty_alpha | float | 0.6 | No | Degeneration penalty for contrastive search
--num_return_sequences | int | 1 | No | Number of sequences to return per prompt
--max_new_tokens | int | 100 | No | Maximum number of new tokens to generate per response
--language | str | English | No | Language for evaluation prompts (English, Chinese, or Japanese)
--add_eot_token | flag | False | No | Add <|endoftext|> as an additional special token to the tokenizer
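The argument table above corresponds to a standard argparse parser. A sketch of what that parser would look like (help strings omitted; the script's actual help text may differ):

```python
import argparse

def build_parser():
    # Mirrors the argument table above; required arguments have no default.
    p = argparse.ArgumentParser(
        description="Side-by-side eval of a baseline vs a fine-tuned model")
    p.add_argument("--model_name_or_path_baseline", type=str, required=True)
    p.add_argument("--model_name_or_path_finetune", type=str, required=True)
    p.add_argument("--num_beams", type=int, default=1)
    p.add_argument("--num_beam_groups", type=int, default=1)
    p.add_argument("--top_k", type=int, default=4)
    p.add_argument("--penalty_alpha", type=float, default=0.6)
    p.add_argument("--num_return_sequences", type=int, default=1)
    p.add_argument("--max_new_tokens", type=int, default=100)
    p.add_argument("--language", type=str, default="English",
                   choices=["English", "Chinese", "Japanese"])
    p.add_argument("--add_eot_token", action="store_true")
    return p
```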

Built-in Evaluation Prompts

English

# Prompt
1 Human: Please tell me about Microsoft in a few sentence? Assistant:
2 Human: Explain the moon landing to a 6 year old in a few sentences. Assistant:
3 Human: Write a short poem about a wise frog. Assistant:
4 Human: Who was president of the United States in 1955? Assistant:
5 Human: How does a telescope work? Assistant:
6 Human: Why do birds migrate south for the winter? Assistant:

Chinese

# Prompt
1 Human: 请用几句话介绍一下微软? Assistant:
2 Human: 用几句话向6岁的孩子解释登月。 Assistant:
3 Human: 写一首关于一只聪明的青蛙的短诗。 Assistant:
4 Human: 谁是1955年的美国总统? Assistant:
5 Human: 望远镜是如何工作的? Assistant:
6 Human: 鸟类为什么要南迁过冬? Assistant:

Japanese

# Prompt
1 Human: マイクロソフトについて簡単に教えてください。 Assistant:
2 Human: 6歳児に月面着陸を短い文で説明する。 Assistant:
3 Human: 賢いカエルについて短い詩を書いてください。 Assistant:
4 Human: 1955年のアメリカ合衆国大統領は誰? Assistant:
5 Human: 望遠鏡はどのように機能しますか? Assistant:
6 Human: 鳥が冬に南に移動するのはなぜですか? Assistant:
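Selecting among the three prompt sets above reduces to a lookup keyed on --language. A sketch (prompt lists truncated to their first entry; the script's actual variable names may differ):

```python
# Map --language to a built-in prompt set (truncated to one prompt each;
# the full sets are listed above).
PROMPTS = {
    "English": ["Human: Please tell me about Microsoft in a few sentence? Assistant:"],
    "Chinese": ["Human: 请用几句话介绍一下微软? Assistant:"],
    "Japanese": ["Human: マイクロソフトについて簡単に教えてください。 Assistant:"],
}

def select_prompts(language):
    # Unrecognized languages are rejected rather than silently defaulted.
    if language not in PROMPTS:
        raise ValueError(f"Unsupported language: {language}")
    return PROMPTS[language]
```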

Key Internal Functions

generate()

def generate(model, tokenizer, inputs, num_beams=1, num_beam_groups=1,
             do_sample=False, num_return_sequences=1, max_new_tokens=100):
    generate_ids = model.generate(
        inputs.input_ids,
        num_beams=num_beams,
        num_beam_groups=num_beam_groups,
        do_sample=do_sample,
        num_return_sequences=num_return_sequences,
        max_new_tokens=max_new_tokens
    )
    result = tokenizer.batch_decode(generate_ids, skip_special_tokens=True,
                                    clean_up_tokenization_spaces=False)
    return result

Performs standard HuggingFace model.generate() with configurable decoding parameters and decodes the output tokens back to text.

generate_constrastive_search()

def generate_constrastive_search(model, tokenizer, inputs, top_k=4,
                                  penalty_alpha=0.6, num_return_sequences=1,
                                  max_new_tokens=100):
    generate_ids = model.generate(
        inputs.input_ids,
        top_k=top_k,
        penalty_alpha=penalty_alpha,
        num_return_sequences=num_return_sequences,
        max_new_tokens=max_new_tokens
    )
    result = tokenizer.batch_decode(generate_ids, skip_special_tokens=True,
                                    clean_up_tokenization_spaces=False)
    return result

Uses contrastive search decoding with a degeneration penalty (penalty_alpha) to reduce repetitive outputs. The misspelling "constrastive" in the function name matches the source file.
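To see why penalty_alpha discourages repetition, consider the contrastive-search selection rule: each candidate token is scored as (1 − α)·probability − α·max-similarity, where max-similarity is the highest cosine similarity between the candidate's hidden state and any previously generated token's hidden state. A toy illustration (not the transformers implementation):

```python
# Toy contrastive-search scoring: a likely but repetitive candidate can
# lose to a slightly less likely, more novel one.
def contrastive_score(prob, max_sim, penalty_alpha=0.6):
    # (1 - alpha) * model confidence  -  alpha * degeneration penalty
    return (1.0 - penalty_alpha) * prob - penalty_alpha * max_sim

repetitive = contrastive_score(prob=0.50, max_sim=0.95)  # 0.20 - 0.57 = -0.37
novel      = contrastive_score(prob=0.35, max_sim=0.20)  # 0.14 - 0.12 =  0.02
```

With penalty_alpha=0 the rule degenerates to greedy decoding; larger values trade model confidence for novelty.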

prompt_eval()

def prompt_eval(args, model_baseline, model_fintuned, tokenizer, device, prompts):
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        print("==========Baseline: Greedy=========")
        r_base = generate(model_baseline, tokenizer, inputs,
                          num_beams=1,
                          num_return_sequences=args.num_return_sequences,
                          max_new_tokens=args.max_new_tokens)
        print_utils(r_base)
        print("==========finetune: Greedy=========")
        r_finetune_g = generate(model_fintuned, tokenizer, inputs,
                                num_beams=1,
                                num_return_sequences=args.num_return_sequences,
                                max_new_tokens=args.max_new_tokens)
        print_utils(r_finetune_g)
        # The full script also contains commented-out calls for the other
        # decoding strategies here, then closes each prompt with a separator:
        print("====================prompt end=============================")

Iterates over each prompt, generates responses from both models using greedy decoding, and prints labeled results.

Example Usage

# Compare a base OPT-1.3B model against an SFT-trained checkpoint
python prompt_eval.py \
    --model_name_or_path_baseline facebook/opt-1.3b \
    --model_name_or_path_finetune output/step1_sft_checkpoint \
    --max_new_tokens 256 \
    --language English

# Compare with Japanese prompts
python prompt_eval.py \
    --model_name_or_path_baseline facebook/opt-1.3b \
    --model_name_or_path_finetune output/step3_rlhf_checkpoint \
    --max_new_tokens 200 \
    --language Japanese

Output Format

For each evaluation prompt, the script prints:

==========Baseline: Greedy=========

[Baseline model response text]

==========finetune: Greedy=========

[Fine-tuned model response text]

====================prompt end=============================
