Implementation: Microsoft DeepSpeedExamples Prompt Eval
Type
Pattern Doc (CLI script)
Overview
A concrete tool from the DeepSpeed-Chat library for qualitative, side-by-side evaluation of a baseline language model against a fine-tuned one.
Description
prompt_eval.py is a command-line evaluation script that performs qualitative side-by-side comparison of two language models. It:
- Loads a baseline model and a fine-tuned model using `create_hf_model()` from the DeepSpeed-Chat utilities.
- Loads a shared tokenizer from the baseline model path.
- Selects a set of built-in evaluation prompts based on the specified language (English, Chinese, or Japanese).
- For each prompt, generates responses from both models using greedy decoding.
- Prints the responses in a clearly labeled side-by-side format for human inspection.
The script uses deepspeed.get_accelerator() for device placement, placing both models on GPU 0. It supports additional decoding strategies (beam search, multinomial sampling, diverse beam search, contrastive search) through commented-out code blocks that users can enable as needed.
Code Reference
- File: `applications/DeepSpeed-Chat/training/step1_supervised_finetuning/prompt_eval.py`
- Lines: 1-135
Signature (CLI Arguments)
```bash
python prompt_eval.py \
    --model_name_or_path_baseline <path_to_baseline> \
    --model_name_or_path_finetune <path_to_finetuned> \
    --max_new_tokens 100 \
    --language English
```
Full Argument Reference
| Argument | Type | Default | Required | Description |
|---|---|---|---|---|
| `--model_name_or_path_baseline` | str | — | Yes | Path to the baseline model (pre-trained or earlier checkpoint) |
| `--model_name_or_path_finetune` | str | — | Yes | Path to the fine-tuned model to evaluate |
| `--num_beams` | int | 1 | No | Number of beams for beam search decoding |
| `--num_beam_groups` | int | 1 | No | Number of beam groups for diverse beam search |
| `--top_k` | int | 4 | No | Top-k parameter for contrastive search |
| `--penalty_alpha` | float | 0.6 | No | Degeneration penalty for contrastive search |
| `--num_return_sequences` | int | 1 | No | Number of sequences to return per prompt |
| `--max_new_tokens` | int | 100 | No | Maximum number of new tokens to generate per response |
| `--language` | str | English | No | Language for evaluation prompts (English, Chinese, or Japanese) |
| `--add_eot_token` | flag | False | No | Add `<\|endoftext\|>` as an additional special token to the tokenizer |
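An argparse setup mirroring the table above might look like the following. This is a sketch reconstructed from the argument reference, not code copied from the script:

```python
import argparse

def parse_args(argv=None):
    # Sketch of an argument parser matching the reference table above.
    parser = argparse.ArgumentParser(description="Side-by-side model evaluation")
    parser.add_argument("--model_name_or_path_baseline", type=str, required=True)
    parser.add_argument("--model_name_or_path_finetune", type=str, required=True)
    parser.add_argument("--num_beams", type=int, default=1)
    parser.add_argument("--num_beam_groups", type=int, default=1)
    parser.add_argument("--top_k", type=int, default=4)
    parser.add_argument("--penalty_alpha", type=float, default=0.6)
    parser.add_argument("--num_return_sequences", type=int, default=1)
    parser.add_argument("--max_new_tokens", type=int, default=100)
    parser.add_argument("--language", type=str, default="English",
                        choices=["English", "Chinese", "Japanese"])
    parser.add_argument("--add_eot_token", action="store_true",
                        help="Add <|endoftext|> as an additional special token")
    return parser.parse_args(argv)

# Only the two model paths are required; everything else falls back to defaults.
args = parse_args(["--model_name_or_path_baseline", "base",
                   "--model_name_or_path_finetune", "sft"])
```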
Built-in Evaluation Prompts
English
| # | Prompt |
|---|---|
| 1 | Human: Please tell me about Microsoft in a few sentence? Assistant: |
| 2 | Human: Explain the moon landing to a 6 year old in a few sentences. Assistant: |
| 3 | Human: Write a short poem about a wise frog. Assistant: |
| 4 | Human: Who was president of the United States in 1955? Assistant: |
| 5 | Human: How does a telescope work? Assistant: |
| 6 | Human: Why do birds migrate south for the winter? Assistant: |
Chinese
| # | Prompt |
|---|---|
| 1 | Human: 请用几句话介绍一下微软? Assistant: |
| 2 | Human: 用几句话向6岁的孩子解释登月。 Assistant: |
| 3 | Human: 写一首关于一只聪明的青蛙的短诗。 Assistant: |
| 4 | Human: 谁是1955年的美国总统? Assistant: |
| 5 | Human: 望远镜是如何工作的? Assistant: |
| 6 | Human: 鸟类为什么要南迁过冬? Assistant: |
Japanese
| # | Prompt |
|---|---|
| 1 | Human: マイクロソフトについて簡単に教えてください。 Assistant: |
| 2 | Human: 6歳児に月面着陸を短い文で説明する。 Assistant: |
| 3 | Human: 賢いカエルについて短い詩を書いてください。 Assistant: |
| 4 | Human: 1955年のアメリカ合衆国大統領は誰? Assistant: |
| 5 | Human: 望遠鏡はどのように機能しますか? Assistant: |
| 6 | Human: 鳥が冬に南に移動するのはなぜですか? Assistant: |
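The script branches on `args.language` to pick one of the three built-in prompt lists above. The selection logic can be sketched as follows (prompt lists abbreviated for illustration; the real script holds all six prompts per language inline):

```python
# Sketch of the language-based prompt selection (lists abbreviated).
PROMPTS = {
    "English": ["Human: Please tell me about Microsoft in a few sentence? Assistant:",
                "Human: How does a telescope work? Assistant:"],
    "Chinese": ["Human: 请用几句话介绍一下微软? Assistant:"],
    "Japanese": ["Human: 望遠鏡はどのように機能しますか? Assistant:"],
}

def select_prompts(language):
    if language not in PROMPTS:
        raise ValueError(f"Unsupported language: {language}")
    return PROMPTS[language]

print(len(select_prompts("English")))
```

Note that every prompt follows the `Human: ... Assistant:` template, so the model is always completing the assistant's turn.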
Key Internal Functions
generate()
```python
def generate(model, tokenizer, inputs, num_beams=1, num_beam_groups=1,
             do_sample=False, num_return_sequences=1, max_new_tokens=100):
    generate_ids = model.generate(inputs.input_ids,
                                  num_beams=num_beams,
                                  num_beam_groups=num_beam_groups,
                                  do_sample=do_sample,
                                  num_return_sequences=num_return_sequences,
                                  max_new_tokens=max_new_tokens)
    result = tokenizer.batch_decode(generate_ids,
                                    skip_special_tokens=True,
                                    clean_up_tokenization_spaces=False)
    return result
```
Performs standard HuggingFace model.generate() with configurable decoding parameters and decodes the output tokens back to text.
generate_constrastive_search()
```python
def generate_constrastive_search(model, tokenizer, inputs, top_k=4,
                                 penalty_alpha=0.6, num_return_sequences=1,
                                 max_new_tokens=100):
    generate_ids = model.generate(inputs.input_ids,
                                  top_k=top_k,
                                  penalty_alpha=penalty_alpha,
                                  num_return_sequences=num_return_sequences,
                                  max_new_tokens=max_new_tokens)
    result = tokenizer.batch_decode(generate_ids,
                                    skip_special_tokens=True,
                                    clean_up_tokenization_spaces=False)
    return result
```
Uses contrastive search decoding with a degeneration penalty to reduce repetitive outputs.
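Contrastive search scores each candidate token as a trade-off between its model probability and its similarity to tokens already generated: roughly `(1 - alpha) * p(v | context) - alpha * max_sim(v, context)`, so highly probable but repetitive candidates lose out. The scoring rule (not the full decoding loop) can be illustrated with NumPy on toy embeddings:

```python
import numpy as np

def contrastive_score(probs, cand_embs, context_embs, penalty_alpha=0.6):
    # probs: model probability for each of the k candidate tokens, shape (k,)
    # cand_embs: candidate token hidden states, shape (k, d)
    # context_embs: hidden states of tokens generated so far, shape (t, d)
    cand = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    ctx = context_embs / np.linalg.norm(context_embs, axis=1, keepdims=True)
    # Degeneration penalty: max cosine similarity to any context token.
    max_sim = (cand @ ctx.T).max(axis=1)
    return (1 - penalty_alpha) * probs - penalty_alpha * max_sim

# A high-probability candidate whose embedding duplicates the context
# is penalized below a lower-probability but novel candidate.
probs = np.array([0.6, 0.4])
cand_embs = np.array([[1.0, 0.0], [0.0, 1.0]])  # candidate 0 repeats context
context_embs = np.array([[1.0, 0.0]])
scores = contrastive_score(probs, cand_embs, context_embs)
print(int(np.argmax(scores)))  # picks candidate 1 despite lower probability
```

This is why a larger `--penalty_alpha` pushes the decoder harder away from repetition, at the cost of sometimes rejecting genuinely likely continuations.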
prompt_eval()
```python
def prompt_eval(args, model_baseline, model_fintuned, tokenizer, device, prompts):
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        print("==========Baseline: Greedy=========")
        r_base = generate(model_baseline, tokenizer, inputs,
                          num_beams=1,
                          num_return_sequences=args.num_return_sequences,
                          max_new_tokens=args.max_new_tokens)
        print_utils(r_base)
        print("==========finetune: Greedy=========")
        r_finetune_g = generate(model_fintuned, tokenizer, inputs,
                                num_beams=1,
                                num_return_sequences=args.num_return_sequences,
                                max_new_tokens=args.max_new_tokens)
        print_utils(r_finetune_g)
```
Iterates over each prompt, generates responses from both models using greedy decoding, and prints labeled results.
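The `print_utils` helper used above is defined in the same script; a minimal stand-in that behaves the same way (printing each returned sequence, separated by blank lines, which matters when `--num_return_sequences` is greater than 1) could look like this:

```python
def print_utils(gen_output):
    # Print each generated sequence on its own, separated by blank lines,
    # so multi-sequence outputs (num_return_sequences > 1) stay readable.
    for seq in gen_output:
        print()
        print(seq)
        print()

print_utils(["response A", "response B"])
```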
Example Usage
```bash
# Compare a base OPT-1.3B model against an SFT-trained checkpoint
python prompt_eval.py \
    --model_name_or_path_baseline facebook/opt-1.3b \
    --model_name_or_path_finetune output/step1_sft_checkpoint \
    --max_new_tokens 256 \
    --language English

# Compare with Japanese prompts
python prompt_eval.py \
    --model_name_or_path_baseline facebook/opt-1.3b \
    --model_name_or_path_finetune output/step3_rlhf_checkpoint \
    --max_new_tokens 200 \
    --language Japanese
```
Output Format
For each evaluation prompt, the script prints:
```
==========Baseline: Greedy=========
[Baseline model response text]
==========finetune: Greedy=========
[Fine-tuned model response text]
====================prompt end=============================
```