Implementation: Microsoft DeepSpeedExamples Prompt Eval
Type
Pattern Doc (CLI script)
Overview
A concrete tool from the DeepSpeed-Chat library for qualitative, side-by-side evaluation of a baseline language model against a fine-tuned one.
Description
prompt_eval.py is a command-line evaluation script that performs qualitative side-by-side comparison of two language models. It:
- Loads a baseline model and a fine-tuned model using `create_hf_model()` from the DeepSpeed-Chat utilities.
- Loads a shared tokenizer from the baseline model path.
- Selects a set of built-in evaluation prompts based on the specified language (English, Chinese, or Japanese).
- For each prompt, generates responses from both models using greedy decoding.
- Prints the responses in a clearly labeled side-by-side format for human inspection.
The script uses deepspeed.get_accelerator() for device placement, placing both models on GPU 0. It supports additional decoding strategies (beam search, multinomial sampling, diverse beam search, contrastive search) through commented-out code blocks that users can enable as needed.
Code Reference
- File: `applications/DeepSpeed-Chat/training/step1_supervised_finetuning/prompt_eval.py`
- Lines: 1-135
Signature (CLI Arguments)
```bash
python prompt_eval.py \
    --model_name_or_path_baseline <path_to_baseline> \
    --model_name_or_path_finetune <path_to_finetuned> \
    --max_new_tokens 100 \
    --language English
```
Full Argument Reference
| Argument | Type | Default | Required | Description |
|---|---|---|---|---|
| `--model_name_or_path_baseline` | str | — | Yes | Path to the baseline model (pre-trained or earlier checkpoint) |
| `--model_name_or_path_finetune` | str | — | Yes | Path to the fine-tuned model to evaluate |
| `--num_beams` | int | 1 | No | Number of beams for beam search decoding |
| `--num_beam_groups` | int | 1 | No | Number of beam groups for diverse beam search |
| `--top_k` | int | 4 | No | Top-k parameter for contrastive search |
| `--penalty_alpha` | float | 0.6 | No | Degeneration penalty for contrastive search |
| `--num_return_sequences` | int | 1 | No | Number of sequences to return per prompt |
| `--max_new_tokens` | int | 100 | No | Maximum number of new tokens to generate per response |
| `--language` | str | English | No | Language for evaluation prompts (English, Chinese, or Japanese) |
| `--add_eot_token` | flag | False | No | Add `<\|endoftext\|>` as an additional special token to the tokenizer |
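An argparse setup mirroring the table above might look like the following. This is a sketch reconstructed from the argument reference, not code copied from the script:

```python
import argparse

def parse_args(argv=None):
    # Sketch of an argument parser matching the reference table above.
    parser = argparse.ArgumentParser(description="Side-by-side model evaluation")
    parser.add_argument("--model_name_or_path_baseline", type=str, required=True)
    parser.add_argument("--model_name_or_path_finetune", type=str, required=True)
    parser.add_argument("--num_beams", type=int, default=1)
    parser.add_argument("--num_beam_groups", type=int, default=1)
    parser.add_argument("--top_k", type=int, default=4)
    parser.add_argument("--penalty_alpha", type=float, default=0.6)
    parser.add_argument("--num_return_sequences", type=int, default=1)
    parser.add_argument("--max_new_tokens", type=int, default=100)
    parser.add_argument("--language", type=str, default="English",
                        choices=["English", "Chinese", "Japanese"])
    parser.add_argument("--add_eot_token", action="store_true",
                        help="Add <|endoftext|> as an additional special token")
    return parser.parse_args(argv)

# Only the two model paths are required; everything else falls back to defaults.
args = parse_args(["--model_name_or_path_baseline", "base",
                   "--model_name_or_path_finetune", "sft"])
```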
Built-in Evaluation Prompts
English
| # | Prompt |
|---|---|
| 1 | Human: Please tell me about Microsoft in a few sentence? Assistant: |
| 2 | Human: Explain the moon landing to a 6 year old in a few sentences. Assistant: |
| 3 | Human: Write a short poem about a wise frog. Assistant: |
| 4 | Human: Who was president of the United States in 1955? Assistant: |
| 5 | Human: How does a telescope work? Assistant: |
| 6 | Human: Why do birds migrate south for the winter? Assistant: |
Chinese
| # | Prompt |
|---|---|
| 1 | Human: 请用几句话介绍一下微软? Assistant: |
| 2 | Human: 用几句话向6岁的孩子解释登月。 Assistant: |
| 3 | Human: 写一首关于一只聪明的青蛙的短诗。 Assistant: |
| 4 | Human: 谁是1955年的美国总统? Assistant: |
| 5 | Human: 望远镜是如何工作的? Assistant: |
| 6 | Human: 鸟类为什么要南迁过冬? Assistant: |
Japanese
| # | Prompt |
|---|---|
| 1 | Human: マイクロソフトについて簡単に教えてください。 Assistant: |
| 2 | Human: 6歳児に月面着陸を短い文で説明する。 Assistant: |
| 3 | Human: 賢いカエルについて短い詩を書いてください。 Assistant: |
| 4 | Human: 1955年のアメリカ合衆国大統領は誰? Assistant: |
| 5 | Human: 望遠鏡はどのように機能しますか? Assistant: |
| 6 | Human: 鳥が冬に南に移動するのはなぜですか? Assistant: |
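The script branches on `args.language` to pick one of the three built-in prompt lists above. The selection logic can be sketched as follows (prompt lists abbreviated for illustration; the real script holds all six prompts per language inline):

```python
# Sketch of the language-based prompt selection (lists abbreviated).
PROMPTS = {
    "English": ["Human: Please tell me about Microsoft in a few sentence? Assistant:",
                "Human: How does a telescope work? Assistant:"],
    "Chinese": ["Human: 请用几句话介绍一下微软? Assistant:"],
    "Japanese": ["Human: 望遠鏡はどのように機能しますか? Assistant:"],
}

def select_prompts(language):
    if language not in PROMPTS:
        raise ValueError(f"Unsupported language: {language}")
    return PROMPTS[language]

print(len(select_prompts("English")))
```

Note that every prompt follows the `Human: ... Assistant:` template, so the model is always completing the assistant's turn.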
Key Internal Functions
generate()
```python
def generate(model, tokenizer, inputs, num_beams=1, num_beam_groups=1,
             do_sample=False, num_return_sequences=1, max_new_tokens=100):
    generate_ids = model.generate(inputs.input_ids,
                                  num_beams=num_beams,
                                  num_beam_groups=num_beam_groups,
                                  do_sample=do_sample,
                                  num_return_sequences=num_return_sequences,
                                  max_new_tokens=max_new_tokens)
    result = tokenizer.batch_decode(generate_ids,
                                    skip_special_tokens=True,
                                    clean_up_tokenization_spaces=False)
    return result
```
Performs standard HuggingFace model.generate() with configurable decoding parameters and decodes the output tokens back to text.
generate_constrastive_search()
```python
def generate_constrastive_search(model, tokenizer, inputs, top_k=4,
                                 penalty_alpha=0.6, num_return_sequences=1,
                                 max_new_tokens=100):
    generate_ids = model.generate(inputs.input_ids,
                                  top_k=top_k,
                                  penalty_alpha=penalty_alpha,
                                  num_return_sequences=num_return_sequences,
                                  max_new_tokens=max_new_tokens)
    result = tokenizer.batch_decode(generate_ids,
                                    skip_special_tokens=True,
                                    clean_up_tokenization_spaces=False)
    return result
```
Uses contrastive search decoding with a degeneration penalty to reduce repetitive outputs.
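Contrastive search scores each candidate token as a trade-off between its model probability and its similarity to tokens already generated: roughly `(1 - alpha) * p(v | context) - alpha * max_sim(v, context)`, so highly probable but repetitive candidates lose out. The scoring rule (not the full decoding loop) can be illustrated with NumPy on toy embeddings:

```python
import numpy as np

def contrastive_score(probs, cand_embs, context_embs, penalty_alpha=0.6):
    # probs: model probability for each of the k candidate tokens, shape (k,)
    # cand_embs: candidate token hidden states, shape (k, d)
    # context_embs: hidden states of tokens generated so far, shape (t, d)
    cand = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    ctx = context_embs / np.linalg.norm(context_embs, axis=1, keepdims=True)
    # Degeneration penalty: max cosine similarity to any context token.
    max_sim = (cand @ ctx.T).max(axis=1)
    return (1 - penalty_alpha) * probs - penalty_alpha * max_sim

# A high-probability candidate whose embedding duplicates the context
# is penalized below a lower-probability but novel candidate.
probs = np.array([0.6, 0.4])
cand_embs = np.array([[1.0, 0.0], [0.0, 1.0]])  # candidate 0 repeats context
context_embs = np.array([[1.0, 0.0]])
scores = contrastive_score(probs, cand_embs, context_embs)
print(int(np.argmax(scores)))  # picks candidate 1 despite lower probability
```

This is why a larger `--penalty_alpha` pushes the decoder harder away from repetition, at the cost of sometimes rejecting genuinely likely continuations.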
prompt_eval()
```python
def prompt_eval(args, model_baseline, model_fintuned, tokenizer, device, prompts):
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        print("==========Baseline: Greedy=========")
        r_base = generate(model_baseline, tokenizer, inputs,
                          num_beams=1,
                          num_return_sequences=args.num_return_sequences,
                          max_new_tokens=args.max_new_tokens)
        print_utils(r_base)
        print("==========finetune: Greedy=========")
        r_finetune_g = generate(model_fintuned, tokenizer, inputs,
                                num_beams=1,
                                num_return_sequences=args.num_return_sequences,
                                max_new_tokens=args.max_new_tokens)
        print_utils(r_finetune_g)
```
Iterates over each prompt, generates responses from both models using greedy decoding, and prints labeled results.
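The `print_utils` helper used above is defined in the same script; a minimal stand-in that behaves the same way (printing each returned sequence, separated by blank lines, which matters when `--num_return_sequences` is greater than 1) could look like this:

```python
def print_utils(gen_output):
    # Print each generated sequence on its own, separated by blank lines,
    # so multi-sequence outputs (num_return_sequences > 1) stay readable.
    for seq in gen_output:
        print()
        print(seq)
        print()

print_utils(["response A", "response B"])
```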
Example Usage
```bash
# Compare a base OPT-1.3B model against an SFT-trained checkpoint
python prompt_eval.py \
    --model_name_or_path_baseline facebook/opt-1.3b \
    --model_name_or_path_finetune output/step1_sft_checkpoint \
    --max_new_tokens 256 \
    --language English

# Compare with Japanese prompts
python prompt_eval.py \
    --model_name_or_path_baseline facebook/opt-1.3b \
    --model_name_or_path_finetune output/step3_rlhf_checkpoint \
    --max_new_tokens 200 \
    --language Japanese
```
Output Format
For each evaluation prompt, the script prints:
```
==========Baseline: Greedy=========
[Baseline model response text]
==========finetune: Greedy=========
[Fine-tuned model response text]
====================prompt end=============================
```