
Implementation:Lm sys FastChat Gen Model Answer

From Leeroopedia


Field Value
Page Type Implementation
Title Gen Model Answer
Repository lm-sys/FastChat
Knowledge Sources Source code analysis of fastchat/llm_judge/gen_model_answer.py, fastchat/llm_judge/common.py
Domains LLM Evaluation, Benchmarking, Model Inference
Last Updated 2026-02-07 14:00 GMT
Implements Principle:Lm_sys_FastChat_MT_Bench_Answer_Generation

Overview

This implementation provides the machinery for generating model answers to MT-Bench questions. It consists of two primary functions -- run_eval (the orchestrator) and get_model_answers (the inference worker) -- in gen_model_answer.py, along with the load_questions utility from common.py. Together they load a model, iterate through benchmark questions, generate multi-turn responses with category-appropriate temperature settings, and write structured JSONL output.

Description

The answer generation pipeline proceeds as follows:

  1. run_eval loads the question set via load_questions, shuffles the questions for load balancing, and dispatches work to one or more GPU workers.
  2. If multiple workers are needed (num_gpus_total // num_gpus_per_model > 1), Ray is used to distribute chunks of questions across remote workers.
  3. Each worker calls get_model_answers, which loads the model onto GPU(s), iterates through its assigned questions, and generates responses for both conversation turns.
  4. Per-category temperature is looked up from temperature_config in common.py. If the category is not in the config, the default temperature of 0.7 is used.
  5. For each question and each choice (controlled by num_choices), the function constructs a conversation using the model's template, generates output token-by-token, handles stop tokens and stop strings, strips special tokens, and records the output.
  6. Answers are appended to the output JSONL file. After all workers finish, reorg_answer_file deduplicates entries by question_id and sorts them.
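The per-category temperature lookup in step 4 can be sketched as follows. This is a minimal sketch: the mapping values shown mirror the spirit of temperature_config in fastchat/llm_judge/common.py but should be treated as illustrative, and pick_temperature is a hypothetical helper name.

```python
# Illustrative per-category sampling temperatures; the authoritative
# mapping is temperature_config in fastchat/llm_judge/common.py.
temperature_config = {
    "writing": 0.7,
    "roleplay": 0.7,
    "extraction": 0.0,
    "math": 0.0,
    "coding": 0.0,
    "reasoning": 0.0,
    "stem": 0.1,
    "humanities": 0.1,
}

def pick_temperature(question: dict, default: float = 0.7) -> float:
    """Return the sampling temperature for a question's category,
    falling back to the default of 0.7 when the category is unlisted."""
    return temperature_config.get(question["category"], default)
```

Deterministic categories such as math and coding use temperature 0.0 so that answers are reproducible, while open-ended categories such as writing sample at 0.7.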

Usage

Command-Line Interface

python3 -m fastchat.llm_judge.gen_model_answer \
    --model-path lmsys/vicuna-7b-v1.5 \
    --model-id vicuna-7b-v1.5 \
    --bench-name mt_bench \
    --max-new-token 1024 \
    --num-choices 1 \
    --num-gpus-per-model 1 \
    --num-gpus-total 1 \
    --dtype float16

CLI Parameters

Parameter Type Default Description
--model-path str (required) Path to model weights (local folder or Hugging Face repo ID)
--model-id str (required) Custom name for the model (used in output filenames)
--bench-name str "mt_bench" Name of the benchmark question set
--question-begin int None Debug option: begin index of questions
--question-end int None Debug option: end index of questions
--answer-file str None Custom output answer file path (auto-generated if not set)
--max-new-token int 1024 Maximum number of new generated tokens
--num-choices int 1 Number of completion choices to generate per question
--num-gpus-per-model int 1 Number of GPUs per model instance
--num-gpus-total int 1 Total number of GPUs available
--max-gpu-memory str None Maximum GPU memory used for model weights per GPU
--dtype str None Override default dtype (choices: float32, float16, bfloat16)
--revision str "main" Model revision to load

Programmatic Usage

from fastchat.llm_judge.gen_model_answer import run_eval
from fastchat.llm_judge.common import load_questions

# Load questions
questions = load_questions("data/mt_bench/question.jsonl", begin=None, end=None)
print(f"Loaded {len(questions)} questions")

# Run evaluation
run_eval(
    model_path="lmsys/vicuna-7b-v1.5",
    model_id="vicuna-7b-v1.5",
    question_file="data/mt_bench/question.jsonl",
    question_begin=None,
    question_end=None,
    answer_file="data/mt_bench/model_answer/vicuna-7b-v1.5.jsonl",
    max_new_token=1024,
    num_choices=1,
    num_gpus_per_model=1,
    num_gpus_total=1,
    max_gpu_memory=None,
    dtype=None,
    revision="main",
)

Code Reference

Source Location

Function File Lines
run_eval fastchat/llm_judge/gen_model_answer.py L21-71
get_model_answers fastchat/llm_judge/gen_model_answer.py L73-190
reorg_answer_file fastchat/llm_judge/gen_model_answer.py L193-204
load_questions fastchat/llm_judge/common.py L88-96
temperature_config fastchat/llm_judge/common.py L40-50

Signature

def run_eval(
    model_path,
    model_id,
    question_file,
    question_begin,
    question_end,
    answer_file,
    max_new_token,
    num_choices,
    num_gpus_per_model,
    num_gpus_total,
    max_gpu_memory,
    dtype,
    revision,
):
    ...
@torch.inference_mode()
def get_model_answers(
    model_path,
    model_id,
    questions,
    answer_file,
    max_new_token,
    num_choices,
    num_gpus_per_model,
    max_gpu_memory,
    dtype,
    revision,
):
    ...
def load_questions(question_file: str, begin: Optional[int], end: Optional[int]) -> list[dict]:
    ...

Import

from fastchat.llm_judge.gen_model_answer import run_eval
from fastchat.llm_judge.common import load_questions, temperature_config

I/O Contract

Inputs

Input Format Description
Question file JSONL (data/mt_bench/question.jsonl) Each line is a JSON object with fields: question_id (int), category (str), turns (list of 2 strings)
Model weights Hugging Face model directory or repo ID The pre-trained model to evaluate (e.g., lmsys/vicuna-7b-v1.5)

Example input record:

{
    "question_id": 81,
    "category": "writing",
    "turns": [
        "Compose a captivating travel blog post about a recent trip to Hawaii...",
        "Rewrite your previous response. Start every sentence with the letter A."
    ]
}
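Reading and slicing this file can be sketched as below, in the spirit of load_questions with the --question-begin/--question-end slice applied. The helper name read_questions is hypothetical.

```python
import json
from typing import Optional

def read_questions(path: str, begin: Optional[int] = None,
                   end: Optional[int] = None) -> list[dict]:
    """Parse one JSON object per non-empty line, then apply the
    optional [begin:end) slice used for debugging subsets."""
    with open(path, encoding="utf-8") as f:
        questions = [json.loads(line) for line in f if line.strip()]
    return questions[begin:end]
```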

Outputs

Output Format Description
Answer file JSONL (data/mt_bench/model_answer/{model_id}.jsonl) Each line is a JSON object with the model's response

Output record fields:

Field Type Description
question_id int The question identifier, matching the input question
answer_id str A unique short UUID generated by shortuuid.uuid()
model_id str The model identifier string
choices list[dict] List of choices, each with index (int) and turns (list of response strings)
tstamp float Unix timestamp of when the answer was generated

Example output record:

{
    "question_id": 81,
    "answer_id": "AbCdEfGhIjKlMnOpQrStUv",
    "model_id": "vicuna-7b-v1.5",
    "choices": [
        {
            "index": 0,
            "turns": [
                "Hawaii is a paradise on Earth...",
                "A breathtaking archipelago awaits..."
            ]
        }
    ],
    "tstamp": 1707307200.123
}
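Assembling one such record can be sketched as follows. make_answer_record is a hypothetical helper, and uuid4().hex stands in for the shortuuid.uuid() call that the real code uses for answer_id.

```python
import json
import time
from uuid import uuid4

def make_answer_record(question_id: int, model_id: str,
                       choice_turns: list[list[str]]) -> dict:
    """Build one answer record in the JSONL schema above."""
    return {
        "question_id": question_id,
        "answer_id": uuid4().hex,  # stand-in for shortuuid.uuid()
        "model_id": model_id,
        "choices": [
            {"index": i, "turns": turns}
            for i, turns in enumerate(choice_turns)
        ],
        "tstamp": time.time(),
    }

record = make_answer_record(
    81, "vicuna-7b-v1.5",
    [["Hawaii is a paradise on Earth...", "A breathtaking archipelago awaits..."]],
)
line = json.dumps(record)  # appended as one line to the answer file
```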

Usage Examples

Basic Single-GPU Evaluation

python3 -m fastchat.llm_judge.gen_model_answer \
    --model-path lmsys/vicuna-7b-v1.5 \
    --model-id vicuna-7b-v1.5

Multi-GPU Parallel Evaluation

python3 -m fastchat.llm_judge.gen_model_answer \
    --model-path lmsys/vicuna-13b-v1.5 \
    --model-id vicuna-13b-v1.5 \
    --num-gpus-per-model 2 \
    --num-gpus-total 8 \
    --dtype bfloat16

This distributes the workload across 4 workers (8 total GPUs / 2 per model), each running an independent model instance on 2 GPUs.
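The fan-out can be sketched as below: compute the worker count from the GPU budget, then hand each worker a contiguous chunk of the (shuffled) question list. chunk_questions is a hypothetical helper; the real code dispatches each chunk to a Ray remote worker running get_model_answers.

```python
def chunk_questions(questions: list, num_gpus_total: int,
                    num_gpus_per_model: int) -> list[list]:
    """Split the question list into one chunk per worker, where the
    worker count is num_gpus_total // num_gpus_per_model."""
    num_workers = num_gpus_total // num_gpus_per_model
    chunk_size = -(-len(questions) // num_workers)  # ceiling division
    return [questions[i:i + chunk_size]
            for i in range(0, len(questions), chunk_size)]
```

With 8 total GPUs and 2 GPUs per model this yields 4 chunks, matching the 4 independent model instances described above.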

Evaluating a Subset of Questions

python3 -m fastchat.llm_judge.gen_model_answer \
    --model-path lmsys/vicuna-7b-v1.5 \
    --model-id vicuna-7b-v1.5 \
    --question-begin 0 \
    --question-end 10

Related Pages

Principle:Lm_sys_FastChat_MT_Bench_Answer_Generation