Implementation:Lm_sys_FastChat_Gen_Model_Answer
| Field | Value |
|---|---|
| Page Type | Implementation |
| Title | Gen Model Answer |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source code analysis of fastchat/llm_judge/gen_model_answer.py, fastchat/llm_judge/common.py |
| Domains | LLM Evaluation, Benchmarking, Model Inference |
| Last Updated | 2026-02-07 14:00 GMT |
| Implements | Principle:Lm_sys_FastChat_MT_Bench_Answer_Generation |
Overview
This implementation provides the machinery for generating model answers to MT-Bench questions. It consists of two primary functions -- `run_eval` (the orchestrator) and `get_model_answers` (the inference worker) -- in gen_model_answer.py, along with the `load_questions` utility from common.py. Together they load a model, iterate through benchmark questions, generate multi-turn responses with category-appropriate temperature settings, and write structured JSONL output.
Description
The answer generation pipeline proceeds as follows:
- `run_eval` loads the question set via `load_questions`, shuffles them for load balancing, and dispatches work to one or more GPU workers.
- If multiple workers are needed (`num_gpus_total // num_gpus_per_model > 1`), Ray is used to distribute chunks of questions across remote workers.
- Each worker calls `get_model_answers`, which loads the model onto its GPU(s), iterates through its assigned questions, and generates responses for both conversation turns.
- Per-category temperature is looked up from `temperature_config` in `common.py`. If the category is not in the config, the default temperature of 0.7 is used.
- For each question and each choice (controlled by `num_choices`), the function constructs a conversation using the model's template, generates output token by token, handles stop tokens and stop strings, strips special tokens, and records the output.
- Answers are appended to the output JSONL file. After all workers finish, `reorg_answer_file` deduplicates entries by `question_id` and sorts them.
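The per-category temperature lookup described above can be sketched as follows. The category-to-temperature mapping mirrors the MT-Bench categories defined in `fastchat/llm_judge/common.py`; `pick_temperature` is a hypothetical helper added here only to illustrate the fallback behavior:

```python
# Sketch of the per-category sampling temperature lookup. The mapping below
# follows the MT-Bench categories; the authoritative version lives in
# fastchat/llm_judge/common.py as temperature_config.
temperature_config = {
    "writing": 0.7,
    "roleplay": 0.7,
    "extraction": 0.0,
    "math": 0.0,
    "coding": 0.0,
    "reasoning": 0.0,
    "stem": 0.1,
    "humanities": 0.1,
}

def pick_temperature(category: str) -> float:
    """Return the configured temperature, defaulting to 0.7 for unknown categories."""
    return temperature_config.get(category, 0.7)
```

Deterministic categories (math, coding, reasoning, extraction) use temperature 0.0, while open-ended ones (writing, roleplay) sample at 0.7.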
Usage
Command-Line Interface
```bash
python3 -m fastchat.llm_judge.gen_model_answer \
    --model-path lmsys/vicuna-7b-v1.5 \
    --model-id vicuna-7b-v1.5 \
    --bench-name mt_bench \
    --max-new-token 1024 \
    --num-choices 1 \
    --num-gpus-per-model 1 \
    --num-gpus-total 1 \
    --dtype float16
```
CLI Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--model-path` | str | (required) | Path to model weights (local folder or Hugging Face repo ID) |
| `--model-id` | str | (required) | Custom name for the model (used in output filenames) |
| `--bench-name` | str | `"mt_bench"` | Name of the benchmark question set |
| `--question-begin` | int | None | Debug option: begin index of questions |
| `--question-end` | int | None | Debug option: end index of questions |
| `--answer-file` | str | None | Custom output answer file path (auto-generated if not set) |
| `--max-new-token` | int | 1024 | Maximum number of new generated tokens |
| `--num-choices` | int | 1 | Number of completion choices to generate per question |
| `--num-gpus-per-model` | int | 1 | Number of GPUs per model instance |
| `--num-gpus-total` | int | 1 | Total number of GPUs available |
| `--max-gpu-memory` | str | None | Maximum GPU memory used for model weights per GPU |
| `--dtype` | str | None | Override default dtype (choices: float32, float16, bfloat16) |
| `--revision` | str | `"main"` | Model revision to load |
Programmatic Usage
```python
from fastchat.llm_judge.gen_model_answer import run_eval
from fastchat.llm_judge.common import load_questions

# Load questions
questions = load_questions("data/mt_bench/question.jsonl", begin=None, end=None)
print(f"Loaded {len(questions)} questions")

# Run evaluation
run_eval(
    model_path="lmsys/vicuna-7b-v1.5",
    model_id="vicuna-7b-v1.5",
    question_file="data/mt_bench/question.jsonl",
    question_begin=None,
    question_end=None,
    answer_file="data/mt_bench/model_answer/vicuna-7b-v1.5.jsonl",
    max_new_token=1024,
    num_choices=1,
    num_gpus_per_model=1,
    num_gpus_total=1,
    max_gpu_memory=None,
    dtype=None,
    revision="main",
)
```
Code Reference
Source Location
| Function | File | Lines |
|---|---|---|
| `run_eval` | fastchat/llm_judge/gen_model_answer.py | L21-71 |
| `get_model_answers` | fastchat/llm_judge/gen_model_answer.py | L73-190 |
| `reorg_answer_file` | fastchat/llm_judge/gen_model_answer.py | L193-204 |
| `load_questions` | fastchat/llm_judge/common.py | L88-96 |
| `temperature_config` | fastchat/llm_judge/common.py | L40-50 |
Signature
```python
def run_eval(
    model_path,
    model_id,
    question_file,
    question_begin,
    question_end,
    answer_file,
    max_new_token,
    num_choices,
    num_gpus_per_model,
    num_gpus_total,
    max_gpu_memory,
    dtype,
    revision,
):
    ...

@torch.inference_mode()
def get_model_answers(
    model_path,
    model_id,
    questions,
    answer_file,
    max_new_token,
    num_choices,
    num_gpus_per_model,
    max_gpu_memory,
    dtype,
    revision,
):
    ...

def load_questions(question_file: str, begin: Optional[int], end: Optional[int]) -> list[dict]:
    ...
```
Import
```python
from fastchat.llm_judge.gen_model_answer import run_eval
from fastchat.llm_judge.common import load_questions, temperature_config
```
I/O Contract
Inputs
| Input | Format | Description |
|---|---|---|
| Question file | JSONL (`data/mt_bench/question.jsonl`) | Each line is a JSON object with fields: `question_id` (int), `category` (str), `turns` (list of 2 strings) |
| Model weights | Hugging Face model directory or repo ID | The pre-trained model to evaluate (e.g., `lmsys/vicuna-7b-v1.5`) |
Example input record:
```json
{
    "question_id": 81,
    "category": "writing",
    "turns": [
        "Compose a captivating travel blog post about a recent trip to Hawaii...",
        "Rewrite your previous response. Start every sentence with the letter A."
    ]
}
```
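The `load_questions` utility reads this file line by line and supports the `--question-begin` / `--question-end` debug slicing. A minimal sketch of that behavior (the authoritative version lives in `fastchat/llm_judge/common.py`):

```python
import json
from typing import Optional

def load_questions(question_file: str, begin: Optional[int], end: Optional[int]) -> list:
    """Load questions from a JSONL file, optionally sliced to [begin, end)."""
    questions = []
    with open(question_file) as f:
        for line in f:
            if line.strip():  # skip blank lines
                questions.append(json.loads(line))
    # Passing None for begin/end keeps the full list, matching the CLI defaults.
    return questions[begin:end]
```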
Outputs
| Output | Format | Description |
|---|---|---|
| Answer file | JSONL (`data/mt_bench/model_answer/{model_id}.jsonl`) | Each line is a JSON object with the model's response |
Output record fields:
| Field | Type | Description |
|---|---|---|
| `question_id` | int | The question identifier, matching the input question |
| `answer_id` | str | A unique short UUID generated by `shortuuid.uuid()` |
| `model_id` | str | The model identifier string |
| `choices` | list[dict] | List of choices, each with `index` (int) and `turns` (list of response strings) |
| `tstamp` | float | Unix timestamp of when the answer was generated |
Example output record:
```json
{
    "question_id": 81,
    "answer_id": "AbCdEfGhIjKlMnOpQrStUv",
    "model_id": "vicuna-7b-v1.5",
    "choices": [
        {
            "index": 0,
            "turns": [
                "Hawaii is a paradise on Earth...",
                "A breathtaking archipelago awaits..."
            ]
        }
    ],
    "tstamp": 1707307200.123
}
```
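Because answers are appended as workers finish, the final `reorg_answer_file` pass deduplicates and sorts the file. A sketch of that step, assuming last-write-wins deduplication by `question_id` (the real implementation is in `fastchat/llm_judge/gen_model_answer.py`):

```python
import json

def reorg_answer_file(answer_file: str) -> None:
    """Deduplicate answers by question_id (keeping the last) and sort the file."""
    answers = {}
    with open(answer_file) as f:
        for line in f:
            if line.strip():
                qid = json.loads(line)["question_id"]
                answers[qid] = line  # later entries overwrite earlier duplicates
    with open(answer_file, "w") as f:
        for qid in sorted(answers):
            f.write(answers[qid])
```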
Usage Examples
Basic Single-GPU Evaluation
```bash
python3 -m fastchat.llm_judge.gen_model_answer \
    --model-path lmsys/vicuna-7b-v1.5 \
    --model-id vicuna-7b-v1.5
```
Multi-GPU Parallel Evaluation
```bash
python3 -m fastchat.llm_judge.gen_model_answer \
    --model-path lmsys/vicuna-13b-v1.5 \
    --model-id vicuna-13b-v1.5 \
    --num-gpus-per-model 2 \
    --num-gpus-total 8 \
    --dtype bfloat16
```
This distributes the workload across 4 workers (8 total GPUs / 2 per model), each running an independent model instance on 2 GPUs.
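The worker split can be sketched as follows; `chunk_questions` is a hypothetical helper written for illustration, not a function in FastChat:

```python
import math

def chunk_questions(questions: list, num_gpus_total: int, num_gpus_per_model: int) -> list:
    """Split questions into one contiguous chunk per worker, as run_eval does
    before handing chunks to Ray remote workers."""
    num_workers = num_gpus_total // num_gpus_per_model
    chunk_size = math.ceil(len(questions) / num_workers)
    return [questions[i:i + chunk_size] for i in range(0, len(questions), chunk_size)]
```

With 8 total GPUs and 2 per model this yields 4 chunks, one per independent model instance; shuffling the questions first (as `run_eval` does) keeps per-worker runtimes roughly balanced.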
Evaluating a Subset of Questions
```bash
python3 -m fastchat.llm_judge.gen_model_answer \
    --model-path lmsys/vicuna-7b-v1.5 \
    --model-id vicuna-7b-v1.5 \
    --question-begin 0 \
    --question-end 10
```
Related Pages
- Principle:Lm_sys_FastChat_MT_Bench_Answer_Generation -- The principle this implementation realizes
- Implementation:Lm_sys_FastChat_Gen_Judgment -- The next step in the pipeline: judging the generated answers
- Implementation:Lm_sys_FastChat_Show_Result -- Displaying evaluation results