Implementation:Lm_sys_FastChat_Gen_Model_Answer
| Field | Value |
|---|---|
| Page Type | Implementation |
| Title | Gen Model Answer |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source code analysis of fastchat/llm_judge/gen_model_answer.py, fastchat/llm_judge/common.py |
| Domains | LLM Evaluation, Benchmarking, Model Inference |
| Last Updated | 2026-02-07 14:00 GMT |
| Implements | Principle:Lm_sys_FastChat_MT_Bench_Answer_Generation |
Overview
This implementation provides the machinery for generating model answers to MT-Bench questions. It consists of two primary functions -- `run_eval` (the orchestrator) and `get_model_answers` (the inference worker) -- in gen_model_answer.py, along with the `load_questions` utility from common.py. Together they load a model, iterate through benchmark questions, generate multi-turn responses with category-appropriate temperature settings, and write structured JSONL output.
Description
The answer generation pipeline proceeds as follows:
- `run_eval` loads the question set via `load_questions`, shuffles them for load balancing, and dispatches work to one or more GPU workers.
- If multiple workers are needed (`num_gpus_total // num_gpus_per_model > 1`), Ray is used to distribute chunks of questions across remote workers.
- Each worker calls `get_model_answers`, which loads the model onto its GPU(s), iterates through its assigned questions, and generates responses for both conversation turns.
- Per-category temperature is looked up from `temperature_config` in `common.py`. If the category is not in the config, the default temperature of 0.7 is used.
- For each question and each choice (controlled by `num_choices`), the function constructs a conversation using the model's template, generates output token by token, handles stop tokens and stop strings, strips special tokens, and records the output.
- Answers are appended to the output JSONL file. After all workers finish, `reorg_answer_file` deduplicates entries by `question_id` and sorts them.
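The per-category temperature lookup described above can be sketched as follows. The category-to-temperature mapping mirrors the MT-Bench categories defined in `fastchat/llm_judge/common.py`; `pick_temperature` is a hypothetical helper added here only to illustrate the fallback behavior:

```python
# Sketch of the per-category sampling temperature lookup. The mapping below
# follows the MT-Bench categories; the authoritative version lives in
# fastchat/llm_judge/common.py as temperature_config.
temperature_config = {
    "writing": 0.7,
    "roleplay": 0.7,
    "extraction": 0.0,
    "math": 0.0,
    "coding": 0.0,
    "reasoning": 0.0,
    "stem": 0.1,
    "humanities": 0.1,
}

def pick_temperature(category: str) -> float:
    """Return the configured temperature, defaulting to 0.7 for unknown categories."""
    return temperature_config.get(category, 0.7)
```

Deterministic categories (math, coding, reasoning, extraction) use temperature 0.0, while open-ended ones (writing, roleplay) sample at 0.7.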
Usage
Command-Line Interface
```bash
python3 -m fastchat.llm_judge.gen_model_answer \
    --model-path lmsys/vicuna-7b-v1.5 \
    --model-id vicuna-7b-v1.5 \
    --bench-name mt_bench \
    --max-new-token 1024 \
    --num-choices 1 \
    --num-gpus-per-model 1 \
    --num-gpus-total 1 \
    --dtype float16
```
CLI Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--model-path` | str | (required) | Path to model weights (local folder or Hugging Face repo ID) |
| `--model-id` | str | (required) | Custom name for the model (used in output filenames) |
| `--bench-name` | str | `"mt_bench"` | Name of the benchmark question set |
| `--question-begin` | int | None | Debug option: begin index of questions |
| `--question-end` | int | None | Debug option: end index of questions |
| `--answer-file` | str | None | Custom output answer file path (auto-generated if not set) |
| `--max-new-token` | int | 1024 | Maximum number of new generated tokens |
| `--num-choices` | int | 1 | Number of completion choices to generate per question |
| `--num-gpus-per-model` | int | 1 | Number of GPUs per model instance |
| `--num-gpus-total` | int | 1 | Total number of GPUs available |
| `--max-gpu-memory` | str | None | Maximum GPU memory used for model weights per GPU |
| `--dtype` | str | None | Override default dtype (choices: float32, float16, bfloat16) |
| `--revision` | str | `"main"` | Model revision to load |
Programmatic Usage
```python
from fastchat.llm_judge.gen_model_answer import run_eval
from fastchat.llm_judge.common import load_questions

# Load questions
questions = load_questions("data/mt_bench/question.jsonl", begin=None, end=None)
print(f"Loaded {len(questions)} questions")

# Run evaluation
run_eval(
    model_path="lmsys/vicuna-7b-v1.5",
    model_id="vicuna-7b-v1.5",
    question_file="data/mt_bench/question.jsonl",
    question_begin=None,
    question_end=None,
    answer_file="data/mt_bench/model_answer/vicuna-7b-v1.5.jsonl",
    max_new_token=1024,
    num_choices=1,
    num_gpus_per_model=1,
    num_gpus_total=1,
    max_gpu_memory=None,
    dtype=None,
    revision="main",
)
```
Code Reference
Source Location
| Function | File | Lines |
|---|---|---|
| `run_eval` | fastchat/llm_judge/gen_model_answer.py | L21-71 |
| `get_model_answers` | fastchat/llm_judge/gen_model_answer.py | L73-190 |
| `reorg_answer_file` | fastchat/llm_judge/gen_model_answer.py | L193-204 |
| `load_questions` | fastchat/llm_judge/common.py | L88-96 |
| `temperature_config` | fastchat/llm_judge/common.py | L40-50 |
Signature
```python
def run_eval(
    model_path,
    model_id,
    question_file,
    question_begin,
    question_end,
    answer_file,
    max_new_token,
    num_choices,
    num_gpus_per_model,
    num_gpus_total,
    max_gpu_memory,
    dtype,
    revision,
):
    ...

@torch.inference_mode()
def get_model_answers(
    model_path,
    model_id,
    questions,
    answer_file,
    max_new_token,
    num_choices,
    num_gpus_per_model,
    max_gpu_memory,
    dtype,
    revision,
):
    ...

def load_questions(question_file: str, begin: Optional[int], end: Optional[int]) -> list[dict]:
    ...
```
Import
```python
from fastchat.llm_judge.gen_model_answer import run_eval
from fastchat.llm_judge.common import load_questions, temperature_config
```
I/O Contract
Inputs
| Input | Format | Description |
|---|---|---|
| Question file | JSONL (`data/mt_bench/question.jsonl`) | Each line is a JSON object with fields: `question_id` (int), `category` (str), `turns` (list of 2 strings) |
| Model weights | Hugging Face model directory or repo ID | The pre-trained model to evaluate (e.g., `lmsys/vicuna-7b-v1.5`) |
Example input record:
```json
{
    "question_id": 81,
    "category": "writing",
    "turns": [
        "Compose a captivating travel blog post about a recent trip to Hawaii...",
        "Rewrite your previous response. Start every sentence with the letter A."
    ]
}
```
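The `load_questions` utility reads this file line by line and supports the `--question-begin` / `--question-end` debug slicing. A minimal sketch of that behavior (the authoritative version lives in `fastchat/llm_judge/common.py`):

```python
import json
from typing import Optional

def load_questions(question_file: str, begin: Optional[int], end: Optional[int]) -> list:
    """Load questions from a JSONL file, optionally sliced to [begin, end)."""
    questions = []
    with open(question_file) as f:
        for line in f:
            if line.strip():  # skip blank lines
                questions.append(json.loads(line))
    # Passing None for begin/end keeps the full list, matching the CLI defaults.
    return questions[begin:end]
```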
Outputs
| Output | Format | Description |
|---|---|---|
| Answer file | JSONL (`data/mt_bench/model_answer/{model_id}.jsonl`) | Each line is a JSON object with the model's response |
Output record fields:
| Field | Type | Description |
|---|---|---|
| `question_id` | int | The question identifier, matching the input question |
| `answer_id` | str | A unique short UUID generated by `shortuuid.uuid()` |
| `model_id` | str | The model identifier string |
| `choices` | list[dict] | List of choices, each with `index` (int) and `turns` (list of response strings) |
| `tstamp` | float | Unix timestamp of when the answer was generated |
Example output record:
```json
{
    "question_id": 81,
    "answer_id": "AbCdEfGhIjKlMnOpQrStUv",
    "model_id": "vicuna-7b-v1.5",
    "choices": [
        {
            "index": 0,
            "turns": [
                "Hawaii is a paradise on Earth...",
                "A breathtaking archipelago awaits..."
            ]
        }
    ],
    "tstamp": 1707307200.123
}
```
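Because answers are appended as workers finish, the final `reorg_answer_file` pass deduplicates and sorts the file. A sketch of that step, assuming last-write-wins deduplication by `question_id` (the real implementation is in `fastchat/llm_judge/gen_model_answer.py`):

```python
import json

def reorg_answer_file(answer_file: str) -> None:
    """Deduplicate answers by question_id (keeping the last) and sort the file."""
    answers = {}
    with open(answer_file) as f:
        for line in f:
            if line.strip():
                qid = json.loads(line)["question_id"]
                answers[qid] = line  # later entries overwrite earlier duplicates
    with open(answer_file, "w") as f:
        for qid in sorted(answers):
            f.write(answers[qid])
```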
Usage Examples
Basic Single-GPU Evaluation
```bash
python3 -m fastchat.llm_judge.gen_model_answer \
    --model-path lmsys/vicuna-7b-v1.5 \
    --model-id vicuna-7b-v1.5
```
Multi-GPU Parallel Evaluation
```bash
python3 -m fastchat.llm_judge.gen_model_answer \
    --model-path lmsys/vicuna-13b-v1.5 \
    --model-id vicuna-13b-v1.5 \
    --num-gpus-per-model 2 \
    --num-gpus-total 8 \
    --dtype bfloat16
```
This distributes the workload across 4 workers (8 total GPUs / 2 per model), each running an independent model instance on 2 GPUs.
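The worker split can be sketched as follows; `chunk_questions` is a hypothetical helper written for illustration, not a function in FastChat:

```python
import math

def chunk_questions(questions: list, num_gpus_total: int, num_gpus_per_model: int) -> list:
    """Split questions into one contiguous chunk per worker, as run_eval does
    before handing chunks to Ray remote workers."""
    num_workers = num_gpus_total // num_gpus_per_model
    chunk_size = math.ceil(len(questions) / num_workers)
    return [questions[i:i + chunk_size] for i in range(0, len(questions), chunk_size)]
```

With 8 total GPUs and 2 per model this yields 4 chunks, one per independent model instance; shuffling the questions first (as `run_eval` does) keeps per-worker runtimes roughly balanced.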
Evaluating a Subset of Questions
```bash
python3 -m fastchat.llm_judge.gen_model_answer \
    --model-path lmsys/vicuna-7b-v1.5 \
    --model-id vicuna-7b-v1.5 \
    --question-begin 0 \
    --question-end 10
```
Related Pages
- Principle:Lm_sys_FastChat_MT_Bench_Answer_Generation -- The principle this implementation realizes
- Implementation:Lm_sys_FastChat_Gen_Judgment -- The next step in the pipeline: judging the generated answers
- Implementation:Lm_sys_FastChat_Show_Result -- Displaying evaluation results