
Implementation:lm-sys FastChat Criteria Labeling



Knowledge Sources
Domains: Data_Processing, Model_Evaluation
Last Updated: 2026-02-07 06:00 GMT

Overview

Uses LLM-based evaluation to label conversation prompts with difficulty and quality criteria scores across seven predefined dimensions.

Description

Criteria Labeling is an automated annotation module that leverages large language models (via the OpenAI API) to score user prompts along seven quality and difficulty criteria. These criteria are: specificity, domain_knowledge, complexity, problem_solving, creativity, technical_accuracy, and real_world. Each criterion receives a numeric score that characterizes the nature and difficulty of the user prompt, enabling fine-grained analysis of arena conversations beyond simple win/loss outcomes.
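
For orientation, the criterion names and the shape of a resulting score record can be sketched as follows; the constant name and record layout are illustrative assumptions, not the module's internal representation:

# Illustrative sketch: the module's internal representation may differ.
CRITERIA = (
    "specificity",
    "domain_knowledge",
    "complexity",
    "problem_solving",
    "creativity",
    "technical_accuracy",
    "real_world",
)

# A labeled prompt carries one numeric score per criterion, e.g.:
# {"specificity": 4, "domain_knowledge": 3, "complexity": 2, ...}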

The module works by constructing evaluation prompts that ask a judge LLM to assess each user question against the seven criteria. The chat_completion_openai function handles the API communication, supporting configurable model endpoints, temperature, and token limits. The get_answer function orchestrates the full evaluation pipeline: it takes a question, sends it to the judge model, parses the response, and writes results to an answer file. The get_score function extracts numeric scores from the judge model's textual judgment.
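
A minimal sketch of how the three functions could compose, assuming a hypothetical build_judge_prompt helper that renders the seven-criteria instructions (the actual get_answer may differ in field names and error handling):

import json

def label_question(question, max_tokens, temperature, answer_file, api_dict):
    # Illustrative flow only, mirroring the description above.
    prompt = build_judge_prompt(question["text"])  # hypothetical helper
    judgment = chat_completion_openai(
        model=api_dict["model"],
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens,
        api_dict=api_dict,
    )
    scores = get_score(judgment)
    record = {"question_id": question["question_id"], "scores": scores}
    with open(answer_file, "a") as f:
        f.write(json.dumps(record) + "\n")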

This labeling system is used to create category-specific leaderboards and to analyze which types of questions different models excel at. By tagging conversations with difficulty and domain labels, researchers can identify model strengths and weaknesses beyond aggregate Elo ratings.

Usage

Use this module when you need to annotate a corpus of arena conversations with difficulty and topic criteria. It is typically run as a batch process over collected conversation data, producing labeled outputs that feed into category-specific leaderboard generation and analytical reports.
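
A batch run might look like the sketch below, fanning get_answer calls out over a thread pool; the concurrency level and the one-question-per-JSONL-line input format are assumptions, not documented behavior:

import json
from concurrent.futures import ThreadPoolExecutor

from fastchat.serve.monitor.criteria_labeling import get_answer

def label_corpus(questions_file, answer_file, api_dict):
    # Load one conversation prompt per JSONL line (assumed input format).
    with open(questions_file) as f:
        questions = [json.loads(line) for line in f]

    # Each get_answer call appends its own result line to answer_file.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for q in questions:
            pool.submit(
                get_answer,
                question=q,
                max_tokens=1024,
                temperature=0.0,
                answer_file=answer_file,
                api_dict=api_dict,
            )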

Code Reference

Source Location

fastchat/serve/monitor/criteria_labeling.py

Signature

def get_answer(
    question: dict,
    max_tokens: int,
    temperature: float,
    answer_file: str,
    api_dict: dict
) -> None:
    """Evaluate a question against all criteria using an LLM judge and write results to a file."""

def chat_completion_openai(
    model: str,
    messages: list,
    temperature: float,
    max_tokens: int,
    api_dict: dict
) -> str:
    """Send a chat completion request to an OpenAI-compatible API endpoint."""

def get_score(judgment: str) -> dict:
    """Parse a judge model's textual judgment to extract numeric scores for each criterion."""

Import

from fastchat.serve.monitor.criteria_labeling import get_answer

I/O Contract

Inputs

Name Type Required Description
question dict Yes A dictionary containing the user prompt and metadata to be evaluated
max_tokens int Yes Maximum number of tokens for the judge model's response
temperature float Yes Sampling temperature for the judge model
answer_file str Yes File path where evaluation results are written (JSONL format)
api_dict dict Yes API configuration including model name, base URL, and API key
model str Yes Model identifier for the judge LLM (used by chat_completion_openai)
messages list[dict] Yes Chat messages for the API request (used by chat_completion_openai)
judgment str Yes Raw text judgment from the LLM (used by get_score)

Outputs

Name Type Description
None None get_answer writes results to the answer_file as a side effect
response str chat_completion_openai returns the judge model's response text
scores dict get_score returns a dictionary mapping each of the 7 criteria to a numeric score

Criteria Definitions

Criterion Description
specificity How specific and well-defined the user's prompt is
domain_knowledge Level of domain expertise required to answer
complexity Overall complexity of the question
problem_solving Degree of problem-solving or reasoning required
creativity Amount of creative thinking needed
technical_accuracy Level of technical precision demanded
real_world Relevance to real-world applications and scenarios

Usage Examples

from fastchat.serve.monitor.criteria_labeling import get_answer, get_score

# Configure API endpoint
api_dict = {
    "model": "gpt-4",
    "api_base": "https://api.openai.com/v1",
    "api_key": "sk-..."
}

# Label a single question
question = {
    "question_id": "q001",
    "text": "Explain the difference between TCP and UDP protocols."
}

get_answer(
    question=question,
    max_tokens=1024,
    temperature=0.0,
    answer_file="criteria_results.jsonl",
    api_dict=api_dict,
)

# Parse a judgment string
judgment_text = "specificity: 4, domain_knowledge: 3, complexity: 2, ..."
scores = get_score(judgment_text)
print(scores)
# {'specificity': 4, 'domain_knowledge': 3, 'complexity': 2, ...}
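
Downstream, the labeled JSONL output can be aggregated, for instance into a mean score per criterion; the record layout read here (a "scores" mapping per line) is an assumption, since the exact file schema is not specified above:

import json
from collections import defaultdict

# Average each criterion's score over the labeled corpus.
totals, counts = defaultdict(float), defaultdict(int)
with open("criteria_results.jsonl") as f:
    for line in f:
        record = json.loads(line)
        for name, value in record["scores"].items():  # assumed field name
            totals[name] += value
            counts[name] += 1

means = {name: totals[name] / counts[name] for name in totals}
print(means)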
