Implementation:Lm_sys_FastChat_Criteria_Labeling
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Model_Evaluation |
| Last Updated | 2026-02-07 06:00 GMT |
Overview
Uses LLM-based evaluation to label conversation prompts with difficulty and quality criteria scores across seven predefined dimensions.
Description
Criteria Labeling is an automated annotation module that leverages large language models (via the OpenAI API) to score user prompts along seven quality and difficulty criteria. These criteria are: specificity, domain_knowledge, complexity, problem_solving, creativity, technical_accuracy, and real_world. Each criterion receives a numeric score that characterizes the nature and difficulty of the user prompt, enabling fine-grained analysis of arena conversations beyond simple win/loss outcomes.
The module works by constructing evaluation prompts that ask a judge LLM to assess each user question against the seven criteria. The chat_completion_openai function handles the API communication, supporting configurable model endpoints, temperature, and token limits. The get_answer function orchestrates the full evaluation pipeline: it takes a question, sends it to the judge model, parses the response, and writes results to an answer file. The get_score function extracts numeric scores from the judge model's textual judgment.
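As a rough sketch of the request-building step, the payload that chat_completion_openai sends follows the standard OpenAI chat-completions format. The helper name and the system message below are illustrative assumptions, not the module's exact code:

```python
def build_chat_request(model, user_prompt, temperature, max_tokens):
    """Assemble an OpenAI-style chat-completions payload, as
    chat_completion_openai would. Field names follow the public
    OpenAI API; the module's internals may differ in detail."""
    return {
        "model": model,
        "messages": [
            # A system message framing the judge task (wording assumed)
            {"role": "system", "content": "You are a helpful assistant that rates prompts."},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

payload = build_chat_request("gpt-4", "Rate this prompt ...", 0.0, 1024)
```

The configurable model endpoint, temperature, and token limit mentioned above map directly onto the `model`, `temperature`, and `max_tokens` fields of this payload.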
This labeling system is used to create category-specific leaderboards and to analyze which types of questions different models excel at. By tagging conversations with difficulty and domain labels, researchers can identify model strengths and weaknesses beyond aggregate Elo ratings.
Usage
Use this module when you need to annotate a corpus of arena conversations with difficulty and topic criteria. It is typically run as a batch process over collected conversation data, producing labeled outputs that feed into category-specific leaderboard generation and analytical reports.
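A minimal batch driver could look like the following. The function name, the thread-pool approach, and the record shape are illustrative assumptions; `label_fn` stands in for a get_answer-style call:

```python
import json
from concurrent.futures import ThreadPoolExecutor

def label_corpus(questions, label_fn, answer_file, workers=4):
    """Fan labeling calls out over a small thread pool and append one
    JSON line per result. `label_fn` is any callable mapping a question
    dict to a result dict (hypothetical stand-in for get_answer)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so output lines align with input
        results = list(pool.map(label_fn, questions))
    with open(answer_file, "a") as f:
        for record in results:
            f.write(json.dumps(record) + "\n")
    return results
```

Running the judge calls concurrently matters in practice because each label requires a round trip to a remote LLM API.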
Code Reference
Source Location
- Repository: Lm_sys_FastChat
- File: fastchat/serve/monitor/criteria_labeling.py
- Lines: 1-214
Signature
```python
def get_answer(
    question: dict,
    max_tokens: int,
    temperature: float,
    answer_file: str,
    api_dict: dict,
) -> None:
    """Evaluate a question against all criteria using an LLM judge and write results to a file."""

def chat_completion_openai(
    model: str,
    messages: list,
    temperature: float,
    max_tokens: int,
    api_dict: dict,
) -> str:
    """Send a chat completion request to an OpenAI-compatible API endpoint."""

def get_score(judgment: str) -> dict:
    """Parse a judge model's textual judgment to extract numeric scores for each criterion."""
```
Import
```python
from fastchat.serve.monitor.criteria_labeling import get_answer
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| question | dict | Yes | A dictionary containing the user prompt and metadata to be evaluated |
| max_tokens | int | Yes | Maximum number of tokens for the judge model's response |
| temperature | float | Yes | Sampling temperature for the judge model |
| answer_file | str | Yes | File path where evaluation results are written (JSONL format) |
| api_dict | dict | Yes | API configuration including model name, base URL, and API key |
| model | str | Yes | Model identifier for the judge LLM (used by chat_completion_openai) |
| messages | list[dict] | Yes | Chat messages for the API request (used by chat_completion_openai) |
| judgment | str | Yes | Raw text judgment from the LLM (used by get_score) |
Outputs
| Name | Type | Description |
|---|---|---|
| None | None | get_answer writes results to the answer_file as a side effect |
| response | str | chat_completion_openai returns the judge model's response text |
| scores | dict | get_score returns a dictionary mapping each of the 7 criteria to a numeric score |
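The parsing step behind get_score can be approximated with a regular expression over "criterion: N" pairs. This is a hedged reimplementation sketch under that assumed judgment format, not the module's actual logic:

```python
import re

CRITERIA = [
    "specificity", "domain_knowledge", "complexity", "problem_solving",
    "creativity", "technical_accuracy", "real_world",
]

def parse_scores(judgment: str) -> dict:
    """Extract "criterion: N" pairs from free-form judgment text.
    Criteria missing from the text are simply omitted (the real
    get_score may handle malformed judgments differently)."""
    scores = {}
    for name in CRITERIA:
        match = re.search(rf"{name}\s*:\s*(\d+)", judgment)
        if match:
            scores[name] = int(match.group(1))
    return scores
```

Returning a partial dict on malformed input lets the caller decide whether to retry the judge call or drop the record.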
Criteria Definitions
| Criterion | Description |
|---|---|
| specificity | How specific and well-defined the user's prompt is |
| domain_knowledge | Level of domain expertise required to answer |
| complexity | Overall complexity of the question |
| problem_solving | Degree of problem-solving or reasoning required |
| creativity | Amount of creative thinking needed |
| technical_accuracy | Level of technical precision demanded |
| real_world | Relevance to real-world applications and scenarios |
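Using the table above, a judge prompt can be assembled by enumerating the criteria with their descriptions. The wording and function name below are illustrative; the module's actual prompt template may differ:

```python
# Criterion descriptions taken from the definitions table above
CRITERIA_DESCRIPTIONS = {
    "specificity": "How specific and well-defined the user's prompt is",
    "domain_knowledge": "Level of domain expertise required to answer",
    "complexity": "Overall complexity of the question",
    "problem_solving": "Degree of problem-solving or reasoning required",
    "creativity": "Amount of creative thinking needed",
    "technical_accuracy": "Level of technical precision demanded",
    "real_world": "Relevance to real-world applications and scenarios",
}

def build_judge_prompt(user_prompt: str) -> str:
    """Enumerate the seven criteria and ask the judge for one score each,
    in a 'criterion: score' format that is easy to parse afterwards."""
    lines = [f"- {name}: {desc}" for name, desc in CRITERIA_DESCRIPTIONS.items()]
    return (
        "Rate the following user prompt on each criterion below, "
        "answering in the form 'criterion: score'.\n"
        + "\n".join(lines)
        + "\n\nUser prompt:\n" + user_prompt
    )
```

Constraining the answer format in the prompt is what makes the downstream score extraction tractable.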
Usage Examples
```python
from fastchat.serve.monitor.criteria_labeling import get_answer, get_score

# Configure API endpoint
api_dict = {
    "model": "gpt-4",
    "api_base": "https://api.openai.com/v1",
    "api_key": "sk-...",
}

# Label a single question
question = {
    "question_id": "q001",
    "text": "Explain the difference between TCP and UDP protocols.",
}
get_answer(
    question=question,
    max_tokens=1024,
    temperature=0.0,
    answer_file="criteria_results.jsonl",
    api_dict=api_dict,
)

# Parse a judgment string
judgment_text = "specificity: 4, domain_knowledge: 3, complexity: 2, ..."
scores = get_score(judgment_text)
print(scores)
# {'specificity': 4, 'domain_knowledge': 3, 'complexity': 2, ...}
```
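Downstream analysis, such as the category-specific leaderboards mentioned above, starts by reading the JSONL output back. A minimal aggregation sketch, assuming each record carries a "scores" mapping keyed by criterion (the module's real output schema may use different field names):

```python
import json
from collections import defaultdict

def mean_scores(answer_file: str) -> dict:
    """Average each criterion's score across all labeled records.
    Assumes one JSON object per line with a "scores" dict; records
    missing a criterion simply don't contribute to its mean."""
    totals, counts = defaultdict(float), defaultdict(int)
    with open(answer_file) as f:
        for line in f:
            record = json.loads(line)
            for name, value in record.get("scores", {}).items():
                totals[name] += value
                counts[name] += 1
    return {name: totals[name] / counts[name] for name in totals}
```

Per-criterion means like these are one way to compare how demanding different slices of the arena corpus are before drilling into per-model results.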
Related Pages
- Implements: Principle:Lm_sys_FastChat_LLM_Prompt_Classification
- Lm_sys_FastChat_Topic_Clustering - Clustering prompts by topic for category analysis
- Lm_sys_FastChat_Summarize_Cluster - Summarizing conversation clusters with LLM
- Lm_sys_FastChat_Monitor_Markdown - Category-specific leaderboard rendering