
Implementation:lm-sys FastChat Criteria Labeling



Knowledge Sources
Domains: Data_Processing, Model_Evaluation
Last Updated: 2026-02-07 06:00 GMT

Overview

Uses LLM-based evaluation to label conversation prompts with difficulty and quality criteria scores across seven predefined dimensions.

Description

Criteria Labeling is an automated annotation module that leverages large language models (via the OpenAI API) to score user prompts along seven quality and difficulty criteria. These criteria are: specificity, domain_knowledge, complexity, problem_solving, creativity, technical_accuracy, and real_world. Each criterion receives a numeric score that characterizes the nature and difficulty of the user prompt, enabling fine-grained analysis of arena conversations beyond simple win/loss outcomes.
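
For orientation, the criterion names and the shape of a resulting score record can be sketched as follows; the constant name and record layout are illustrative assumptions, not the module's internal representation:

# Illustrative sketch: the module's internal representation may differ.
CRITERIA = (
    "specificity",
    "domain_knowledge",
    "complexity",
    "problem_solving",
    "creativity",
    "technical_accuracy",
    "real_world",
)

# A labeled prompt carries one numeric score per criterion, e.g.:
# {"specificity": 4, "domain_knowledge": 3, "complexity": 2, ...}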

The module works by constructing evaluation prompts that ask a judge LLM to assess each user question against the seven criteria. The chat_completion_openai function handles the API communication, supporting configurable model endpoints, temperature, and token limits. The get_answer function orchestrates the full evaluation pipeline: it takes a question, sends it to the judge model, parses the response, and writes results to an answer file. The get_score function extracts numeric scores from the judge model's textual judgment.
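
A minimal sketch of how the three functions could compose, assuming a hypothetical build_judge_prompt helper that renders the seven-criteria instructions (the actual get_answer may differ in field names and error handling):

import json

def label_question(question, max_tokens, temperature, answer_file, api_dict):
    # Illustrative flow only, mirroring the description above.
    prompt = build_judge_prompt(question["text"])  # hypothetical helper
    judgment = chat_completion_openai(
        model=api_dict["model"],
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens,
        api_dict=api_dict,
    )
    scores = get_score(judgment)
    record = {"question_id": question["question_id"], "scores": scores}
    with open(answer_file, "a") as f:
        f.write(json.dumps(record) + "\n")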

This labeling system is used to create category-specific leaderboards and to analyze which types of questions different models excel at. By tagging conversations with difficulty and domain labels, researchers can identify model strengths and weaknesses beyond aggregate Elo ratings.

Usage

Use this module when you need to annotate a corpus of arena conversations with difficulty and topic criteria. It is typically run as a batch process over collected conversation data, producing labeled outputs that feed into category-specific leaderboard generation and analytical reports.
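
A batch run might look like the sketch below, fanning get_answer calls out over a thread pool; the concurrency level and the one-question-per-JSONL-line input format are assumptions, not documented behavior:

import json
from concurrent.futures import ThreadPoolExecutor

from fastchat.serve.monitor.criteria_labeling import get_answer

def label_corpus(questions_file, answer_file, api_dict):
    # Load one conversation prompt per JSONL line (assumed input format).
    with open(questions_file) as f:
        questions = [json.loads(line) for line in f]

    # Each get_answer call appends its own result line to answer_file.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for q in questions:
            pool.submit(
                get_answer,
                question=q,
                max_tokens=1024,
                temperature=0.0,
                answer_file=answer_file,
                api_dict=api_dict,
            )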

Code Reference

Source Location

fastchat/serve/monitor/criteria_labeling.py

Signature

def get_answer(
    question: dict,
    max_tokens: int,
    temperature: float,
    answer_file: str,
    api_dict: dict
) -> None:
    """Evaluate a question against all criteria using an LLM judge and write results to a file."""

def chat_completion_openai(
    model: str,
    messages: list,
    temperature: float,
    max_tokens: int,
    api_dict: dict
) -> str:
    """Send a chat completion request to an OpenAI-compatible API endpoint."""

def get_score(judgment: str) -> dict:
    """Parse a judge model's textual judgment to extract numeric scores for each criterion."""

Import

from fastchat.serve.monitor.criteria_labeling import get_answer

I/O Contract

Inputs

Name Type Required Description
question dict Yes A dictionary containing the user prompt and metadata to be evaluated
max_tokens int Yes Maximum number of tokens for the judge model's response
temperature float Yes Sampling temperature for the judge model
answer_file str Yes File path where evaluation results are written (JSONL format)
api_dict dict Yes API configuration including model name, base URL, and API key
model str Yes Model identifier for the judge LLM (used by chat_completion_openai)
messages list[dict] Yes Chat messages for the API request (used by chat_completion_openai)
judgment str Yes Raw text judgment from the LLM (used by get_score)

Outputs

Name Type Description
None None get_answer writes results to the answer_file as a side effect
response str chat_completion_openai returns the judge model's response text
scores dict get_score returns a dictionary mapping each of the 7 criteria to a numeric score

Criteria Definitions

Criterion Description
specificity How specific and well-defined the user's prompt is
domain_knowledge Level of domain expertise required to answer
complexity Overall complexity of the question
problem_solving Degree of problem-solving or reasoning required
creativity Amount of creative thinking needed
technical_accuracy Level of technical precision demanded
real_world Relevance to real-world applications and scenarios

Usage Examples

from fastchat.serve.monitor.criteria_labeling import get_answer, get_score

# Configure API endpoint
api_dict = {
    "model": "gpt-4",
    "api_base": "https://api.openai.com/v1",
    "api_key": "sk-..."
}

# Label a single question
question = {
    "question_id": "q001",
    "text": "Explain the difference between TCP and UDP protocols."
}

get_answer(
    question=question,
    max_tokens=1024,
    temperature=0.0,
    answer_file="criteria_results.jsonl",
    api_dict=api_dict,
)

# Parse a judgment string
judgment_text = "specificity: 4, domain_knowledge: 3, complexity: 2, ..."
scores = get_score(judgment_text)
print(scores)
# {'specificity': 4, 'domain_knowledge': 3, 'complexity': 2, ...}
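
Downstream, the labeled JSONL output can be aggregated, for instance into a mean score per criterion; the record layout read here (a "scores" mapping per line) is an assumption, since the exact file schema is not specified above:

import json
from collections import defaultdict

# Average each criterion's score over the labeled corpus.
totals, counts = defaultdict(float), defaultdict(int)
with open("criteria_results.jsonl") as f:
    for line in f:
        record = json.loads(line)
        for name, value in record["scores"].items():  # assumed field name
            totals[name] += value
            counts[name] += 1

means = {name: totals[name] / counts[name] for name in totals}
print(means)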
