Principle:Lm sys FastChat Conversation Content Filtering

From Leeroopedia


Field: Value
Page Type: Principle
Title: Conversation Content Filtering
Repository: lm-sys/FastChat
Workflow: Dataset_Release
Domains: Data_Processing, Content_Moderation
Knowledge Sources: fastchat/serve/monitor/dataset_release_scripts/arena_33k/filter_bad_conv.py, fastchat/serve/monitor/dataset_release_scripts/lmsys_chat_1m/filter_bad_conv.py
Last Updated: 2026-02-07 14:00 GMT

Overview

This principle defines the methodology for detecting and filtering problematic conversations from Arena datasets based on content analysis. Before dataset release, conversations must be screened for characteristics that would bias model evaluation or violate content policies. The filtering pipeline combines rule-based heuristics (language detection, length thresholds, pattern matching) with API-based content classification (toxicity detection via moderation endpoints) to systematically identify and remove low-quality, adversarial, or otherwise unsuitable entries.

Description

Language Detection

Arena datasets intended for English-focused analysis must exclude non-English conversations. Language detection is performed on the concatenated text of each conversation using libraries such as langdetect or fastText. A conversation is flagged for removal when its detected language is not English, or when the English prediction falls below a confidence threshold. For multi-turn conversations in which the language switches partway through, the dominant language across all turns decides the outcome.
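The dominant-language rule can be sketched as follows. This is a minimal illustration, not code from the FastChat scripts: the detector is passed in as a callable returning a language code and a confidence, and `toy_detect` is a deliberately crude stand-in (an ASCII-letter ratio) for a real detector such as langdetect or fastText. The names `dominant_language`, `is_english`, and `min_share`, and the 0.8 default, are assumptions made for the example.

```python
from collections import Counter
from typing import Callable, List, Tuple

# A detector maps text to (language_code, confidence in [0, 1]).
# Assumption: in practice this would wrap langdetect or fastText.
Detector = Callable[[str], Tuple[str, float]]


def dominant_language(turns: List[str], detect: Detector) -> Tuple[str, float]:
    """Return the language with the largest confidence- and length-weighted vote,
    together with its share of the total vote."""
    weights: Counter = Counter()
    for text in turns:
        if not text.strip():
            continue  # empty turns carry no language signal
        lang, conf = detect(text)
        weights[lang] += conf * len(text)
    total = sum(weights.values())
    if total == 0:
        return ("unknown", 0.0)
    lang, weight = weights.most_common(1)[0]
    return (lang, weight / total)


def is_english(turns: List[str], detect: Detector, min_share: float = 0.8) -> bool:
    """Keep a conversation only if English dominates with enough weight.
    The 0.8 default is an illustrative value, not one taken from the source."""
    lang, share = dominant_language(turns, detect)
    return lang == "en" and share >= min_share


def toy_detect(text: str) -> Tuple[str, float]:
    """Toy stand-in detector: treats mostly-ASCII letters as English."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return ("unknown", 0.0)
    ascii_ratio = sum(c.isascii() for c in letters) / len(letters)
    if ascii_ratio >= 0.5:
        return ("en", ascii_ratio)
    return ("other", 1.0 - ascii_ratio)


if __name__ == "__main__":
    print(is_english(["Hello, how are you?", "I am fine."], toy_detect))
    print(is_english(["你好，今天天气怎么样？", "ok"], toy_detect))
```

Weighting each turn by its length keeps a short English acknowledgement from outvoting a long non-English turn. A per-turn majority vote would be a simpler alternative.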

Toxicity Detection

Conversations containing toxic, harmful, or policy-violating content are identified using API-based moderation endpoints (e.g., the OpenAI moderation API). Each conversation's text is submitted to the moderation endpoint, which returns category-level scores for hate speech, sexual content, violence, self-harm, and other policy categories. Conversations exceeding configurable score thresholds in any category are excluded from the released dataset. API-based detection complements rule-based approaches by catching nuanced or context-dependent violations that simple keyword matching would miss.
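The thresholding step can be sketched as below, assuming the category scores have already been obtained from a moderation endpoint and arrive as a plain dictionary. The dictionary shape, the function names, and the 0.5 default threshold are hypothetical, and no API call is made here.

```python
from typing import Dict, List

# Assumed fallback threshold for categories without an explicit setting.
DEFAULT_THRESHOLD = 0.5


def flagged_categories(
    scores: Dict[str, float],
    thresholds: Dict[str, float],
    default: float = DEFAULT_THRESHOLD,
) -> List[str]:
    """Return the categories whose score exceeds its configured threshold."""
    return sorted(
        category
        for category, score in scores.items()
        if score > thresholds.get(category, default)
    )


def should_exclude(scores: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """A conversation is excluded if any single category is over threshold."""
    return bool(flagged_categories(scores, thresholds))


if __name__ == "__main__":
    # Hypothetical scores for one conversation.
    example_scores = {"hate": 0.02, "sexual": 0.91, "violence": 0.40}
    # A stricter limit for one category; the others use the default.
    example_thresholds = {"violence": 0.30}
    print(flagged_categories(example_scores, example_thresholds))
    print(should_exclude(example_scores, example_thresholds))
```

Returning the list of offending categories, rather than only a boolean, makes it possible to log why a conversation was dropped.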

Programmatic Content Detection

Conversations that consist primarily of code (e.g., debugging sessions, code generation requests with minimal natural language) are detected and optionally filtered. Code-heavy conversations are identified by heuristics such as the ratio of lines containing common programming syntax (braces, semicolons, import statements) to total lines. While code conversations are valid Arena interactions, they may be excluded from datasets intended for natural language evaluation.
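A line-ratio heuristic of this kind might look like the sketch below. The three patterns (braces, trailing semicolons, import-style statements) follow the examples named in the paragraph above. The exact regular expressions, the function names, and the 0.5 threshold are assumptions chosen for illustration.

```python
import re
from typing import List

# Illustrative patterns only; a production list would likely be longer.
CODE_LINE_PATTERNS: List["re.Pattern[str]"] = [
    re.compile(r"[{}]"),  # braces anywhere on the line
    re.compile(r";\s*$"),  # statement-terminating semicolon
    re.compile(r"^\s*(?:import\s|from\s+\S+\s+import\s|#include\b)"),  # imports
]


def looks_like_code(line: str) -> bool:
    """True if the line matches any of the programming-syntax patterns."""
    return any(pattern.search(line) for pattern in CODE_LINE_PATTERNS)


def code_line_ratio(text: str) -> float:
    """Fraction of non-empty lines that look like code."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(looks_like_code(line) for line in lines) / len(lines)


def is_code_heavy(text: str, threshold: float = 0.5) -> bool:
    """Flag conversations whose code-line ratio reaches the assumed threshold."""
    return code_line_ratio(text) >= threshold


if __name__ == "__main__":
    snippet = (
        "import os\n"
        "for (i = 0; i < n; i++) {\n"
        "  total += i;\n"
        "}\n"
        "Why is this slow?"
    )
    print(code_line_ratio(snippet))
    print(is_code_heavy("Tell me a story about a dragon."))
```

Because the filter is optional, returning the ratio separately from the boolean lets a release script record the value and decide later whether to drop code-heavy conversations.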

Conversation Length Filtering

Extremely short conversations (e.g., single-word prompts) and extremely long conversations (which may indicate copy-pasted content or automated interactions) are filtered based on configurable length thresholds. Minimum length filters remove trivial interactions that provide no meaningful evaluation signal, while maximum length filters remove outliers that could disproportionately influence aggregate statistics.
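As a rough sketch, the two bounds can be applied to the total character count of a conversation. The source states only that the thresholds are configurable, so the unit (characters rather than tokens or turns), the constants `MIN_CHARS` and `MAX_CHARS`, and the verdict labels are all assumptions.

```python
from typing import List

# Assumed bounds for illustration; real values would be tuned per dataset.
MIN_CHARS = 16
MAX_CHARS = 20_000


def conversation_length(turns: List[str]) -> int:
    """Total number of characters across all turns, ignoring edge whitespace."""
    return sum(len(turn.strip()) for turn in turns)


def length_verdict(
    turns: List[str],
    min_chars: int = MIN_CHARS,
    max_chars: int = MAX_CHARS,
) -> str:
    """Classify a conversation as 'too_short', 'too_long', or 'ok'."""
    length = conversation_length(turns)
    if length < min_chars:
        return "too_short"
    if length > max_chars:
        return "too_long"
    return "ok"


if __name__ == "__main__":
    print(length_verdict(["hi", "Hello!"]))
    print(length_verdict(["Explain how binary search works.",
                          "Binary search halves the range each step."]))
```

Returning a label instead of a boolean keeps the two failure modes distinguishable, which helps when reporting how many conversations each bound removed.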

Identity and Refusal Pattern Matching

Conversations in which the model refuses to respond or reveals its identity (e.g., "As an AI language model, I cannot...") are detected using regex-based pattern matching. A match indicates that the model's safety mechanisms were triggered or that it stepped out of the task to describe itself, so the conversation may not be representative of typical interaction quality. The pattern library covers common refusal templates across multiple model families and is maintained as a configurable list so that patterns can be added as new models join the Arena.
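A configurable pattern list of this sort could be sketched as follows. Only the first pattern is based on the example phrase quoted above; the other two are hypothetical templates added for illustration, and the function name is an assumption rather than anything from the FastChat scripts.

```python
import re
from typing import List, Optional

# Raw patterns kept as plain strings so the list can live in a config file.
# Patterns 2 and 3 are illustrative guesses at common refusal phrasing.
REFUSAL_PATTERNS: List[str] = [
    r"\bas an ai(?: language model| assistant)?\b",
    r"\bi(?:'m| am) sorry,? but i (?:can(?:'t|not)|am unable to)\b",
    r"\bi (?:can(?:'t|not)|am unable to) (?:assist|help) with\b",
]

# Compile once, case-insensitively, so matching is cheap per conversation.
_COMPILED = [re.compile(pattern, re.IGNORECASE) for pattern in REFUSAL_PATTERNS]


def first_refusal_match(text: str) -> Optional[str]:
    """Return the first matched refusal phrase, or None if nothing matches."""
    for pattern in _COMPILED:
        match = pattern.search(text)
        if match:
            return match.group(0)
    return None


if __name__ == "__main__":
    print(first_refusal_match("As an AI language model, I cannot help with that."))
    print(first_refusal_match("Sure, here is a recipe for pancakes."))
```

Supporting a new model family then amounts to appending a string to `REFUSAL_PATTERNS`. Word boundaries (`\b`) reduce accidental matches inside longer words.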

Theoretical Basis

Content filtering preserves dataset quality by removing entries that would bias model evaluation. In the context of Arena-based model comparison, the goal is to estimate each model's quality on representative, non-trivial, good-faith user prompts. Trivial prompts (single words, greetings) provide no discriminative signal between models. Adversarial inputs (prompt injections, jailbreak attempts) test a narrow safety dimension rather than general capability. Non-English content in an English-focused analysis introduces confounding variation from multilingual ability rather than core task performance.

The detection methodology combines two techniques. Rule-based pattern matching offers high precision for known patterns at low computational cost. API-based content classification provides broader coverage through learned representations of content policy violations. This two-pronged approach balances recall (catching diverse problematic content) with efficiency (avoiding expensive API calls for easily detectable cases).
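The efficiency argument can be made concrete with a small sketch in which the cheap rule-based checks run first and the expensive classifier is called only for conversations that survive them. The function name, the callable-based interface, and the stand-in checks are assumptions for illustration. A real pipeline would plug in its own heuristics and a real moderation call.

```python
from typing import Callable, Iterable, List, Tuple

# Each check returns True when the conversation should be removed.
Check = Callable[[str], bool]


def filter_conversations(
    conversations: Iterable[str],
    cheap_checks: List[Check],
    expensive_check: Check,
) -> Tuple[List[str], int]:
    """Return the kept conversations and the number of expensive calls made."""
    kept: List[str] = []
    expensive_calls = 0
    for conversation in conversations:
        # Rule-based stage: stop at the first heuristic that fires.
        if any(check(conversation) for check in cheap_checks):
            continue
        # Classifier stage: only reached by conversations the rules let through.
        expensive_calls += 1
        if expensive_check(conversation):
            continue
        kept.append(conversation)
    return kept, expensive_calls


if __name__ == "__main__":
    def too_short(text: str) -> bool:
        return len(text) < 5  # stand-in for the rule-based stage

    def fake_moderation(text: str) -> bool:
        return "toxic" in text  # stand-in for an API-based classifier

    data = ["hi", "explain sorting algorithms", "some toxic text here"]
    kept, calls = filter_conversations(data, [too_short], fake_moderation)
    print(kept, calls)
```

In this arrangement the number of classifier calls equals the number of conversations that pass every rule, so tightening the rules directly reduces API cost.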