Implementation:Datajuicer Data juicer LLMAnalysisFilter
| Knowledge Sources | |
|---|---|
| Domains | LLM-based Filtering, Data Quality, AI Judge |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Base filter class that uses an LLM to score each data sample on several quality dimensions (clarity, relevance, usefulness, fluency) and keeps or drops the sample based on the normalized average score.
Description
This operator is the base class for LLM-based quality filtering, in which an LLM acts as a judge of each sample. LLMDifficultyScoreFilter, LLMQualityScoreFilter, and LLMTaskRelevanceFilter subclass it and supply specialized prompts.
Architecture: The filter sends each sample to an LLM with a detailed system prompt that instructs the model to return a structured JSON response containing:
- dimension_scores -- Numerical scores (1-5) for clarity, relevance, usefulness, and fluency
- tags -- Categorization tags (topic, style)
- flags -- Issue flags (syntax_error, insufficient_information, etc.)
- rationale -- Explanation of scoring decisions
- recommendation -- "keep", "review", or "discard"
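As an illustration of the response shape listed above, the following sketch parses a hypothetical reply and checks that the five top-level fields are present. The sample values, and the exact nesting of tags (a mapping) and flags (a list), are assumptions made for the example rather than output taken from the operator.

```python
import json

# Hypothetical reply in the shape described above; all values are illustrative.
example_response = """
{
  "dimension_scores": {"clarity": 4, "relevance": 5, "usefulness": 3, "fluency": 4},
  "tags": {"topic": "machine learning", "style": "tutorial"},
  "flags": ["insufficient_information"],
  "rationale": "Clear and fluent, but lacks concrete detail.",
  "recommendation": "review"
}
"""

EXPECTED_KEYS = {"dimension_scores", "tags", "flags", "rationale", "recommendation"}

record = json.loads(example_response)

# Report any top-level field the reply failed to include.
missing = EXPECTED_KEYS - record.keys()
print("missing fields:", sorted(missing))
print("recommendation:", record["recommendation"])
```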
Model Support:
- API-based models (default) -- Uses OpenAI-compatible APIs (e.g., gpt-4o)
- HuggingFace models -- Local transformer models via the text-generation pipeline
- vLLM models -- High-performance inference using the vLLM engine
Key Methods:
- build_input() -- Constructs the prompt from sample fields using configurable templates (field_template and input_template)
- parse_output() -- Extracts JSON from LLM response, computes average dimension score (normalized to 0-1), and extracts tags
- generate_llm_analysis() -- Manages retries (try_num) and model invocation across all three backends
- compute_stats_single() -- Caches the llm_analysis_score and llm_analysis_record in sample stats
- process_single() -- Applies min_score/max_score thresholds; returns True to keep, False to filter
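The parse, retry, and threshold steps above can be sketched in plain Python. This is a minimal illustration, not the operator's code: the function names are hypothetical, normalization is assumed to divide the 1-5 average by 5, the threshold bounds are assumed to be inclusive, and raising after the last failed attempt is a choice made for this sketch only.

```python
import json
from typing import Callable, Sequence, Tuple

DIM_KEYS = ("clarity", "relevance", "usefulness", "fluency")
MAX_DIM_SCORE = 5.0  # assumption: each dimension is scored 1-5


def parse_output_sketch(raw_output: str, dim_keys: Sequence[str] = DIM_KEYS) -> Tuple[float, dict, dict]:
    """Pull the JSON object out of the raw reply and return (score, record, tags)."""
    start, end = raw_output.find("{"), raw_output.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    record = json.loads(raw_output[start : end + 1])
    scores = record["dimension_scores"]
    average = sum(scores[key] for key in dim_keys) / len(dim_keys)
    return average / MAX_DIM_SCORE, record, record.get("tags", {})


def analyze_with_retries(call_model: Callable[[str], str], prompt: str, try_num: int = 3):
    """Call the model up to try_num times until a reply parses cleanly."""
    last_error = None
    for _ in range(try_num):
        try:
            return parse_output_sketch(call_model(prompt))
        except (ValueError, KeyError, TypeError) as err:
            last_error = err  # malformed reply: try again
    raise RuntimeError(f"no parsable reply after {try_num} attempts") from last_error


def keep_sample(score: float, min_score: float = 0.5, max_score: float = 1.0) -> bool:
    """Return True to keep the sample; inclusive bounds are an assumption."""
    return min_score <= score <= max_score


# Stand-in model: the first reply is unusable, the second contains valid JSON.
replies = iter([
    "sorry, no JSON here",
    'Analysis: {"dimension_scores": {"clarity": 4, "relevance": 5, '
    '"usefulness": 3, "fluency": 4}, "tags": {"topic": "ml"}}',
])
score, record, tags = analyze_with_retries(lambda prompt: next(replies), "prompt text")
print(round(score, 2), keep_sample(score))
```

In the stand-in run, the first reply fails to parse and the second succeeds, so the retry loop returns on its second attempt.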
Usage
Configure in YAML with model specification and score thresholds. Supports multi-field input via input_keys/field_names for complex data formats like RFT data with query/analysis/answer fields.
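To show how input_keys and field_names pair up for multi-field data, here is a rough sketch of prompt assembly. The two template strings and the function name are hypothetical stand-ins; the operator's actual default field_template and input_template may differ.

```python
from typing import Dict, List

# Hypothetical templates; placeholders chosen for this example only.
FIELD_TEMPLATE = "**{field_name}**\n{field_data}"
INPUT_TEMPLATE = "# Data\n{data}\n\n# Response\n"


def build_input_sketch(sample: Dict[str, str], input_keys: List[str], field_names: List[str]) -> str:
    """Render each selected field under its display name, then wrap the result."""
    if len(input_keys) != len(field_names):
        raise ValueError("input_keys and field_names must have the same length")
    parts = [
        FIELD_TEMPLATE.format(field_name=name, field_data=sample[key])
        for key, name in zip(input_keys, field_names)
    ]
    return INPUT_TEMPLATE.format(data="\n\n".join(parts))


# RFT-style sample with query/analysis/answer fields.
sample = {"query": "What is 2 + 2?", "analysis": "Add the two numbers.", "answer": "4"}
prompt = build_input_sketch(
    sample,
    input_keys=["query", "analysis", "answer"],
    field_names=["Query", "Analysis", "Answer"],
)
print(prompt)
```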
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/filter/llm_analysis_filter.py
- Lines: 1-304
Signature
@OPERATORS.register_module("llm_analysis_filter")
class LLMAnalysisFilter(Filter):
    _accelerator = "cuda"

    def __init__(
        self,
        api_or_hf_model: str = "gpt-4o",
        min_score: float = 0.5,
        max_score: float = 1.0,
        is_hf_model: bool = False,
        *,
        api_endpoint: Optional[str] = None,
        response_path: Optional[str] = None,
        input_keys: List[str] = ["text"],
        field_names: List[str] = ["Text"],
        system_prompt: Optional[str] = None,
        input_template: Optional[str] = None,
        field_template: Optional[str] = None,
        try_num: PositiveInt = 3,
        enable_vllm: bool = False,
        model_params: Dict = {},
        sampling_params: Dict = {},
        dim_required_keys: Optional[List[str]] = None,
        **kwargs,
    ): ...

    def build_input(self, sample) -> str: ...

    def parse_output(self, raw_output) -> Tuple[float, dict, dict]: ...

    def compute_stats_single(self, sample, rank=None, context=False): ...

    def process_single(self, sample, rank=None) -> bool: ...
Import
from data_juicer.ops.filter.llm_analysis_filter import LLMAnalysisFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| api_or_hf_model | str | No | Model name or path (default: "gpt-4o") |
| min_score | float | No | Minimum average score to keep sample (default: 0.5) |
| max_score | float | No | Maximum average score to keep sample (default: 1.0) |
| is_hf_model | bool | No | Use HuggingFace transformer model (default: False) |
| enable_vllm | bool | No | Use vLLM engine for inference (default: False) |
| input_keys | List[str] | No | Sample field keys to include in prompt (default: ["text"]) |
| field_names | List[str] | No | Display names for input fields (default: ["Text"]) |
| system_prompt | str | No | Custom system prompt (default: built-in quality assessment prompt) |
| dim_required_keys | List[str] | No | Dimension keys for score averaging (default: ["clarity", "relevance", "usefulness", "fluency"]) |
| try_num | int | No | Number of retry attempts (default: 3) |
| sampling_params | Dict | No | Model sampling parameters (e.g., temperature, top_p) |
Outputs
| Name | Type | Description |
|---|---|---|
| keep | bool | True to retain the sample, False to filter it out |
| stats.llm_analysis_score | float | Average normalized dimension score (0-1) |
| stats.llm_analysis_record | dict | Full LLM response including dimension_scores, tags, flags, rationale |
Usage Examples
# In YAML config:
# process:
#   - llm_analysis_filter:
#       api_or_hf_model: 'gpt-4o'
#       min_score: 0.6
#       max_score: 1.0
#       input_keys: ['text']
#       field_names: ['Text']
#       dim_required_keys: ['clarity', 'relevance', 'usefulness', 'fluency']

# With vLLM for local inference:
# process:
#   - llm_analysis_filter:
#       api_or_hf_model: 'Qwen/Qwen2-7B-Instruct'
#       enable_vllm: true
#       min_score: 0.5
#       sampling_params:
#         temperature: 0.1
#         max_tokens: 1024

# Multi-field input (e.g., RFT data):
# process:
#   - llm_analysis_filter:
#       api_or_hf_model: 'gpt-4o'
#       input_keys: ['query', 'analysis', 'answer']
#       field_names: ['Query', 'Analysis', 'Answer']