Implementation:Datajuicer Data juicer LLMAnalysisFilter

Knowledge Sources	Datajuicer_Data_juicer
Domains	LLM-based Filtering, Data Quality, AI Judge
Last Updated	2026-02-14 16:00 GMT

Overview

Base filter class that uses an LLM to score each data sample on several quality dimensions (clarity, relevance, usefulness, fluency) and keeps the samples whose average score falls between min_score and max_score.

Description

This operator is the base class for LLM-based quality filtering, in which an LLM acts as the judge of each sample. LLMDifficultyScoreFilter, LLMQualityScoreFilter, and LLMTaskRelevanceFilter subclass it and supply their own specialized prompts.

Architecture: The filter sends each sample to an LLM with a detailed system prompt that instructs the model to return a structured JSON response containing:

dimension_scores -- Numerical scores (1-5) for clarity, relevance, usefulness, and fluency
tags -- Categorization tags (topic, style)
flags -- Issue flags (syntax_error, insufficient_information, etc.)
rationale -- Explanation of scoring decisions
recommendation -- "keep", "review", or "discard"
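
To make the response structure concrete, the following is a minimal sketch of a reply that has the fields listed above, parsed with the standard json module. All values are invented for illustration, and the exact shapes of tags (shown as a dict) and flags (shown as a list) are assumptions rather than something this page specifies.

```python
import json

# Hypothetical model reply: field names follow the list above, values are invented.
raw_response = """
{
  "dimension_scores": {"clarity": 4, "relevance": 5, "usefulness": 3, "fluency": 4},
  "tags": {"topic": "programming", "style": "tutorial"},
  "flags": ["insufficient_information"],
  "rationale": "Clear and fluent, but omits setup details.",
  "recommendation": "review"
}
"""

record = json.loads(raw_response)
scores = record["dimension_scores"]

# Each dimension is expected to be scored on the 1-5 scale.
assert all(1 <= value <= 5 for value in scores.values())
print(record["recommendation"], sorted(scores))
```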

Model Support:

API-based models (default) -- Uses OpenAI-compatible APIs (e.g., gpt-4o)
HuggingFace models -- Local transformer models via the text-generation pipeline
vLLM models -- High-performance inference using vLLM engine
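
The two boolean constructor flags, is_hf_model and enable_vllm, choose among these three backends. The sketch below shows one plausible mapping; select_backend is a hypothetical helper, and the precedence when both flags are set (vLLM first) is an assumption, not confirmed by this page.

```python
def select_backend(is_hf_model: bool = False, enable_vllm: bool = False) -> str:
    """Map the constructor flags to a backend name (hypothetical helper)."""
    if enable_vllm:
        # Assumed precedence: vLLM wins if both flags are set.
        return "vllm"
    if is_hf_model:
        return "huggingface"
    # Neither flag set: fall back to the OpenAI-compatible API (the default).
    return "api"


print(select_backend())
print(select_backend(is_hf_model=True))
print(select_backend(enable_vllm=True))
```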

Key Methods:

build_input() -- Constructs the prompt from sample fields using configurable templates (field_template and input_template)
parse_output() -- Extracts JSON from LLM response, computes average dimension score (normalized to 0-1), and extracts tags
generate_llm_analysis() -- Manages retries (try_num) and model invocation across all three backends
compute_stats_single() -- Caches the llm_analysis_score and llm_analysis_record in sample stats
process_single() -- Applies min_score/max_score thresholds; returns True to keep, False to filter
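
The parse-and-retry flow described above can be sketched in plain Python. This is not the operator's code: it assumes the normalized score is the mean dimension score divided by 5, locates the JSON by its outermost braces, and raises once the retries are exhausted. The real scaling, extraction, and failure behavior may differ. call_model stands in for whichever backend is configured.

```python
import json
from typing import Callable, Dict, Optional, Sequence, Tuple

DIMENSIONS = ("clarity", "relevance", "usefulness", "fluency")
MAX_SCORE = 5.0  # assumed top of the 1-5 scale


def parse_output(raw_output: str, dims: Sequence[str] = DIMENSIONS) -> Tuple[float, Dict, Dict]:
    """Return (normalized score, full record, tags) from a raw model reply."""
    start, end = raw_output.find("{"), raw_output.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    record = json.loads(raw_output[start:end + 1])
    dim_scores = record["dimension_scores"]
    mean = sum(float(dim_scores[d]) for d in dims) / len(dims)
    return mean / MAX_SCORE, record, record.get("tags", {})


def generate_with_retries(call_model: Callable[[str], str], prompt: str, try_num: int = 3):
    """Call the model up to try_num times until its reply parses."""
    last_error: Optional[Exception] = None
    for _ in range(try_num):
        try:
            return parse_output(call_model(prompt))
        except (ValueError, KeyError, TypeError) as err:
            last_error = err  # malformed reply: try again
    raise RuntimeError(f"no parsable response after {try_num} attempts") from last_error


# Stub backend: the first reply is malformed, the second is valid.
replies = iter([
    "not json",
    'Here you go: {"dimension_scores": {"clarity": 4, "relevance": 5, '
    '"usefulness": 3, "fluency": 4}, "tags": {"topic": "code"}}',
])
score, record, tags = generate_with_retries(lambda prompt: next(replies), "prompt")
print(score, tags)
```

Keeping parsing separate from the retry loop means a malformed reply and a missing dimension key are handled the same way: both trigger another attempt.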

Usage

Configure the operator in a YAML recipe by specifying the model and the score thresholds. Multi-field input is supported via input_keys and field_names, which is useful for complex data formats such as RFT data with query, analysis, and answer fields.
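
As an illustration of how multi-field input could be assembled into one prompt, the sketch below pairs each key in input_keys with its display name in field_names and fills two templates. The template strings and their placeholder names ({field_name}, {field_data}, {data}) are hypothetical; the operator's built-in defaults may look different.

```python
from typing import Dict, Sequence

# Hypothetical templates, not the operator's built-in defaults.
FIELD_TEMPLATE = "**{field_name}**\n{field_data}"
INPUT_TEMPLATE = "# Data\n{data}\n\n# Response\n"


def build_input(
    sample: Dict[str, str],
    input_keys: Sequence[str],
    field_names: Sequence[str],
    field_template: str = FIELD_TEMPLATE,
    input_template: str = INPUT_TEMPLATE,
) -> str:
    """Render the selected sample fields into a single prompt string."""
    if len(input_keys) != len(field_names):
        raise ValueError("input_keys and field_names must have the same length")
    parts = [
        field_template.format(field_name=name, field_data=sample[key])
        for key, name in zip(input_keys, field_names)
    ]
    return input_template.format(data="\n\n".join(parts))


sample = {"query": "What is 2 + 2?", "analysis": "Add the two numbers.", "answer": "4"}
prompt = build_input(sample, ["query", "analysis", "answer"], ["Query", "Analysis", "Answer"])
print(prompt)
```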

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/filter/llm_analysis_filter.py
Lines: 1-304

Signature

@OPERATORS.register_module("llm_analysis_filter")
class LLMAnalysisFilter(Filter):
    _accelerator = "cuda"

    def __init__(
        self, api_or_hf_model: str = "gpt-4o",
        min_score: float = 0.5, max_score: float = 1.0,
        is_hf_model: bool = False, *,
        api_endpoint: Optional[str] = None,
        response_path: Optional[str] = None,
        input_keys: List[str] = ["text"],
        field_names: List[str] = ["Text"],
        system_prompt: Optional[str] = None,
        input_template: Optional[str] = None,
        field_template: Optional[str] = None,
        try_num: PositiveInt = 3,
        enable_vllm: bool = False,
        model_params: Dict = {},
        sampling_params: Dict = {},
        dim_required_keys: Optional[List[str]] = None,
        **kwargs,
    ): ...
    def build_input(self, sample) -> str: ...
    def parse_output(self, raw_output) -> Tuple[float, dict, dict]: ...
    def compute_stats_single(self, sample, rank=None, context=False): ...
    def process_single(self, sample, rank=None) -> bool: ...

Import

from data_juicer.ops.filter.llm_analysis_filter import LLMAnalysisFilter

I/O Contract

Inputs

Name	Type	Required	Description
api_or_hf_model	str	No	Model name or path (default: "gpt-4o")
min_score	float	No	Minimum average score to keep sample (default: 0.5)
max_score	float	No	Maximum average score to keep sample (default: 1.0)
is_hf_model	bool	No	Use HuggingFace transformer model (default: False)
enable_vllm	bool	No	Use vLLM engine for inference (default: False)
input_keys	List[str]	No	Sample field keys to include in prompt (default: ["text"])
field_names	List[str]	No	Display names for input fields (default: ["Text"])
system_prompt	str	No	Custom system prompt (default: built-in quality assessment prompt)
dim_required_keys	List[str]	No	Dimension keys for score averaging (default: ["clarity", "relevance", "usefulness", "fluency"])
try_num	int	No	Number of retry attempts (default: 3)
sampling_params	Dict	No	Model sampling parameters (e.g., temperature, top_p)

Outputs

Name	Type	Description
keep	bool	True to retain the sample, False to filter it out
stats.llm_analysis_score	float	Average normalized dimension score (0-1)
stats.llm_analysis_record	dict	Full LLM response including dimension_scores, tags, flags, rationale
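
The keep decision reduces to a range check on the cached score. A minimal sketch, assuming both bounds are inclusive and using a plain "stats" dict as a stand-in for wherever the framework stores sample statistics:

```python
from typing import Dict


def keep_sample(stats: Dict[str, float], min_score: float = 0.5, max_score: float = 1.0) -> bool:
    """Return True when the cached score lies in [min_score, max_score]."""
    # Inclusive bounds are an assumption of this sketch.
    return min_score <= stats["llm_analysis_score"] <= max_score


# Hypothetical samples with precomputed scores.
samples = [
    {"id": "a", "stats": {"llm_analysis_score": 0.8}},
    {"id": "b", "stats": {"llm_analysis_score": 0.35}},
    {"id": "c", "stats": {"llm_analysis_score": 0.5}},
]
kept = [s["id"] for s in samples if keep_sample(s["stats"])]
print(kept)
```

Because the score is cached in the stats, the thresholds can be changed and re-applied without calling the LLM again.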

Usage Examples

# In YAML config:
process:
  - llm_analysis_filter:
      api_or_hf_model: 'gpt-4o'
      min_score: 0.6
      max_score: 1.0
      input_keys: ['text']
      field_names: ['Text']
      dim_required_keys: ['clarity', 'relevance', 'usefulness', 'fluency']

# With vLLM for local inference:
process:
  - llm_analysis_filter:
      api_or_hf_model: 'Qwen/Qwen2-7B-Instruct'
      enable_vllm: true
      min_score: 0.5
      sampling_params:
        temperature: 0.1
        max_tokens: 1024

# Multi-field input (e.g., RFT data):
process:
  - llm_analysis_filter:
      api_or_hf_model: 'gpt-4o'
      input_keys: ['query', 'analysis', 'answer']
      field_names: ['Query', 'Analysis', 'Answer']
