Implementation:Datajuicer Data juicer LLMAnalysisFilter

From Leeroopedia
Knowledge Sources
Domains: LLM-based Filtering, Data Quality, AI Judge
Last Updated: 2026-02-14 16:00 GMT

Overview

Base filter class that uses an LLM to analyze and score data samples along several quality dimensions (by default clarity, relevance, usefulness, and fluency) and keeps only the samples whose average score lies within the configured min_score/max_score range.

Description

This operator serves as the foundation class for LLM-based quality filtering, enabling sophisticated AI-judge-based data curation. It is subclassed by LLMDifficultyScoreFilter, LLMQualityScoreFilter, and LLMTaskRelevanceFilter with specialized prompts.

Architecture: The filter sends each sample to an LLM with a detailed system prompt that instructs the model to return a structured JSON response containing:

  • dimension_scores -- Numerical scores (1-5) for clarity, relevance, usefulness, and fluency
  • tags -- Categorization tags (topic, style)
  • flags -- Issue flags (syntax_error, insufficient_information, etc.)
  • rationale -- Explanation of scoring decisions
  • recommendation -- "keep", "review", or "discard"
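
A response following this structure might look like the sketch below. The field values are invented for illustration, the shape of tags (a dict keyed by topic and style) is an assumption, and check_response is a hypothetical helper rather than part of Data-Juicer.

```python
import json

# Illustrative response only; every value here is made up.
raw = """
{
  "dimension_scores": {"clarity": 4, "relevance": 5, "usefulness": 3, "fluency": 4},
  "tags": {"topic": "programming", "style": "tutorial"},
  "flags": ["insufficient_information"],
  "rationale": "Clear and relevant, but lacks supporting detail.",
  "recommendation": "review"
}
"""

EXPECTED_KEYS = {"dimension_scores", "tags", "flags", "rationale", "recommendation"}


def check_response(text: str) -> dict:
    """Hypothetical helper: parse the JSON and verify its top-level keys."""
    data = json.loads(text)
    missing = EXPECTED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["recommendation"] not in {"keep", "review", "discard"}:
        raise ValueError("unexpected recommendation value")
    return data


response = check_response(raw)
```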

Model Support:

  • API-based models (default) -- Uses OpenAI-compatible APIs (e.g., gpt-4o)
  • HuggingFace models -- Local transformer models via the text-generation pipeline
  • vLLM models -- High-performance inference using vLLM engine

Key Methods:

  • build_input() -- Constructs the prompt from sample fields using configurable templates (field_template and input_template)
  • parse_output() -- Extracts JSON from LLM response, computes average dimension score (normalized to 0-1), and extracts tags
  • generate_llm_analysis() -- Manages retries (try_num) and model invocation across all three backends
  • compute_stats_single() -- Runs the LLM analysis and stores llm_analysis_score and llm_analysis_record in the sample's stats so the result can be reused
  • process_single() -- Applies min_score/max_score thresholds; returns True to keep, False to filter
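
The parse, score, retry, and threshold steps above can be sketched in plain Python. This is a minimal illustration, not the library's code: the helper names are hypothetical, and it assumes that normalization divides the mean dimension score by the maximum of 5 and that both thresholds are inclusive.

```python
import json
import re
from typing import Callable, List, Optional, Tuple

DIMS = ["clarity", "relevance", "usefulness", "fluency"]


def extract_json(raw_output: str) -> dict:
    """Pull the outermost {...} span out of free-form model text."""
    match = re.search(r"\{.*\}", raw_output, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))


def average_score(record: dict, dims: List[str] = DIMS, scale: float = 5.0) -> float:
    """Mean of the selected dimension scores, mapped onto 0-1 (assumed: divide by 5)."""
    scores = record["dimension_scores"]
    return sum(scores[key] for key in dims) / len(dims) / scale


def keep_sample(score: float, min_score: float = 0.5, max_score: float = 1.0) -> bool:
    """True keeps the sample; inclusive bounds are an assumption."""
    return min_score <= score <= max_score


def analyze_with_retries(
    call_model: Callable[[str], str], prompt: str, try_num: int = 3
) -> Tuple[Optional[float], Optional[dict]]:
    """Call the model up to try_num times until the output parses."""
    for _ in range(try_num):
        try:
            record = extract_json(call_model(prompt))
            return average_score(record), record
        except (ValueError, KeyError):  # JSONDecodeError is a ValueError
            continue
    return None, None
```

Here call_model stands in for whichever backend is configured (API, HuggingFace, or vLLM); any callable that maps a prompt string to a response string fits the sketch.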

Usage

Configure the operator in a YAML recipe with a model specification and score thresholds. Multi-field input is supported through input_keys and field_names, which helps with complex data formats such as RFT data with query, analysis, and answer fields.
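
To illustrate how input_keys and field_names could combine into a single prompt, here is a minimal sketch. The two templates and the build_prompt helper are hypothetical stand-ins; the operator's real default field_template and input_template are not shown on this page.

```python
from typing import Dict, List

# Hypothetical templates, chosen only for this illustration.
FIELD_TEMPLATE = "{field_name}:\n{field_data}"
INPUT_TEMPLATE = "# Data\n{data}\n\n# Response\n"


def build_prompt(sample: Dict[str, str], input_keys: List[str], field_names: List[str]) -> str:
    """Render each selected field under its display name, then wrap the result."""
    if len(input_keys) != len(field_names):
        raise ValueError("input_keys and field_names must have the same length")
    parts = [
        FIELD_TEMPLATE.format(field_name=name, field_data=sample[key])
        for key, name in zip(input_keys, field_names)
    ]
    return INPUT_TEMPLATE.format(data="\n\n".join(parts))


sample = {"query": "What is 2 + 2?", "analysis": "Add the two numbers.", "answer": "4"}
prompt = build_prompt(
    sample,
    input_keys=["query", "analysis", "answer"],
    field_names=["Query", "Analysis", "Answer"],
)
```

Pairing each key with a display name keeps the prompt readable for the judge model even when the dataset's column names are terse or inconsistent.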

Code Reference

Source Location

Signature

@OPERATORS.register_module("llm_analysis_filter")
class LLMAnalysisFilter(Filter):
    _accelerator = "cuda"

    def __init__(
        self, api_or_hf_model: str = "gpt-4o",
        min_score: float = 0.5, max_score: float = 1.0,
        is_hf_model: bool = False, *,
        api_endpoint: Optional[str] = None,
        response_path: Optional[str] = None,
        input_keys: List[str] = ["text"],
        field_names: List[str] = ["Text"],
        system_prompt: Optional[str] = None,
        input_template: Optional[str] = None,
        field_template: Optional[str] = None,
        try_num: PositiveInt = 3,
        enable_vllm: bool = False,
        model_params: Dict = {},
        sampling_params: Dict = {},
        dim_required_keys: Optional[List[str]] = None,
        **kwargs,
    ): ...
    def build_input(self, sample) -> str: ...
    def parse_output(self, raw_output) -> Tuple[float, dict, dict]: ...
    def compute_stats_single(self, sample, rank=None, context=False): ...
    def process_single(self, sample, rank=None) -> bool: ...

Import

from data_juicer.ops.filter.llm_analysis_filter import LLMAnalysisFilter

I/O Contract

Inputs

Name | Type | Required | Description
api_or_hf_model | str | No | Model name or path (default: "gpt-4o")
min_score | float | No | Minimum average score to keep sample (default: 0.5)
max_score | float | No | Maximum average score to keep sample (default: 1.0)
is_hf_model | bool | No | Use HuggingFace transformer model (default: False)
enable_vllm | bool | No | Use vLLM engine for inference (default: False)
input_keys | List[str] | No | Sample field keys to include in prompt (default: ["text"])
field_names | List[str] | No | Display names for input fields (default: ["Text"])
system_prompt | str | No | Custom system prompt (default: built-in quality assessment prompt)
dim_required_keys | List[str] | No | Dimension keys for score averaging (default: ["clarity", "relevance", "usefulness", "fluency"])
try_num | int | No | Number of retry attempts (default: 3)
sampling_params | Dict | No | Model sampling parameters (e.g., temperature, top_p)

Outputs

Name | Type | Description
keep | bool | True to retain the sample, False to filter it out
stats.llm_analysis_score | float | Average normalized dimension score (0-1)
stats.llm_analysis_record | dict | Full LLM response including dimension_scores, tags, flags, rationale

Usage Examples

# In YAML config:
# process:
#   - llm_analysis_filter:
#       api_or_hf_model: 'gpt-4o'
#       min_score: 0.6
#       max_score: 1.0
#       input_keys: ['text']
#       field_names: ['Text']
#       dim_required_keys: ['clarity', 'relevance', 'usefulness', 'fluency']

# With vLLM for local inference:
# process:
#   - llm_analysis_filter:
#       api_or_hf_model: 'Qwen/Qwen2-7B-Instruct'
#       enable_vllm: true
#       min_score: 0.5
#       sampling_params:
#         temperature: 0.1
#         max_tokens: 1024

# Multi-field input (e.g., RFT data):
# process:
#   - llm_analysis_filter:
#       api_or_hf_model: 'gpt-4o'
#       input_keys: ['query', 'analysis', 'answer']
#       field_names: ['Query', 'Analysis', 'Answer']
