Implementation:Datajuicer Data juicer LLMTaskRelevanceFilter

From Leeroopedia
Knowledge Sources
Domains: Data_Quality, Filtering
Last Updated: 2026-02-14 16:00 GMT

Overview

A concrete tool, provided by Data-Juicer, for filtering data samples by task relevance scores that an LLM estimates.

Description

LLMTaskRelevanceFilter is a filter operator that keeps samples whose relevance to a validation task, as estimated by an LLM, is high. The LLM scores each sample on five dimensions: topical relevance, linguistic style match, task match, knowledge alignment, and potential utility. Each dimension is scored on a 1-5 scale. The key metric, llm_task_relevance, is the average score across these dimensions, and a sample is kept if its average score meets or exceeds the specified minimum threshold (min_score).

The operator requires a validation dataset or a task description to be prepared before use. It extends LLMAnalysisFilter and implements the two-phase compute_stats/process pattern: compute_stats records the scores in the sample's stats, and process decides from those stats whether to keep the sample.
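The averaging and threshold rule described above can be sketched in plain Python. This is an illustrative sketch only, not the operator's actual code: the dictionary keys, `average_relevance`, and `keep_sample` are hypothetical names, and the threshold is compared on the same raw 1-5 scale as the scores, since this page does not state how the default `min_score` of 0.5 relates to that scale.

```python
# Dimension names come from the description; the key spellings are assumptions.
DIMENSIONS = (
    "topical_relevance",
    "linguistic_style_match",
    "task_match",
    "knowledge_alignment",
    "potential_utility",
)


def average_relevance(dimension_scores: dict) -> float:
    """Average the five per-dimension scores, each expected in the range 1..5."""
    for name in DIMENSIONS:
        score = dimension_scores[name]
        if not 1 <= score <= 5:
            raise ValueError(f"{name} must be in 1..5, got {score}")
    return sum(dimension_scores[name] for name in DIMENSIONS) / len(DIMENSIONS)


def keep_sample(dimension_scores: dict, min_score: float) -> bool:
    """Keep the sample when its average meets or exceeds the threshold."""
    return average_relevance(dimension_scores) >= min_score


# Hypothetical scores for one sample.
scores = {
    "topical_relevance": 4,
    "linguistic_style_match": 3,
    "task_match": 5,
    "knowledge_alignment": 4,
    "potential_utility": 4,
}
print(average_relevance(scores))           # average of the five scores
print(keep_sample(scores, min_score=3.5))  # threshold on the same scale as the scores
```

Because the comparison is "meets or exceeds", a sample whose average equals the threshold exactly is kept.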

Usage

Import this operator when you need to filter dataset samples by their relevance to a specific downstream task. The validation set (or task description) must be prepared before the filter is applied. Configure the operator in your Data-Juicer YAML config or instantiate it directly in Python.

Code Reference

Source Location

Signature

@OPERATORS.register_module("llm_task_relevance_filter")
@ATTRIBUTION_FILTERS.register_module("llm_task_relevance_filter")
class LLMTaskRelevanceFilter(LLMAnalysisFilter):
    def __init__(
        self,
        api_or_hf_model: str = "gpt-4o",
        min_score: float = 0.5,
        is_hf_model: bool = False,
        *,
        valid_dataset: Optional[List[Dict]] = None,
        task_desc: Optional[str] = None,
        n_shot: Optional[int] = None,
        **kwargs,
    ):
        ...

Import

from data_juicer.ops.filter.llm_task_relevance_filter import LLMTaskRelevanceFilter

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| api_or_hf_model | str | No | API or HuggingFace model name. Default: "gpt-4o" |
| min_score | float | No | The lowest score threshold to keep the sample. Default: 0.5 |
| is_hf_model | bool | No | Indicates if the model is from HuggingFace. Default: False |
| valid_dataset | Optional[List[Dict]] | No | The dataset to use for validation. Default: None |
| task_desc | Optional[str] | No | The description of the validation task. Default: None |
| n_shot | Optional[int] | No | The number of shots in validation. Default: None |

Outputs

| Name | Type | Description |
|------|------|-------------|
| samples | Dict | Filtered samples with the stats field updated (llm_task_relevance, llm_task_relevance_record) |
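To make the output contract concrete, here is a minimal sketch of the two-phase compute_stats/process pattern using plain dictionaries. It does not call Data-Juicer or an LLM: the `"stats"` and `"text"` keys, `stub_scorer`, and the free functions are hypothetical stand-ins, and only the two stat names are taken from the table above.

```python
from typing import Callable, Dict, Tuple

STATS_KEY = "stats"  # assumption: stand-in for the operator's stats field
METRIC = "llm_task_relevance"
RECORD = "llm_task_relevance_record"


def compute_stats(sample: Dict, scorer: Callable[[str], Tuple[float, str]]) -> Dict:
    """Phase 1: score the sample once and store the result in its stats."""
    stats = sample.setdefault(STATS_KEY, {})
    if METRIC in stats:  # already scored, skip the (expensive) scorer call
        return sample
    score, record = scorer(sample["text"])  # "text" key is an assumption
    stats[METRIC] = score
    stats[RECORD] = record
    return sample


def process(sample: Dict, min_score: float) -> bool:
    """Phase 2: decide from the stored stat alone whether to keep the sample."""
    return sample[STATS_KEY][METRIC] >= min_score


def stub_scorer(text: str) -> Tuple[float, str]:
    """Hypothetical stand-in for the LLM call: math-looking text scores high."""
    score = 4.0 if "integral" in text else 1.0
    return score, f"stub rationale for: {text}"


samples = [
    {"text": "Evaluate the integral of x^2."},
    {"text": "Top ten beach resorts."},
]
scored = [compute_stats(s, stub_scorer) for s in samples]
kept = [s for s in scored if process(s, min_score=3.0)]
print([s["text"] for s in kept])
```

Separating the phases means the scores stay attached to every sample, so the threshold can be changed and the keep/drop decision recomputed without scoring again.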

Usage Examples

YAML Configuration

process:
  - llm_task_relevance_filter:
      api_or_hf_model: "gpt-4o"
      min_score: 0.5
      task_desc: "Math problem solving"

Python API

from data_juicer.ops.filter.llm_task_relevance_filter import LLMTaskRelevanceFilter

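# valid_data: validation samples (a list of dicts), prepared beforehand
# dataset: a Data-Juicer dataset, loaded beforehand
op = LLMTaskRelevanceFilter(
    api_or_hf_model="gpt-4o",
    min_score=0.5,
    valid_dataset=valid_data,
    task_desc="Math problem solving",
)
# Apply the filter to the dataset
result = dataset.process(op)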
