Implementation:Datajuicer Data juicer LLMTaskRelevanceFilter

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Quality, Filtering
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for filtering data samples based on task relevance scores estimated by an LLM.

Description

LLMTaskRelevanceFilter is a filter operator that keeps samples whose relevance to a validation task, as estimated by an LLM, is high. The LLM scores each sample on five dimensions: topical relevance, linguistic style match, task match, knowledge alignment, and potential utility. Each dimension is scored on a 1-5 scale. The key metric, llm_task_relevance, is the average score across these dimensions. A sample is kept if its average score meets or exceeds the specified minimum threshold (min_score). The operator requires a validation dataset or a task description to be prepared before use. It extends LLMAnalysisFilter and implements the two-phase compute_stats/process pattern: compute_stats records the score in the sample's stats, and process decides whether to keep the sample.
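The averaging and thresholding described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the operator's actual code: the dimension key spellings, the "stats" key, the example scores, and the threshold of 4.0 are all assumptions made for the sketch, and no LLM is called. The page does not say whether the average is rescaled before it is compared with min_score, so the sketch compares the raw 1-5 mean.

```python
from statistics import mean

# Dimension names follow the description; the exact key spellings are
# hypothetical and chosen only for this sketch.
DIMENSIONS = (
    "topical_relevance",
    "linguistic_style_match",
    "task_match",
    "knowledge_alignment",
    "potential_utility",
)


def compute_stats(sample: dict, dimension_scores: dict) -> dict:
    """Phase 1: store the averaged relevance score in the sample's stats."""
    scores = [dimension_scores[d] for d in DIMENSIONS]
    if any(not 1 <= s <= 5 for s in scores):
        raise ValueError("each dimension score must lie in [1, 5]")
    # "stats" is a placeholder key, not necessarily the real field name.
    stats = sample.setdefault("stats", {})
    stats["llm_task_relevance"] = mean(scores)
    return sample


def process(sample: dict, min_score: float) -> bool:
    """Phase 2: keep the sample if its stored score reaches min_score."""
    return sample["stats"]["llm_task_relevance"] >= min_score


# Hand-written scores stand in for what an LLM would return.
sample = compute_stats(
    {"text": "Solve 2x + 3 = 7."},
    {
        "topical_relevance": 5,
        "linguistic_style_match": 4,
        "task_match": 5,
        "knowledge_alignment": 4,
        "potential_utility": 4,
    },
)
keep = process(sample, min_score=4.0)
```

Splitting the work into two functions mirrors the compute_stats/process pattern: the score is computed once and stored, so the keep/drop decision can be re-evaluated with a different threshold without scoring the sample again.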

Usage

Import this operator when you need to filter dataset samples based on their relevance to a specific downstream task. The validation set must be prepared before applying the filter. Configure it in your Data-Juicer YAML config or instantiate directly.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/filter/llm_task_relevance_filter.py
Lines: 1-193

Signature

@OPERATORS.register_module("llm_task_relevance_filter")
@ATTRIBUTION_FILTERS.register_module("llm_task_relevance_filter")
class LLMTaskRelevanceFilter(LLMAnalysisFilter):
    def __init__(
        self,
        api_or_hf_model: str = "gpt-4o",
        min_score: float = 0.5,
        is_hf_model: bool = False,
        *,
        valid_dataset: Optional[List[Dict]] = None,
        task_desc: Optional[str] = None,
        n_shot: Optional[int] = None,
        **kwargs,
    ):
        ...

Import

from data_juicer.ops.filter.llm_task_relevance_filter import LLMTaskRelevanceFilter

I/O Contract

Inputs

Name	Type	Required	Description
api_or_hf_model	str	No	API or HuggingFace model name. Default: "gpt-4o"
min_score	float	No	The lowest score threshold to keep the sample. Default: 0.5
is_hf_model	bool	No	Indicates if the model is from HuggingFace. Default: False
valid_dataset	Optional[List[Dict]]	No	The dataset to use for validation. Default: None
task_desc	Optional[str]	No	The description of the validation task. Default: None
n_shot	Optional[int]	No	The number of shots in validation. Default: None

Outputs

Name	Type	Description
samples	Dict	Filtered samples with stats field updated (llm_task_relevance, llm_task_relevance_record)
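To inspect the recorded scores after filtering, a small helper such as the one below may be useful. It is a sketch under stated assumptions: the stats field name "__dj__stats__" is assumed rather than confirmed by this page, the helper names are hypothetical, and the sample values are made up for illustration. The structure of llm_task_relevance_record is not described here, so the sketch leaves it out.

```python
from statistics import mean
from typing import Dict, Iterable, List, Optional

# Assumption: name of the per-sample stats field; adjust if it differs.
STATS_KEY = "__dj__stats__"


def relevance_of(sample: Dict) -> Optional[float]:
    """Return the stored llm_task_relevance score, or None if absent."""
    return sample.get(STATS_KEY, {}).get("llm_task_relevance")


def summarize(samples: Iterable[Dict]) -> Dict[str, float]:
    """Summarize the scores of all samples that carry the stat."""
    scores: List[float] = [
        s for s in (relevance_of(x) for x in samples) if s is not None
    ]
    if not scores:
        return {"count": 0}
    return {
        "count": len(scores),
        "min": min(scores),
        "max": max(scores),
        "mean": mean(scores),
    }


# Made-up samples; the last one has no stats and is skipped.
kept = [
    {"text": "a", STATS_KEY: {"llm_task_relevance": 4.0}},
    {"text": "b", STATS_KEY: {"llm_task_relevance": 5.0}},
    {"text": "c"},
]
summary = summarize(kept)
```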

Usage Examples

YAML Configuration

process:
  - llm_task_relevance_filter:
      api_or_hf_model: "gpt-4o"
      min_score: 0.5
      task_desc: "Math problem solving"

Python API

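from data_juicer.ops.filter.llm_task_relevance_filter import LLMTaskRelevanceFilter

# valid_data: list of dict samples from the validation task, loaded beforehand
# dataset: the Data-Juicer dataset to filter, loaded beforehand
op = LLMTaskRelevanceFilter(
    api_or_hf_model="gpt-4o",
    min_score=0.5,
    valid_dataset=valid_data,
    task_desc="Math problem solving",
)
# Apply the filter to the dataset
result = dataset.process(op)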

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
