Implementation:Datajuicer Data juicer LLMTaskRelevanceFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Filtering |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A concrete tool, provided by Data-Juicer, for filtering data samples by task relevance scores that an LLM estimates.
Description
LLMTaskRelevanceFilter is a filter operator that keeps samples an LLM judges to be highly relevant to a validation task. The LLM scores each sample on five dimensions: topical relevance, linguistic style match, task match, knowledge alignment, and potential utility. Each dimension is scored on a 1-5 scale. The key metric, llm_task_relevance, is the average score across these dimensions, and a sample is kept if this average meets or exceeds the configured minimum threshold (min_score). A validation dataset or a task description must be prepared before the operator is used. The operator extends LLMAnalysisFilter and implements the two-phase compute_stats/process pattern.
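The scoring rule described above can be sketched in plain Python. This is a minimal illustration rather than the operator's actual code: the dimension key names and the helpers `average_relevance` and `keep_sample` are assumptions, and the raw 1-5 average is compared directly against the threshold, without any normalization the real operator may apply.

```python
from typing import Dict

# Hypothetical key names for the five dimensions named in the description.
DIMENSIONS = (
    "topical_relevance",
    "linguistic_style_match",
    "task_match",
    "knowledge_alignment",
    "potential_utility",
)


def average_relevance(dimension_scores: Dict[str, float]) -> float:
    """Average the per-dimension scores, each expected on a 1-5 scale."""
    scores = [dimension_scores[name] for name in DIMENSIONS]
    for score in scores:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
    return sum(scores) / len(scores)


def keep_sample(dimension_scores: Dict[str, float], min_score: float) -> bool:
    """Keep the sample when its average meets or exceeds the threshold."""
    return average_relevance(dimension_scores) >= min_score


# Illustrative scores, not real LLM output.
example = {
    "topical_relevance": 5,
    "linguistic_style_match": 3,
    "task_match": 4,
    "knowledge_alignment": 4,
    "potential_utility": 4,
}
print(average_relevance(example), keep_sample(example, min_score=4.0))
```

Because the comparison is "meets or exceeds", a sample whose average equals the threshold exactly is kept.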
Usage
Import this operator when you need to filter dataset samples by their relevance to a specific downstream task. A validation dataset or task description must be prepared before applying the filter. Configure the operator in your Data-Juicer YAML config or instantiate it directly in Python.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/filter/llm_task_relevance_filter.py
- Lines: 1-193
Signature
@OPERATORS.register_module("llm_task_relevance_filter")
@ATTRIBUTION_FILTERS.register_module("llm_task_relevance_filter")
class LLMTaskRelevanceFilter(LLMAnalysisFilter):
    def __init__(
        self,
        api_or_hf_model: str = "gpt-4o",
        min_score: float = 0.5,
        is_hf_model: bool = False,
        *,
        valid_dataset: Optional[List[Dict]] = None,
        task_desc: Optional[str] = None,
        n_shot: Optional[int] = None,
        **kwargs,
    ):
        ...
Import
from data_juicer.ops.filter.llm_task_relevance_filter import LLMTaskRelevanceFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| api_or_hf_model | str | No | API or HuggingFace model name. Default: "gpt-4o" |
| min_score | float | No | Minimum score a sample must reach to be kept. Default: 0.5 |
| is_hf_model | bool | No | Indicates if the model is from HuggingFace. Default: False |
| valid_dataset | Optional[List[Dict]] | No | The dataset to use for validation. Default: None |
| task_desc | Optional[str] | No | The description of the validation task. Default: None |
| n_shot | Optional[int] | No | The number of shots in validation. Default: None |
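Since the description states that a validation dataset or a task description is needed before use, a caller may want to check the arguments up front. The helper below is a hypothetical pre-flight check written for this page, not part of Data-Juicer; the requirement that n_shot be positive is likewise an assumption.

```python
from typing import Dict, List, Optional


def check_task_context(
    valid_dataset: Optional[List[Dict]] = None,
    task_desc: Optional[str] = None,
    n_shot: Optional[int] = None,
) -> None:
    """Raise ValueError if no task context is supplied (hypothetical helper)."""
    has_dataset = bool(valid_dataset)
    has_desc = bool(task_desc and task_desc.strip())
    if not (has_dataset or has_desc):
        raise ValueError("Provide valid_dataset or task_desc before filtering.")
    # Assumption: a shot count, when given, should be a positive integer.
    if n_shot is not None and n_shot <= 0:
        raise ValueError("n_shot must be a positive integer when given.")


# Passes silently: a task description alone counts as task context.
check_task_context(task_desc="Math problem solving")
```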
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Filtered samples with stats field updated (llm_task_relevance, llm_task_relevance_record) |
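The output contract follows the two-phase pattern mentioned in the description: one step writes stats into the sample, and the next reads them to decide whether the sample is kept. The sketch below mimics that flow with a stand-in scorer instead of an LLM call. The class `ToyRelevanceFilter`, the function `fake_llm_score`, the `"stats"` and `"text"` key names, and the choice to store the record as a JSON string are all assumptions made for illustration.

```python
import json
from typing import Callable, Dict, Tuple

STATS_KEY = "stats"  # assumed name of the stats field


class ToyRelevanceFilter:
    """Stand-in that mimics the compute_stats/process split; no LLM involved."""

    def __init__(self, scorer: Callable[[str], Tuple[float, dict]], min_score: float):
        self.scorer = scorer
        self.min_score = min_score

    def compute_stats(self, sample: Dict) -> Dict:
        stats = sample.setdefault(STATS_KEY, {})
        # Skip the scorer if this sample already carries a score.
        if "llm_task_relevance" in stats:
            return sample
        score, record = self.scorer(sample["text"])
        stats["llm_task_relevance"] = score
        stats["llm_task_relevance_record"] = json.dumps(record)
        return sample

    def process(self, sample: Dict) -> bool:
        # Phase two only reads the stats written in phase one.
        return sample[STATS_KEY]["llm_task_relevance"] >= self.min_score


def fake_llm_score(text: str) -> Tuple[float, dict]:
    """Hypothetical scorer: treats text containing 'solve' as relevant."""
    score = 4.5 if "solve" in text.lower() else 1.5
    return score, {"rationale": "keyword heuristic, for illustration only"}


op = ToyRelevanceFilter(fake_llm_score, min_score=3.0)
samples = [{"text": "Solve 2x + 3 = 7."}, {"text": "The weather is nice."}]
kept = [s for s in map(op.compute_stats, samples) if op.process(s)]
print([s["text"] for s in kept])
```

Separating the two phases means the expensive scoring step can be run once and its stats reused, for example to try different thresholds without scoring again.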
Usage Examples
YAML Configuration
process:
  - llm_task_relevance_filter:
      api_or_hf_model: "gpt-4o"
      min_score: 0.5
Python API
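from data_juicer.ops.filter.llm_task_relevance_filter import LLMTaskRelevanceFilter

# Validation samples for the target task. The "text" field name is an
# assumption; use whatever schema your validation set follows.
valid_data = [{"text": "Solve for x: 2x + 3 = 7."}]

op = LLMTaskRelevanceFilter(
    api_or_hf_model="gpt-4o",
    min_score=0.5,
    valid_dataset=valid_data,
    task_desc="Math problem solving",
)

# Apply to a dataset; `dataset` is assumed to be an already-loaded Data-Juicer dataset.
result = dataset.process(op)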