Implementation:Datajuicer Data juicer LLMPerplexityFilter

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Quality, Filtering
Last Updated	2026-02-14 16:00 GMT

Overview

A concrete Data-Juicer operator that filters data samples by the perplexity score an LLM assigns to them.

Description

LLMPerplexityFilter is a filter operator that keeps samples whose perplexity, as computed by a specified LLM, falls within a configured range (min_score to max_score). It loads a HuggingFace model (default: Qwen/Qwen2.5-0.5B) and computes perplexity as the exponential of the model's loss value. Before scoring, the operator formats the input text using query and response templates. The resulting metric, llm_perplexity, is cached in the sample's stats field. The operator supports CUDA acceleration. It extends the Filter base class and implements the two-phase compute_stats/process pattern.
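To make "the exponential of the loss value" concrete, here is a minimal sketch that assumes the loss is the mean negative log-likelihood per token, as is usual for causal language models. The function name and the toy numbers are illustrative only and are not taken from the operator's source.

```python
import math


def perplexity_from_token_nlls(token_nlls):
    """Return exp(mean negative log-likelihood) over the given tokens.

    `token_nlls` is a hypothetical list of per-token losses; the real
    operator obtains its loss from the HuggingFace model instead.
    """
    if not token_nlls:
        raise ValueError("need at least one token loss")
    mean_loss = sum(token_nlls) / len(token_nlls)
    return math.exp(mean_loss)


# Toy case: the model assigns probability 0.25 to each of four tokens,
# so the perplexity should come out close to 4.
nlls = [-math.log(0.25)] * 4
ppl = perplexity_from_token_nlls(nlls)
```

Under this reading, a perplexity of 1.0 corresponds to a loss of zero, which is why 1.0 is a natural lower bound for min_score.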

Usage

Import this operator when you need to filter dataset samples based on the perplexity of their text as measured by a language model. Configure it in your Data-Juicer YAML config or instantiate it directly in Python.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/filter/llm_perplexity_filter.py
Lines: 1-125

Signature

@OPERATORS.register_module("llm_perplexity_filter")
class LLMPerplexityFilter(Filter):
    def __init__(
        self,
        hf_model: str = "Qwen/Qwen2.5-0.5B",
        model_params: Optional[Dict] = None,
        min_score: float = 1.0,
        max_score: float = 100.0,
        query_template: Optional[str] = None,
        response_template: Optional[str] = None,
        *args,
        **kwargs,
    ):
        ...

Import

from data_juicer.ops.filter.llm_perplexity_filter import LLMPerplexityFilter

I/O Contract

Inputs

Name	Type	Required	Description
hf_model	str	No	HuggingFace model name for computing perplexity. Default: "Qwen/Qwen2.5-0.5B"
model_params	Optional[Dict]	No	Parameters for initializing the model. Default: None
min_score	float	No	Minimum perplexity score to keep samples. Default: 1.0
max_score	float	No	Maximum perplexity score to keep samples. Default: 100.0
query_template	Optional[str]	No	Template for building the query string. Default: None (becomes "")
response_template	Optional[str]	No	Template for building the response string. Default: None (becomes "{text}")
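The template defaults in the table can be illustrated with a short sketch. It assumes the templates are filled with Python's str.format using the sample's fields, which this page does not confirm; build_query_and_response and the "question" field are hypothetical names used only for illustration.

```python
def build_query_and_response(sample, query_template=None, response_template=None):
    """Fill the two templates from a sample dict (illustrative helper).

    Mirrors the documented defaults: a None query template becomes ""
    and a None response template becomes "{text}".
    """
    if query_template is None:
        query_template = ""
    if response_template is None:
        response_template = "{text}"
    # Assumption: placeholders are resolved with str.format over sample fields.
    return query_template.format(**sample), response_template.format(**sample)


sample = {"text": "Paris is the capital of France.", "question": "What is the capital of France?"}

# With the defaults, the query is empty and the response is the raw text.
default_pair = build_query_and_response(sample)

# With custom templates, other sample fields can be pulled in.
custom_pair = build_query_and_response(
    sample,
    query_template="Question: {question}",
    response_template="Answer: {text}",
)
```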

Outputs

Name	Type	Description
samples	Dict	Filtered samples with stats field updated (llm_perplexity)
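The two-phase compute_stats/process pattern and the cached llm_perplexity stat can be sketched with a toy class. This is not the operator's implementation: the scoring function is injected as a stand-in for the LLM, the plain "stats" key stands in for Data-Juicer's stats field, and treating both bounds as inclusive is an assumption.

```python
class ToyPerplexityFilter:
    """Toy illustration of the compute_stats/process split (hypothetical class)."""

    def __init__(self, score_fn, min_score=1.0, max_score=100.0):
        self.score_fn = score_fn  # stand-in for the LLM perplexity computation
        self.min_score = min_score
        self.max_score = max_score

    def compute_stats(self, sample):
        # Phase 1: compute the metric once and cache it on the sample.
        stats = sample.setdefault("stats", {})
        if "llm_perplexity" not in stats:
            stats["llm_perplexity"] = self.score_fn(sample["text"])
        return sample

    def process(self, sample):
        # Phase 2: keep the sample only if the cached score is in range
        # (inclusive bounds are an assumption of this sketch).
        score = sample["stats"]["llm_perplexity"]
        return self.min_score <= score <= self.max_score


# Stand-in scorer: text length as a fake "perplexity", for demonstration only.
toy = ToyPerplexityFilter(score_fn=lambda text: float(len(text)), max_score=10.0)
samples = [{"text": "short"}, {"text": "a much longer piece of text"}]
kept = [s for s in map(toy.compute_stats, samples) if toy.process(s)]
```

Separating the phases means the expensive model call happens once per sample, and the cached stat can be inspected or reused without scoring again.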

Usage Examples

YAML Configuration

process:
  - llm_perplexity_filter:
      hf_model: "Qwen/Qwen2.5-0.5B"
      min_score: 1.0
      max_score: 100.0

Python API

from data_juicer.ops.filter.llm_perplexity_filter import LLMPerplexityFilter

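# Keep samples whose LLM perplexity lies between min_score and max_score
op = LLMPerplexityFilter(hf_model="Qwen/Qwen2.5-0.5B", min_score=1.0, max_score=100.0)
# Apply to dataset (`dataset` is assumed to be an already-loaded Data-Juicer dataset)
result = dataset.process(op)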

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
