Implementation:Datajuicer Data juicer LLMPerplexityFilter

From Leeroopedia
Knowledge Sources
Domains Data_Quality, Filtering
Last Updated 2026-02-14 16:00 GMT

Overview

A concrete tool, provided by Data-Juicer, for filtering data samples based on LLM perplexity scores.

Description

LLMPerplexityFilter is a filter operator that keeps samples whose perplexity, as computed by a specified LLM, falls within a configured range (min_score to max_score). It uses a HuggingFace model (default: Qwen/Qwen2.5-0.5B) and computes perplexity as the exponential of the model's loss value. The operator formats input text using query and response templates. The resulting metric, llm_perplexity, is cached in the sample's stats field. The operator supports CUDA acceleration. It extends the Filter base class and implements the two-phase compute_stats/process pattern.
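To make "the exponential of the loss value" concrete, here is a minimal sketch that assumes the loss is the mean negative log-likelihood per token (natural log). The helper name perplexity_from_token_logprobs is hypothetical and is not part of Data-Juicer; the operator itself obtains the loss from the HuggingFace model.

```python
import math


def perplexity_from_token_logprobs(token_logprobs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities.

    Hypothetical helper for illustration only; not a Data-Juicer function.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Assumed definition of the loss: average negative log-likelihood per token.
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token.
uniform_ppl = perplexity_from_token_logprobs([math.log(0.25)] * 4)

# A model that is fully certain of every token (log-probability 0).
certain_ppl = perplexity_from_token_logprobs([0.0, 0.0, 0.0])
```

Because the mean negative log-likelihood cannot be negative, perplexity under this definition is never below 1, which is consistent with the default min_score of 1.0.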

Usage

Import this operator when you need to filter dataset samples based on the perplexity of their text as measured by a language model. Configure it in your Data-Juicer YAML config or instantiate it directly in Python.

Code Reference

Source Location

Signature

@OPERATORS.register_module("llm_perplexity_filter")
class LLMPerplexityFilter(Filter):
    def __init__(
        self,
        hf_model: str = "Qwen/Qwen2.5-0.5B",
        model_params: Optional[Dict] = None,
        min_score: float = 1.0,
        max_score: float = 100.0,
        query_template: Optional[str] = None,
        response_template: Optional[str] = None,
        *args,
        **kwargs,
    ):
        ...

Import

from data_juicer.ops.filter.llm_perplexity_filter import LLMPerplexityFilter

I/O Contract

Inputs

Name | Type | Required | Description
hf_model | str | No | HuggingFace model name for computing perplexity. Default: "Qwen/Qwen2.5-0.5B"
model_params | Optional[Dict] | No | Parameters for initializing the model. Default: None
min_score | float | No | Minimum perplexity score to keep samples. Default: 1.0
max_score | float | No | Maximum perplexity score to keep samples. Default: 100.0
query_template | Optional[str] | No | Template for building the query string. Default: None (becomes "")
response_template | Optional[str] | No | Template for building the response string. Default: None (becomes "{text}")
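The template defaults listed above can be illustrated with a short sketch. It assumes the templates are filled with Python's str.format using the sample's fields as keyword arguments; that mechanism, the function name build_query_and_response, and the "question" field are assumptions made for illustration and are not confirmed by this page.

```python
def build_query_and_response(sample, query_template=None, response_template=None):
    """Fill the query/response templates from a sample dict.

    Hypothetical helper; the fallbacks mirror the documented defaults
    (None becomes "" for the query and "{text}" for the response).
    """
    query_template = query_template or ""
    response_template = response_template or "{text}"
    # Assumption: templates are plain str.format templates keyed by sample fields.
    query = query_template.format(**sample)
    response = response_template.format(**sample)
    return query, response


sample = {
    "text": "Paris is the capital of France.",
    "question": "What is the capital of France?",  # hypothetical extra field
}

# With the defaults, the query is empty and the response is the raw text.
default_pair = build_query_and_response(sample)

# With custom templates, both strings are built from sample fields.
custom_pair = build_query_and_response(
    sample,
    query_template="Q: {question}",
    response_template="A: {text}",
)
```

With the defaults, the whole text field is treated as the response and the query is empty, so no custom templates are needed for plain-text datasets.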

Outputs

Name | Type | Description
samples | Dict | Filtered samples with stats field updated (llm_perplexity)
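The two-phase compute_stats/process pattern behind this contract can be sketched as follows. This is a simplified stand-in, not the Data-Juicer Filter base class: the class name RangeFilterSketch, the plain "stats" key, the inclusive bounds, and the toy scoring function are all assumptions made for illustration.

```python
class RangeFilterSketch:
    """Hypothetical two-phase filter: compute a stat, then decide keep/drop."""

    def __init__(self, score_fn, min_score=1.0, max_score=100.0,
                 stats_key="llm_perplexity"):
        self.score_fn = score_fn
        self.min_score = min_score
        self.max_score = max_score
        self.stats_key = stats_key

    def compute_stats(self, sample):
        # Phase 1: compute the metric once and cache it on the sample.
        stats = sample.setdefault("stats", {})
        if self.stats_key not in stats:
            stats[self.stats_key] = self.score_fn(sample["text"])
        return sample

    def process(self, sample):
        # Phase 2: keep the sample only if the cached score is in range
        # (inclusive bounds are an assumption here).
        score = sample["stats"][self.stats_key]
        return self.min_score <= score <= self.max_score


calls = []


def toy_score(text):
    """Toy stand-in for an LLM perplexity computation: counts words."""
    calls.append(text)
    return float(len(text.split()))


op = RangeFilterSketch(toy_score, min_score=1.0, max_score=3.0)
samples = [{"text": "short text"}, {"text": "a much longer piece of text"}]

samples = [op.compute_stats(s) for s in samples]
samples = [op.compute_stats(s) for s in samples]  # second pass reuses cached stats
kept = [s for s in samples if op.process(s)]
```

Separating the two phases means the expensive model call happens once per sample, and the cached score can be reused if the thresholds are later changed.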

Usage Examples

YAML Configuration

process:
  - llm_perplexity_filter:
      hf_model: "Qwen/Qwen2.5-0.5B"
      min_score: 1.0
      max_score: 100.0

Python API

from data_juicer.ops.filter.llm_perplexity_filter import LLMPerplexityFilter

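# Keep samples whose perplexity under the model lies between min_score and max_score.
op = LLMPerplexityFilter(hf_model="Qwen/Qwen2.5-0.5B", min_score=1.0, max_score=100.0)
# Apply to a dataset; `dataset` is assumed to be a Data-Juicer dataset loaded elsewhere.
result = dataset.process(op)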
