Implementation:Datajuicer Data juicer LLMPerplexityFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Filtering |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for filtering data samples based on LLM perplexity scores.
Description
LLMPerplexityFilter is a filter operator that keeps samples whose perplexity, computed with a specified LLM, falls within a specified range. It uses a HuggingFace model (default: Qwen/Qwen2.5-0.5B) and computes perplexity as the exponential of the loss value. The operator formats input text using the query and response templates. The resulting metric, llm_perplexity, is cached in the sample's stats field. The operator supports CUDA acceleration, extends the Filter base class, and implements the two-phase compute_stats/process pattern.
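The relationship between loss and perplexity can be sketched in a few lines. This is a minimal illustration rather than the operator's code: it assumes the loss is the mean negative log-likelihood per token (natural log), and the helper name `perplexity_from_token_logprobs` is hypothetical.

```python
import math


def perplexity_from_token_logprobs(token_logprobs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Assumption: the loss is the average negative log-likelihood per token.
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token has perplexity 4.
uniform = [math.log(0.25)] * 8
ppl = perplexity_from_token_logprobs(uniform)

# A model that predicts every token with probability 1 has perplexity 1.
perfect = perplexity_from_token_logprobs([0.0] * 8)
```

Because a cross-entropy loss is never negative, perplexity defined this way cannot fall below 1, which is consistent with the default min_score of 1.0.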
Usage
Import this operator when you need to filter dataset samples based on the perplexity of text as measured by a language model. Configure it in your Data-Juicer YAML config or instantiate directly.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/filter/llm_perplexity_filter.py
- Lines: 1-125
Signature
@OPERATORS.register_module("llm_perplexity_filter")
class LLMPerplexityFilter(Filter):
    def __init__(
        self,
        hf_model: str = "Qwen/Qwen2.5-0.5B",
        model_params: Optional[Dict] = None,
        min_score: float = 1.0,
        max_score: float = 100.0,
        query_template: Optional[str] = None,
        response_template: Optional[str] = None,
        *args,
        **kwargs,
    ):
        ...
Import
from data_juicer.ops.filter.llm_perplexity_filter import LLMPerplexityFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hf_model | str | No | HuggingFace model name for computing perplexity. Default: "Qwen/Qwen2.5-0.5B" |
| model_params | Optional[Dict] | No | Parameters for initializing the model. Default: None |
| min_score | float | No | Minimum perplexity score to keep samples. Default: 1.0 |
| max_score | float | No | Maximum perplexity score to keep samples. Default: 100.0 |
| query_template | Optional[str] | No | Template for building the query string. Default: None (becomes "") |
| response_template | Optional[str] | No | Template for building the response string. Default: None (becomes "{text}") |
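To illustrate the two template parameters, here is a sketch that assumes they are Python format strings filled from the sample's fields, using the documented fallbacks ("" and "{text}"). The helper `build_query_response` and the sample keys are hypothetical; consult the source file for how the operator actually assembles the model input.

```python
def build_query_response(sample, query_template=None, response_template=None):
    """Fill the query and response templates from a sample dict (illustrative only)."""
    # Documented fallbacks: an empty query, and a response taken from the "text" field.
    query_template = query_template or ""
    response_template = response_template or "{text}"
    # Assumption: placeholders are resolved with str.format against the sample's fields.
    return query_template.format(**sample), response_template.format(**sample)


sample = {"question": "What is 2 + 2?", "text": "4"}

# With the defaults, only the text field is used.
default_pair = build_query_response(sample)

# Custom templates can pull in other fields of the sample.
custom_pair = build_query_response(
    sample,
    query_template="Q: {question}",
    response_template="A: {text}",
)
```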
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Filtered samples with stats field updated (llm_perplexity) |
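The two-phase contract (compute a statistic once, then decide from the cached value) can be sketched without loading a model. The snippet below is an illustration only: the plain dict keys "stats" and "llm_perplexity" stand in for Data-Juicer's field constants, `score_fn` is a hypothetical stand-in for the LLM scoring step, and the inclusive bounds are an assumption.

```python
def compute_stats(sample, score_fn):
    """Phase 1: compute the score once and cache it under the sample's stats."""
    stats = sample.setdefault("stats", {})
    if "llm_perplexity" not in stats:  # skip the expensive call if already cached
        stats["llm_perplexity"] = score_fn(sample["text"])
    return sample


def process(sample, min_score=1.0, max_score=100.0):
    """Phase 2: keep the sample only if the cached score lies in the range."""
    return min_score <= sample["stats"]["llm_perplexity"] <= max_score


calls = []


def fake_score(text):
    """Toy scorer (not a real perplexity): 10 points per whitespace-separated word."""
    calls.append(text)
    return 10.0 * len(text.split())


samples = [{"text": "a short sentence"}, {"text": " ".join(["word"] * 20)}]
samples = [compute_stats(s, fake_score) for s in samples]
samples = [compute_stats(s, fake_score) for s in samples]  # second pass reuses the cache
kept = [s for s in samples if process(s)]
```

Separating the phases means the costly model pass runs at most once per sample, and the range can be re-evaluated from the cached statistic.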
Usage Examples
YAML Configuration
process:
  - llm_perplexity_filter:
      hf_model: "Qwen/Qwen2.5-0.5B"
      min_score: 1.0
      max_score: 100.0
Python API
from data_juicer.ops.filter.llm_perplexity_filter import LLMPerplexityFilter
op = LLMPerplexityFilter(hf_model="Qwen/Qwen2.5-0.5B", min_score=1.0, max_score=100.0)
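# Apply to a dataset; `dataset` is assumed to be an already-loaded Data-Juicer dataset.
# Samples whose llm_perplexity falls outside the configured range are dropped.
result = dataset.process(op)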