

Implementation:Datajuicer Data juicer Quality Classifier Eval

From Leeroopedia
Knowledge Sources
Domains Tooling
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for evaluating trained quality classifier models against labeled datasets.

Description

Quality_Classifier_Eval is a CLI tool that evaluates a trained quality classifier model against labeled positive and negative datasets and reports classification performance metrics. It initializes a PySpark session, loads the positive datasets (label=1) and the negative datasets (label=0) using qc_utils, merges them into a single evaluation dataset, and then applies the specified classifier model via the eval utility function to compute accuracy and other metrics. It supports the pre-trained models ("gpt3", "chinese", "code") as well as custom-trained models, optionally with a custom tokenizer such as a sentencepiece model.
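The label-merge-score flow described above can be sketched in plain Python, without PySpark. This is only an illustration of the idea, not the tool's implementation: the helper names label_and_merge and binary_metrics, the toy predictor, and the choice of precision, recall, and F1 as the "other metrics" are all assumptions.

```python
from typing import Callable, Dict, Iterable, List


def label_and_merge(positive: Iterable[dict], negative: Iterable[dict],
                    text_key: str = "text") -> List[dict]:
    """Attach label=1 to positive rows and label=0 to negative rows, then merge."""
    merged = [{"text": row[text_key], "label": 1} for row in positive]
    merged += [{"text": row[text_key], "label": 0} for row in negative]
    return merged


def binary_metrics(rows: Iterable[dict],
                   predict: Callable[[str], int]) -> Dict[str, float]:
    """Score every row with `predict` and derive metrics from the confusion counts."""
    tp = fp = tn = fn = 0
    for row in rows:
        predicted = predict(row["text"])
        if predicted == 1 and row["label"] == 1:
            tp += 1
        elif predicted == 1 and row["label"] == 0:
            fp += 1
        elif predicted == 0 and row["label"] == 0:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical stand-in for a trained classifier: longer texts count as high quality.
def toy_predict(text: str) -> int:
    return 1 if len(text.split()) >= 5 else 0


positive_rows = [{"text": "a well formed sentence with many words"}, {"text": "short"}]
negative_rows = [{"text": "spam"}, {"text": "buy now cheap pills online today"}]
metrics = binary_metrics(label_and_merge(positive_rows, negative_rows), toy_predict)
print(metrics)
```

Keeping the labeling step separate from the scoring step mirrors the description: the datasets carry the ground truth, and the model only ever sees the text field.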

Usage

Use this tool to assess a quality classifier model's performance on held-out data before deploying the classifier to score production data.

Code Reference

Source Location

Signature

@logger.catch(reraise=True)
def main(
    positive_datasets=None,
    negative_datasets=None,
    model="my_quality_model",
    tokenizer=None,
    text_key="text"
):

Import

from data_juicer.tools.quality_classifier.eval import main

I/O Contract

Inputs

Name | Type | Required | Description
positive_datasets | str or List[str] | No | Paths to positive (high-quality) datasets, e.g. 'pos.parquet' or '["pos1.parquet", "pos2.parquet"]'
negative_datasets | str or List[str] | No | Paths to negative (low-quality) datasets, e.g. 'neg.parquet' or '["neg1.parquet", "neg2.parquet"]'
model | str | No | Quality classifier model name or path. Default: "my_quality_model". Built-in options: "gpt3", "chinese", "code"
tokenizer | str | No | Tokenizer to use. Default: None (the standard PySpark tokenizer). Options: "zh.sp.model", "code.sp.model", or a custom path
text_key | str | No | Name of the field holding the texts to classify. Default: "text"
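Because the dataset arguments accept either a single path or a list of paths, a caller may want to normalize them before use. The helper below is a hypothetical sketch (as_path_list is not part of Data-Juicer) that handles None, a plain path, a Python list, and a JSON-style list string such as the one shown in the table.

```python
import json
from typing import List, Optional, Sequence, Union


def as_path_list(value: Optional[Union[str, Sequence[str]]]) -> List[str]:
    """Normalize a dataset argument into a list of path strings."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    text = str(value).strip()
    if text.startswith("["):
        # A JSON-style list passed as one string, e.g. '["a.parquet", "b.parquet"]'.
        parsed = json.loads(text)
        if not isinstance(parsed, list):
            raise ValueError(f"expected a list of paths, got: {value!r}")
        return [str(item) for item in parsed]
    return [text]


print(as_path_list("pos.parquet"))
print(as_path_list('["pos1.parquet", "pos2.parquet"]'))
```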

Outputs

Name | Type | Description
evaluation_metrics | logged output | Classification performance metrics (accuracy, etc.) logged via loguru

Usage Examples

# Run from command line using fire
# python data_juicer/tools/quality_classifier/eval.py \
#   --positive_datasets='["pos1.parquet", "pos2.parquet"]' \
#   --negative_datasets='["neg1.parquet", "neg2.parquet"]' \
#   --model=gpt3 \
#   --text_key=text

# Programmatic usage
from data_juicer.tools.quality_classifier.eval import main

main(
    positive_datasets=["pos_test.parquet"],
    negative_datasets=["neg_test.parquet"],
    model="gpt3",
    tokenizer=None,
    text_key="text"
)
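When the evaluation has to be launched from another script, the command-line form above can be assembled programmatically. The following is a minimal sketch: build_eval_command is a hypothetical helper, the flag names are taken from the parameters of main, and passing list arguments as JSON strings follows the quoting style of the commented command above.

```python
import json
import sys
from typing import List, Optional


def build_eval_command(positive: List[str], negative: List[str],
                       model: str = "my_quality_model",
                       tokenizer: Optional[str] = None,
                       text_key: str = "text",
                       script: str = "data_juicer/tools/quality_classifier/eval.py",
                       ) -> List[str]:
    """Build an argv list for the evaluation script (hypothetical helper)."""
    command = [
        sys.executable,
        script,
        f"--positive_datasets={json.dumps(positive)}",
        f"--negative_datasets={json.dumps(negative)}",
        f"--model={model}",
        f"--text_key={text_key}",
    ]
    # Only pass a tokenizer when one is requested, so the default (None) applies.
    if tokenizer is not None:
        command.append(f"--tokenizer={tokenizer}")
    return command


command = build_eval_command(["pos1.parquet", "pos2.parquet"],
                             ["neg1.parquet", "neg2.parquet"],
                             model="gpt3")
print(command[1:])
```

The resulting list can be handed to subprocess.run(command, check=True); passing an argv list rather than a shell string avoids having to quote the bracketed dataset arguments.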
