Implementation:Datajuicer Data juicer Quality Classifier Eval
| Knowledge Sources | |
|---|---|
| Domains | Tooling |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A concrete tool, provided by Data-Juicer, for evaluating trained quality classifier models against labeled datasets.
Description
Quality_Classifier_Eval is a CLI tool that evaluates a trained quality classifier model against labeled positive and negative datasets and reports classification performance metrics. It initializes a PySpark session, loads the positive datasets (label=1) and the negative datasets (label=0) using qc_utils, and merges them into a single evaluation dataset. It then applies the specified classifier model through the eval utility function to compute accuracy and other metrics. The tool supports the pre-trained models "gpt3", "chinese", and "code" as well as custom-trained models, and accepts an optional custom tokenizer, including sentencepiece models.
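The label-and-merge flow described above can be illustrated without Spark. The following is a minimal pure-Python sketch, not the tool's implementation: `label_and_merge`, `evaluate`, and `toy_predict` are hypothetical names introduced here, the in-memory lists stand in for the PySpark datasets, and the metric set is an assumption chosen for illustration.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def label_and_merge(positives: List[str], negatives: List[str],
                    text_key: str = "text") -> List[Record]:
    """Attach label=1 to positive texts and label=0 to negative texts,
    then merge both into one evaluation set (hypothetical helper)."""
    merged = [{text_key: t, "label": 1} for t in positives]
    merged += [{text_key: t, "label": 0} for t in negatives]
    return merged


def evaluate(records: List[Record], predict: Callable[[str], int],
             text_key: str = "text") -> Dict[str, float]:
    """Compare predictions with labels and derive common metrics."""
    tp = fp = tn = fn = 0
    for rec in records:
        pred = predict(str(rec[text_key]))
        if rec["label"] == 1:
            if pred == 1:
                tp += 1
            else:
                fn += 1
        else:
            if pred == 1:
                fp += 1
            else:
                tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn, "accuracy": accuracy,
            "precision": precision, "recall": recall, "f1": f1}


def toy_predict(text: str) -> int:
    """Stand-in classifier (assumption): texts with >= 5 words are 'good'."""
    return 1 if len(text.split()) >= 5 else 0


positives = ["a well formed sentence with enough words",
             "another clean and informative paragraph here",
             "short one"]
negatives = ["buy now",
             "click here to win a free prize today",
             "spam"]

metrics = evaluate(label_and_merge(positives, negatives), toy_predict)
print(metrics)
```

In the real tool the prediction step is performed by the selected classifier model inside the PySpark session; the sketch only shows how the labels assigned at load time make the merged dataset directly scorable.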
Usage
Use this tool to assess quality classifier performance on held-out data before deploying a classifier to score production data.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/tools/quality_classifier/eval.py
Signature
@logger.catch(reraise=True)
def main(
    positive_datasets=None,
    negative_datasets=None,
    model="my_quality_model",
    tokenizer=None,
    text_key="text"
):
Import
from data_juicer.tools.quality_classifier.eval import main
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| positive_datasets | str or List[str] | No | Paths to positive (high-quality) datasets, e.g. 'pos.parquet' or '["pos1.parquet", "pos2.parquet"]' |
| negative_datasets | str or List[str] | No | Paths to negative (low-quality) datasets, e.g. 'neg.parquet' or '["neg1.parquet", "neg2.parquet"]' |
| model | str | No | Quality classifier model name or path. Default: "my_quality_model". Built-in options: "gpt3", "chinese", "code" |
| tokenizer | str | No | Tokenizer to use. Default: None (the standard PySpark tokenizer). Options: "zh.sp.model", "code.sp.model", or a path to a custom tokenizer model |
| text_key | str | No | Field key name holding texts to classify. Default: "text" |
Outputs
| Name | Type | Description |
|---|---|---|
| evaluation_metrics | logged output | Classification performance metrics (accuracy, etc.) logged via loguru |
Usage Examples
# Run from command line using fire
# python data_juicer/tools/quality_classifier/eval.py \
# --positive_datasets='["pos1.parquet", "pos2.parquet"]' \
# --negative_datasets='["neg1.parquet", "neg2.parquet"]' \
# --model=gpt3 \
# --text_key=text
# Programmatic usage
from data_juicer.tools.quality_classifier.eval import main
main(
    positive_datasets=["pos_test.parquet"],
    negative_datasets=["neg_test.parquet"],
    model="gpt3",
    tokenizer=None,
    text_key="text"
)
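When the command-line form is launched from another script, quoting the list-valued flags correctly is the error-prone part. The sketch below assembles the argument vector in Python so that no shell quoting is needed. `build_eval_command` is a hypothetical helper and not part of Data-Juicer; it assumes the script path and flag names shown above and that list arguments are passed as JSON-style strings, as in the command-line example.

```python
import json
import sys
from typing import List, Optional, Sequence, Union

# Script path as given in the Source Location section.
EVAL_SCRIPT = "data_juicer/tools/quality_classifier/eval.py"


def build_eval_command(
    positive_datasets: Union[str, Sequence[str]],
    negative_datasets: Union[str, Sequence[str]],
    model: str = "my_quality_model",
    tokenizer: Optional[str] = None,
    text_key: str = "text",
    python: str = sys.executable,
) -> List[str]:
    """Build an argv list for the eval script (hypothetical helper)."""

    def as_flag_value(paths: Union[str, Sequence[str]]) -> str:
        # A single path is passed through; a list is rendered as
        # '["a.parquet", "b.parquet"]', matching the CLI example.
        if isinstance(paths, str):
            return paths
        return json.dumps(list(paths))

    cmd = [
        python,
        EVAL_SCRIPT,
        f"--positive_datasets={as_flag_value(positive_datasets)}",
        f"--negative_datasets={as_flag_value(negative_datasets)}",
        f"--model={model}",
        f"--text_key={text_key}",
    ]
    # Omit the flag entirely so the tool falls back to its default tokenizer.
    if tokenizer is not None:
        cmd.append(f"--tokenizer={tokenizer}")
    return cmd


cmd = build_eval_command(
    positive_datasets=["pos1.parquet", "pos2.parquet"],
    negative_datasets=["neg1.parquet", "neg2.parquet"],
    model="gpt3",
)
print(cmd)
# To actually run the evaluation (requires Data-Juicer and PySpark):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the result to `subprocess.run` as a list, rather than as a single shell string, keeps the brackets and quotes inside each flag value intact.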