Implementation:Datajuicer Data juicer Quality Classifier Train

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Quality, Tooling
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for training custom quality classifier models from positive and negative datasets.

Description

main (the train entry point) is both a CLI tool and an importable function that trains a binary quality classifier using PySpark ML pipelines. It accepts paths to positive (high-quality) and negative (low-quality) datasets, loads and labels them (label=1 for positive, label=0 for negative), merges and shuffles the data, and optionally limits the number of training samples. It then splits the data into train and test sets using a configurable ratio, trains a HashingTF plus LogisticRegression pipeline via qc_utils.train, and saves the resulting PipelineModel. When evaluation is enabled, it scores the model on the held-out test set and reports precision, recall, and F1.
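The data preparation steps described above can be illustrated with a minimal pure-Python sketch. It is not the PySpark code in train.py: prepare_splits is a hypothetical helper, the cap is applied per class as the Inputs table describes, the exact order of shuffling and capping is an assumption, and an exact cut stands in for Spark's randomized split.

```python
import random


def prepare_splits(positive, negative, num_training_samples=0,
                   train_test_split_ratio=0.8, seed=42):
    """Label, cap, merge, shuffle, and split two lists of texts (illustrative only)."""
    rng = random.Random(seed)
    # Positives get label 1 and negatives label 0, as in the description.
    pos = [(text, 1) for text in positive]
    neg = [(text, 0) for text in negative]
    # Assumption: a positive cap limits each class separately; 0 keeps everything.
    if num_training_samples > 0:
        rng.shuffle(pos)
        rng.shuffle(neg)
        pos = pos[:num_training_samples]
        neg = neg[:num_training_samples]
    rows = pos + neg
    rng.shuffle(rows)
    # Exact cut for clarity; a randomized split only approximates the ratio.
    cut = int(len(rows) * train_test_split_ratio)
    return rows[:cut], rows[cut:]


train_rows, test_rows = prepare_splits(["good text"] * 10, ["bad text"] * 10,
                                       num_training_samples=5)
print(len(train_rows), len(test_rows))
```

With five samples per class and a ratio of 0.8, the sketch keeps ten rows in total and holds two of them out for testing.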

Usage

Use when you need to create a domain-specific quality classifier tailored to your data, extending beyond the three pre-trained models (gpt3, chinese, code) shipped with Data-Juicer.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/tools/quality_classifier/train.py

Signature

@logger.catch(reraise=True)
def main(
    positive_datasets,
    negative_datasets,
    output_model_path="my_quality_model",
    num_training_samples=0,
    train_test_split_ratio=0.8,
    tokenizer=None,
    evaluation=True,
    text_key="text",
):
    """
    Train a quality classifier using your own pos/neg datasets.

    Args:
        positive_datasets: path(s) to positive datasets (str or list of str).
        negative_datasets: path(s) to negative datasets (str or list of str).
        output_model_path: path to store the trained model. Default: "my_quality_model".
        num_training_samples: number of samples to train with (0 = all). Default: 0.
        train_test_split_ratio: ratio for train/test split. Default: 0.8.
        tokenizer: sentencepiece tokenizer path or None for PySpark default.
        evaluation: whether to evaluate after training. Default: True.
        text_key: column name holding text content. Default: "text".
    """

Import

from data_juicer.tools.quality_classifier.train import main as train_quality_classifier

I/O Contract

Inputs

Name	Type	Required	Description
positive_datasets	str or list[str]	Yes	Path(s) to high-quality (positive) datasets
negative_datasets	str or list[str]	Yes	Path(s) to low-quality (negative) datasets
output_model_path	str	No	Directory to save the trained PipelineModel. Default: "my_quality_model"
num_training_samples	int	No	Max samples per class for training (0 = unlimited). Default: 0
train_test_split_ratio	float	No	Fraction of samples assigned to the training split; the remainder is held out for testing. Default: 0.8
tokenizer	str or None	No	Sentencepiece model path, or None for PySpark standard Tokenizer
evaluation	bool	No	Whether to run evaluation after training. Default: True
text_key	str	No	Column name holding text content. Default: "text"
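Since both dataset arguments accept either a single path or a list of paths, a caller may want to normalize and sanity-check them before starting a training run. The helper below is a hypothetical sketch of such a pre-check; normalize_inputs is not part of Data-Juicer, and whether train.py performs the same validation is not stated on this page.

```python
def normalize_inputs(positive_datasets, negative_datasets,
                     train_test_split_ratio=0.8):
    """Coerce dataset arguments to lists and check the split ratio (illustrative)."""

    def as_list(paths):
        # A bare string is treated as a single path, not as a sequence of characters.
        return [paths] if isinstance(paths, str) else list(paths)

    positive = as_list(positive_datasets)
    negative = as_list(negative_datasets)
    if not positive or not negative:
        raise ValueError("both positive and negative datasets are required")
    if not 0.0 < train_test_split_ratio < 1.0:
        raise ValueError("train_test_split_ratio must be strictly between 0 and 1")
    return positive, negative


print(normalize_inputs("wiki.parquet", ["cc_samples.parquet"]))
```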

Outputs

Name	Type	Description
trained model	PipelineModel directory on disk	Saved PySpark PipelineModel at output_model_path
evaluation metrics	Log output	Precision, recall, and F1 printed to log when evaluation=True
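For readers who want to interpret the logged metrics, the sketch below computes precision, recall, and F1 for the positive class from lists of true and predicted labels using the standard definitions. The function precision_recall_f1 is a hypothetical helper for illustration and is not the evaluation code in Data-Juicer.

```python
def precision_recall_f1(labels, predictions):
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


print(precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0]))
```

Precision reflects how many samples flagged as high quality really are, while recall reflects how many high-quality samples the classifier finds; F1 is their harmonic mean.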

Usage Examples

CLI Usage

# Train a quality classifier from the command line
python -m data_juicer.tools.quality_classifier.train \
    --positive_datasets '["wiki.parquet", "books.parquet"]' \
    --negative_datasets '["cc_samples.parquet"]' \
    --output_model_path ./models/my_quality_model \
    --tokenizer zh.sp.model \
    --evaluation True
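When the CLI is launched from another Python script, building the argument list programmatically avoids shell quoting problems with the list-valued arguments. The sketch below assembles the same command as above; build_train_command is a hypothetical helper, and it only uses the flags shown on this page.

```python
import json
import sys


def build_train_command(positive, negative, output_model_path, **options):
    """Build an argv list for the training CLI (illustrative helper)."""
    command = [
        sys.executable, "-m", "data_juicer.tools.quality_classifier.train",
        # Lists are serialized in the same bracketed form used in the CLI example.
        "--positive_datasets", json.dumps(positive),
        "--negative_datasets", json.dumps(negative),
        "--output_model_path", output_model_path,
    ]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


command = build_train_command(
    ["wiki.parquet", "books.parquet"],
    ["cc_samples.parquet"],
    "./models/my_quality_model",
    tokenizer="zh.sp.model",
    evaluation=True,
)
print(command)
```

The resulting list can be passed to subprocess.run(command, check=True), which runs the tool without going through a shell.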

Programmatic Usage

from data_juicer.tools.quality_classifier.train import main as train_quality_classifier

# Train a custom quality classifier
train_quality_classifier(
    positive_datasets=['./data/high_quality.jsonl'],
    negative_datasets=['./data/low_quality.jsonl'],
    output_model_path='./models/custom_classifier',
    num_training_samples=10000,
    train_test_split_ratio=0.8,
    evaluation=True
)

Related Pages

Requires Environment

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
