Implementation:Datajuicer Data juicer Quality Classifier Train
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Tooling |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for training custom quality classifier models from positive and negative datasets.
Description
main (the entry point of train.py) is both a CLI tool and an importable function that trains a binary quality classifier using PySpark ML pipelines. It accepts paths to positive (high-quality) and negative (low-quality) datasets, loads and labels them (label=1 for positive, label=0 for negative), then merges and shuffles the data. It can optionally cap the number of training samples, and it splits the data into train and test sets using a configurable ratio. It then trains a HashingTF plus LogisticRegression pipeline via qc_utils.train and saves the resulting PipelineModel. When evaluation is enabled, it scores the model on the held-out test set and reports precision, recall, and F1.
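The data-preparation steps described above can be sketched in plain Python. This is an illustration only, not the tool's code: it works on in-memory lists rather than Spark DataFrames, the function name prepare_splits is hypothetical, it assumes the sample cap is applied per class (as the Inputs table states), and it uses an exact cut where a Spark random split would only approximate the ratio.

```python
import random


def prepare_splits(positive_texts, negative_texts,
                   num_training_samples=0, train_test_split_ratio=0.8,
                   seed=42):
    """Label, shuffle, optionally cap, merge, and split (text, label) rows.

    Hypothetical helper for illustration; the real tool does this with
    PySpark DataFrames.
    """
    rng = random.Random(seed)

    # Label positives with 1 and negatives with 0.
    pos = [(text, 1) for text in positive_texts]
    neg = [(text, 0) for text in negative_texts]

    # Shuffle each class so that a cap keeps a random subset.
    rng.shuffle(pos)
    rng.shuffle(neg)

    # Assumption: the cap applies per class; 0 means "use everything".
    if num_training_samples > 0:
        pos = pos[:num_training_samples]
        neg = neg[:num_training_samples]

    # Merge both classes and shuffle again before splitting.
    rows = pos + neg
    rng.shuffle(rows)

    # Exact cut at the requested ratio (a simplification of a random split).
    cut = int(len(rows) * train_test_split_ratio)
    return rows[:cut], rows[cut:]
```

With ten positive and ten negative texts and num_training_samples=5, this sketch keeps five rows per class and returns eight training rows and two test rows.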
Usage
Use when you need to create a domain-specific quality classifier tailored to your data, extending beyond the three pre-trained models (gpt3, chinese, code) shipped with Data-Juicer.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/tools/quality_classifier/train.py
Signature
@logger.catch(reraise=True)
def main(
    positive_datasets,
    negative_datasets,
    output_model_path="my_quality_model",
    num_training_samples=0,
    train_test_split_ratio=0.8,
    tokenizer=None,
    evaluation=True,
    text_key="text",
):
    """
    Train a quality classifier using your own pos/neg datasets.
    Args:
        positive_datasets: path(s) to positive datasets (str or list of str).
        negative_datasets: path(s) to negative datasets (str or list of str).
        output_model_path: path to store the trained model. Default: "my_quality_model".
        num_training_samples: number of samples to train with (0 = all). Default: 0.
        train_test_split_ratio: ratio for train/test split. Default: 0.8.
        tokenizer: sentencepiece tokenizer path or None for PySpark default.
        evaluation: whether to evaluate after training. Default: True.
        text_key: column name holding text content. Default: "text".
    """
Import
from data_juicer.tools.quality_classifier.train import main as train_quality_classifier
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| positive_datasets | str or list[str] | Yes | Path(s) to high-quality (positive) datasets |
| negative_datasets | str or list[str] | Yes | Path(s) to low-quality (negative) datasets |
| output_model_path | str | No | Directory to save the trained PipelineModel. Default: "my_quality_model" |
| num_training_samples | int | No | Max samples per class for training (0 = unlimited). Default: 0 |
| train_test_split_ratio | float | No | Train/test split ratio. Default: 0.8 |
| tokenizer | str or None | No | Sentencepiece model path, or None for PySpark standard Tokenizer |
| evaluation | bool | No | Whether to run evaluation after training. Default: True |
| text_key | str | No | Column name holding text content. Default: "text" |
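Because positive_datasets and negative_datasets accept either a single path or a list of paths, a caller or wrapper may want to normalize both forms before iterating. A minimal sketch follows; the helper name as_path_list is hypothetical and is not part of Data-Juicer.

```python
def as_path_list(datasets):
    """Return dataset paths as a list, wrapping a single string path.

    Hypothetical helper for illustration; not a Data-Juicer function.
    """
    if isinstance(datasets, str):
        # A lone path such as "pos.parquet" becomes a one-element list.
        return [datasets]
    # Any other iterable of paths is copied into a new list.
    return list(datasets)
```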
Outputs
| Name | Type | Description |
|---|---|---|
| trained model | PipelineModel directory on disk | Saved PySpark PipelineModel at output_model_path |
| evaluation metrics | Log output | Precision, recall, and F1 printed to log when evaluation=True |
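For reference, the three reported metrics can be computed from binary labels and predictions as sketched below. This is a plain-Python illustration of the standard definitions, treating label 1 as the positive class; the function name binary_metrics is hypothetical, and the tool's own evaluation code may be organized differently.

```python
def binary_metrics(labels, predictions):
    """Compute precision, recall, and F1 with label 1 as the positive class.

    Hypothetical helper for illustration; not the tool's evaluation code.
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives

    # Guard against empty denominators by returning 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```

For labels [1, 1, 0, 0] and predictions [1, 0, 1, 0] there is one true positive, one false positive, and one false negative, so precision, recall, and F1 are all 0.5.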
Usage Examples
CLI Usage
# Train a quality classifier from the command line
python -m data_juicer.tools.quality_classifier.train \
    --positive_datasets '["wiki.parquet", "books.parquet"]' \
    --negative_datasets '["cc_samples.parquet"]' \
    --output_model_path ./models/my_quality_model \
    --tokenizer zh.sp.model \
    --evaluation True
Programmatic Usage
from data_juicer.tools.quality_classifier.train import main as train_quality_classifier
# Train a custom quality classifier
train_quality_classifier(
    positive_datasets=['./data/high_quality.jsonl'],
    negative_datasets=['./data/low_quality.jsonl'],
    output_model_path='./models/custom_classifier',
    num_training_samples=10000,
    train_test_split_ratio=0.8,
    evaluation=True,
)