Workflow: Data-Juicer Text Data Processing Pipeline

Knowledge Sources	Data-Juicer, Data-Juicer Docs, Data-Juicer 2.0
Domains	Data_Engineering, LLM_Ops, Data_Cleaning
Last Updated	2026-02-14 16:00 GMT

Overview

End-to-end process for cleaning and filtering raw text datasets into high-quality training data for large language models using Data-Juicer's YAML-configured operator pipeline.

Description

This workflow implements the primary data processing pipeline in Data-Juicer. It takes a raw dataset (JSONL, CSV, Parquet, or text), applies a configurable sequence of operators (filters, mappers, deduplicators, selectors) defined in a YAML configuration file, and exports a cleaned, filtered dataset. The pipeline supports text cleaning (removing HTML, emails, links), quality filtering (language detection, perplexity, repetition), content normalization (whitespace, Unicode), and deduplication (exact, or near-duplicate via MinHash/SimHash). The execution engine handles dataset loading, operator instantiation from the registry, optional operator fusion for performance, adaptive batch sizing, checkpointing for fault tolerance, and tracing for debugging.

Usage

Execute this workflow when you have a raw text dataset (e.g., web-scraped corpus, forum dumps, document collections) and need to produce a cleaned, filtered dataset suitable for LLM pre-training or fine-tuning. Typical inputs are JSONL files with a text field. The output is a processed dataset exported to JSONL, Parquet, or other formats at a specified path.

Execution Steps

Step 1: Define Configuration

Create a YAML configuration file specifying the project name, input dataset path, export path, number of worker processes, and the ordered list of operators to apply. Each operator entry includes the operator name (matching the registry key) and its parameters. Data-Juicer uses jsonargparse for configuration, supporting hierarchical configs, environment variables, and command-line overrides.

Key considerations:

Operators are applied in the order they appear in the process list
Each operator is identified by its registered name (e.g., text_length_filter, clean_links_mapper)
Parameters can be overridden via CLI flags using dot notation
The executor_type field defaults to "default", which selects single-machine processing
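
The configuration described above can be sketched as a plain Python dictionary that mirrors the YAML layout. This is a minimal illustration, not Data-Juicer's own loader: the paths and parameter values are placeholders, and apply_override is a hypothetical helper showing one way a dot-notation override might be merged. The exact CLI syntax is handled by jsonargparse and should be checked against the Data-Juicer documentation.

```python
import copy

# Illustrative config mirroring the fields described above. Paths and
# parameter values are placeholders, not recommendations.
config = {
    "project_name": "demo-text-cleaning",
    "dataset_path": "data/raw.jsonl",
    "export_path": "outputs/cleaned.jsonl",
    "np": 4,  # number of worker processes
    "executor_type": "default",
    "process": [  # operators run top to bottom
        {"clean_links_mapper": {}},
        {"text_length_filter": {"min_len": 10, "max_len": 10000}},
    ],
}


def apply_override(cfg, dotted_key, value):
    """Return a copy of cfg with one override applied (hypothetical helper).

    A key without a dot is set at the top level. A key such as
    "text_length_filter.min_len" updates the matching operator entry.
    """
    new_cfg = copy.deepcopy(cfg)
    head, _, rest = dotted_key.partition(".")
    if not rest:
        new_cfg[head] = value
        return new_cfg
    for entry in new_cfg["process"]:
        if head in entry:
            entry[head][rest] = value
            return new_cfg
    raise KeyError(f"no operator named {head!r} in the process list")


overridden = apply_override(config, "text_length_filter.min_len", 50)
operator_order = [next(iter(entry)) for entry in overridden["process"]]
print(operator_order)
print(overridden["process"][1]["text_length_filter"])
```

The override returns a modified copy, so the base configuration stays unchanged and can be reused across runs.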

Step 2: Load Dataset

The DatasetBuilder loads the input dataset from the configured path. It supports multiple formats (JSONL, CSV, Parquet, TSV, plain text) and sources (local filesystem, S3, HuggingFace Hub). The loaded data is wrapped in a NestedDataset (extending HuggingFace Dataset) that supports nested field access and multimodal data references.

Key considerations:

Dataset path can point to a single file or a directory of files
The text_keys configuration controls which field contains the primary text
Additional keys (image_key, video_key, audio_key) handle multimodal references
Data validation can be enabled via the validators configuration
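
As a rough illustration of the loading step, the sketch below parses JSONL, checks the configured text field, and resolves a nested field by a dotted key. It uses only the standard library and is not the DatasetBuilder or NestedDataset implementation. The sample data, the "meta.source" field, and the load_jsonl, get_nested, and validate helpers are all assumptions made for this example.

```python
import io
import json

# In-memory stand-in for a JSONL file with a "text" field per line.
RAW_JSONL = io.StringIO(
    '{"text": "Hello world", "meta": {"source": "forum"}}\n'
    '{"text": "Second sample", "meta": {"source": "web"}}\n'
)


def load_jsonl(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


def get_nested(sample, dotted_key):
    """Resolve a dotted key such as "meta.source" inside nested dicts."""
    value = sample
    for part in dotted_key.split("."):
        value = value[part]
    return value


def validate(samples, text_key):
    """Hypothetical validator: every sample needs a string in text_key."""
    for index, sample in enumerate(samples):
        if not isinstance(sample.get(text_key), str):
            raise ValueError(f"sample {index} has no string field {text_key!r}")


samples = load_jsonl(RAW_JSONL)
validate(samples, text_key="text")
sources = [get_nested(s, "meta.source") for s in samples]
print(sources)
```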

Step 3: Instantiate Operators

The load_ops function iterates through the process list, looks up each operator name in the OPERATORS registry, and instantiates it with the provided arguments. The registry pattern allows dynamic discovery of all built-in and custom operators. Custom operators can be loaded from external paths via the custom_operator_paths configuration.

Key considerations:

Operators are registered via the @OPERATORS.register_module() decorator
The operator type hierarchy includes Filter, Mapper, Deduplicator, Selector, Grouper, and Aggregator
Each operator class inherits its compute method (such as compute_stats for filters) and its process method from its base class
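
The registry pattern can be sketched in a few lines. The Registry class below is a simplified stand-in rather than Data-Juicer's actual registry, the operator classes are empty placeholders, and passing the operator name as an argument to register_module is an assumption made for this example.

```python
class Registry:
    """Minimal name-to-class registry (a stand-in, not Data-Juicer's class)."""

    def __init__(self):
        self._modules = {}

    def register_module(self, name):
        """Return a decorator that stores the class under the given name."""
        def decorator(cls):
            if name in self._modules:
                raise KeyError(f"operator {name!r} is already registered")
            self._modules[name] = cls
            return cls
        return decorator

    def get(self, name):
        return self._modules[name]


OPERATORS = Registry()


class Operator:
    """Placeholder base class that just stores its parameters."""

    def __init__(self, **params):
        self.params = params


@OPERATORS.register_module("clean_links_mapper")
class CleanLinksMapper(Operator):
    pass


@OPERATORS.register_module("text_length_filter")
class TextLengthFilter(Operator):
    pass


def load_ops(process_list):
    """Instantiate operators in the order they appear in the process list."""
    ops = []
    for entry in process_list:
        (name, args), = entry.items()  # each entry holds exactly one operator
        ops.append(OPERATORS.get(name)(**(args or {})))
    return ops


ops = load_ops([
    {"clean_links_mapper": None},
    {"text_length_filter": {"min_len": 10, "max_len": 10000}},
])
print([type(op).__name__ for op in ops])
```

Because lookup goes through the registry, a custom operator only needs to be imported (and thereby registered) before load_ops runs.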

Step 4: Optimize Execution

If enabled, operator fusion merges compatible consecutive operators (e.g., multiple filters sharing the same model) into fused operators that process data in a single pass. Adaptive batch sizing probes a small data sample to determine optimal batch sizes per operator based on available resources.

Key considerations:

Operator fusion can yield 2-10x speedups for compatible operator chains
The fusion_strategy can be set to "probe", which measures each operator's speed on a sample and reorders operators accordingly
Adaptive batch sizing adjusts per-operator batch sizes based on resource profiling
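
The idea behind fusion can be illustrated with two filters that depend on the same intermediate result (here, a whitespace tokenization). This is a toy sketch under stated assumptions: MinWordsFilter, MaxRepetitionFilter, FusedFilter, and the "shared" attribute are hypothetical names, and the real fusion logic in Data-Juicer is more involved.

```python
from itertools import groupby

tokenize_calls = 0  # counts how often the shared intermediate is computed


def tokenize(text):
    global tokenize_calls
    tokenize_calls += 1
    return text.split()


class MinWordsFilter:
    shared = "words"  # name of the intermediate this filter depends on

    def __init__(self, min_words):
        self.min_words = min_words

    def keep(self, words):
        return len(words) >= self.min_words


class MaxRepetitionFilter:
    shared = "words"

    def __init__(self, max_ratio):
        self.max_ratio = max_ratio

    def keep(self, words):
        if not words:
            return False
        repeated = 1 - len(set(words)) / len(words)
        return repeated <= self.max_ratio


class FusedFilter:
    """Runs several filters over one shared tokenization."""

    def __init__(self, filters):
        self.filters = list(filters)

    def keep(self, text):
        words = tokenize(text)  # computed once for the whole group
        return all(f.keep(words) for f in self.filters)


def fuse_consecutive(filters):
    """Group consecutive filters that share the same intermediate."""
    return [FusedFilter(group) for _, group in groupby(filters, key=lambda f: f.shared)]


texts = ["a a a a a a", "the quick brown fox jumps", "short"]
fused = fuse_consecutive([MinWordsFilter(3), MaxRepetitionFilter(0.5)])
kept = [t for t in texts if all(f.keep(t) for f in fused)]
print(kept, tokenize_calls)
```

Without fusion, each filter would tokenize every text on its own, so the saving grows with the cost of the shared computation (for example, a model forward pass).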

Step 5: Process Data

The executor runs the operator pipeline over the dataset. Each operator processes the data through its compute_stats (for filters) and process methods. Filters compute statistics first, then decide which samples to keep. Mappers transform samples in place. The pipeline supports checkpointing (saving intermediate state after each operator), tracing (recording sample-level changes), and monitoring (CPU/memory/disk usage).

Key considerations:

Filters operate in two phases: compute_stats then process (keep/reject decision)
Mappers transform sample content via process_single or process_batched
Checkpointing enables resumption from the last completed operator on failure
Tracing records which samples were modified or removed at each step
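
The two-phase filter behavior and the in-place mapper transformation can be sketched as follows. This is a simplified model, not the Data-Juicer executor: the "stats" key, the trace tuple format, and run_pipeline are assumptions for illustration, and checkpointing and monitoring are omitted.

```python
import re


class CleanLinksMapper:
    name = "clean_links_mapper"

    def process_single(self, sample):
        # Drop URLs, then collapse the whitespace they leave behind.
        cleaned = re.sub(r"https?://\S+", "", sample["text"])
        return {**sample, "text": " ".join(cleaned.split())}


class TextLengthFilter:
    name = "text_length_filter"

    def __init__(self, min_len, max_len):
        self.min_len = min_len
        self.max_len = max_len

    def compute_stats(self, sample):
        # Phase 1: record the statistic without deciding anything yet.
        stats = dict(sample.get("stats", {}))
        stats["text_len"] = len(sample["text"])
        return {**sample, "stats": stats}

    def process(self, sample):
        # Phase 2: keep/reject decision based on the stored statistic.
        return self.min_len <= sample["stats"]["text_len"] <= self.max_len


def run_pipeline(samples, ops):
    """Apply operators in order and record a per-operator trace."""
    trace = []
    for op in ops:
        if hasattr(op, "compute_stats"):  # filter
            before = len(samples)
            samples = [op.compute_stats(s) for s in samples]
            samples = [s for s in samples if op.process(s)]
            trace.append((op.name, "removed", before - len(samples)))
        else:  # mapper
            mapped = [op.process_single(s) for s in samples]
            changed = sum(a["text"] != b["text"] for a, b in zip(samples, mapped))
            samples = mapped
            trace.append((op.name, "modified", changed))
    return samples, trace


raw = [
    {"text": "See https://example.com for details"},
    {"text": "hi"},
    {"text": "A clean sentence with enough characters."},
]
result, trace = run_pipeline(raw, [CleanLinksMapper(), TextLengthFilter(10, 1000)])
print([s["text"] for s in result])
print(trace)
```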

Step 6: Export Results

The Exporter writes the processed dataset to the configured output path. It supports multiple output formats (JSONL, Parquet), optional sharding for large datasets, parallel export, and S3 output destinations. Stats columns can optionally be retained in the output for downstream analysis.

Key considerations:

Output format is inferred from the export_path suffix or set via export_type
Shard size controls file splitting for large outputs
The keep_stats_in_res_ds flag preserves computed statistics in the output
Cache compression can be applied after export to reduce storage
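
A sharded JSONL export with optional stats retention might look like the sketch below. It is a simplification rather than the Exporter itself: export_jsonl is a hypothetical function, shard_size here counts samples per file (an assumption made for brevity), the "stats" key and the shard naming scheme are also assumptions, and parallel export and S3 destinations are left out.

```python
import json
import tempfile
from pathlib import Path


def export_jsonl(samples, export_path, shard_size=None, keep_stats=False):
    """Write samples as JSONL, optionally split into numbered shards."""
    export_path = Path(export_path)
    export_path.parent.mkdir(parents=True, exist_ok=True)

    # Drop the stats column unless the caller asked to keep it.
    rows = [
        s if keep_stats else {k: v for k, v in s.items() if k != "stats"}
        for s in samples
    ]
    if shard_size is None:
        shards = [rows]
    else:
        shards = [rows[i:i + shard_size] for i in range(0, len(rows), shard_size)]

    written = []
    for index, shard in enumerate(shards):
        if shard_size is None:
            path = export_path
        else:
            path = export_path.with_name(
                f"{export_path.stem}-{index:05d}{export_path.suffix}"
            )
        with path.open("w", encoding="utf-8") as handle:
            for row in shard:
                handle.write(json.dumps(row, ensure_ascii=False) + "\n")
        written.append(path)
    return written


processed = [
    {"text": "first", "stats": {"text_len": 5}},
    {"text": "second", "stats": {"text_len": 6}},
    {"text": "third", "stats": {"text_len": 5}},
]
with tempfile.TemporaryDirectory() as tmp:
    paths = export_jsonl(processed, Path(tmp) / "cleaned.jsonl", shard_size=2)
    shard_names = [p.name for p in paths]
    first_rows = [
        json.loads(line)
        for line in paths[0].read_text(encoding="utf-8").splitlines()
    ]
print(shard_names)
```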
