Workflow:Datajuicer Data juicer Text Data Processing Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, LLM_Ops, Data_Cleaning |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
End-to-end process for cleaning and filtering raw text datasets into high-quality training data for large language models using Data-Juicer's YAML-configured operator pipeline.
Description
This workflow implements the primary data processing pipeline in Data-Juicer. It takes a raw dataset (JSONL, CSV, Parquet, or text), applies a configurable sequence of operators (filters, mappers, deduplicators, selectors) defined in a YAML configuration file, and exports a cleaned, filtered dataset. The pipeline supports text cleaning (removing HTML, emails, links), quality filtering (language detection, perplexity, repetition), content normalization (whitespace, Unicode), and deduplication (exact or near-duplicate via MinHash/SimHash). The execution engine handles dataset loading, operator instantiation from the registry, optional operator fusion for performance, adaptive batch sizing, checkpointing for fault tolerance, and tracing for debugging.
Usage
Execute this workflow when you have a raw text dataset (e.g., web-scraped corpus, forum dumps, document collections) and need to produce a cleaned, filtered dataset suitable for LLM pre-training or fine-tuning. Typical inputs are JSONL files with a text field. The output is a processed dataset exported to JSONL, Parquet, or other formats at a specified path.
Execution Steps
Step 1: Define Configuration
Create a YAML configuration file specifying the project name, input dataset path, export path, number of worker processes, and the ordered list of operators to apply. Each operator entry includes the operator name (matching the registry key) and its parameters. Data-Juicer uses jsonargparse for configuration, supporting hierarchical configs, environment variables, and command-line overrides.
Key considerations:
- Operators are applied in the order they appear in the process list
- Each operator is identified by its registered name (e.g., text_length_filter, clean_links_mapper)
- Parameters can be overridden via CLI flags using dot notation
- The executor_type field defaults to "default", which selects single-machine processing
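As a minimal sketch of such a configuration, the snippet below builds the settings as a Python dictionary and renders them in the YAML layout described above. The project name, paths, and thresholds are placeholders, and render_config is a hypothetical helper written only for this example; it is not part of Data-Juicer.

```python
# Placeholder values throughout. Only the key names (project_name, dataset_path,
# export_path, np, process) and the operator names follow Data-Juicer conventions.
config = {
    "project_name": "demo-text-cleaning",
    "dataset_path": "data/raw.jsonl",
    "export_path": "outputs/cleaned.jsonl",
    "np": 4,  # number of worker processes
    "process": [  # operators run in exactly this order
        {"clean_links_mapper": {}},
        {"whitespace_normalization_mapper": {}},
        {"text_length_filter": {"min_len": 50, "max_len": 10000}},
        {"document_deduplicator": {"lowercase": True}},
    ],
}


def render_config(cfg):
    """Hypothetical helper: render a config of this exact shape as YAML text."""
    lines = []
    for key, value in cfg.items():
        if key != "process":
            lines.append(f"{key}: {value}")
            continue
        lines.append("process:")
        for entry in value:
            for op_name, params in entry.items():
                lines.append(f"  - {op_name}:")
                for name, val in params.items():
                    # YAML spells booleans in lowercase
                    text = str(val).lower() if isinstance(val, bool) else str(val)
                    lines.append(f"      {name}: {text}")
    return "\n".join(lines) + "\n"


yaml_text = render_config(config)
print(yaml_text)
```

Saved to a file, the rendered text is what gets passed to the processing entry point. The exact command depends on the installed version, so check it against your installation before relying on it.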
Step 2: Load Dataset
The DatasetBuilder loads the input dataset from the configured path. It supports multiple formats (JSONL, CSV, Parquet, TSV, plain text) and sources (local filesystem, S3, HuggingFace Hub). The loaded data is wrapped in a NestedDataset (extending HuggingFace Dataset) that supports nested field access and multimodal data references.
Key considerations:
- Dataset path can point to a single file or a directory of files
- The text_keys configuration controls which field contains the primary text
- Additional keys (image_key, video_key, audio_key) handle multimodal references
- Data validation can be enabled via the validators configuration
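The loading behavior can be approximated with the standard library alone. The sketch below is not the DatasetBuilder API: load_jsonl and its text_key parameter are hypothetical names, and the validation rule (keep only records whose text field is a string) is an assumption chosen for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(path, text_key="text"):
    """Read one JSONL file or every *.jsonl file in a directory.

    Returns the records that carry a string under text_key, plus a count
    of the records rejected by that check.
    """
    path = Path(path)
    files = sorted(path.glob("*.jsonl")) if path.is_dir() else [path]
    samples, rejected = [], 0
    for file in files:
        with file.open(encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines
                record = json.loads(line)
                if isinstance(record, dict) and isinstance(record.get(text_key), str):
                    samples.append(record)
                else:
                    rejected += 1  # missing or non-string text field
    return samples, rejected


# Demo: a directory holding two small JSONL shards.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.jsonl").write_text(
        '{"text": "first doc"}\n{"title": "no text"}\n', encoding="utf-8"
    )
    Path(tmp, "b.jsonl").write_text(
        '{"text": "second doc", "meta": {"lang": "en"}}\n', encoding="utf-8"
    )
    samples, rejected = load_jsonl(tmp)

print(len(samples), rejected)
```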
Step 3: Instantiate Operators
The load_ops function iterates through the process list, looks up each operator name in the OPERATORS registry, and instantiates it with the provided arguments. The registry pattern allows dynamic discovery of all built-in and custom operators. Custom operators can be loaded from external paths via the custom_operator_paths configuration.
Key considerations:
- Operators are registered via the @OPERATORS.register_module() decorator
- The operator type hierarchy includes Filter, Mapper, Deduplicator, Selector, Grouper, and Aggregator
- Each operator class inherits its processing interface from its base class (for example, compute_stats and process for filters)
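The registry pattern described in this step can be sketched in a few lines. The Registry class below is a simplified stand-in, not Data-Juicer's implementation, and the two operator classes are empty placeholders that only borrow real operator names; load_ops here mirrors the lookup-and-instantiate behavior described above under the assumption that each process entry maps one operator name to its arguments.

```python
class Registry:
    """Simplified stand-in for a name-to-class operator registry."""

    def __init__(self):
        self._modules = {}

    def register_module(self, name):
        def decorator(cls):
            if name in self._modules:
                raise KeyError(f"operator {name!r} is already registered")
            self._modules[name] = cls
            return cls  # leave the class itself unchanged

        return decorator

    def get(self, name):
        if name not in self._modules:
            raise KeyError(f"unknown operator: {name!r}")
        return self._modules[name]


OPERATORS = Registry()


@OPERATORS.register_module("text_length_filter")
class TextLengthFilter:
    def __init__(self, min_len=10, max_len=10_000):
        self.min_len, self.max_len = min_len, max_len


@OPERATORS.register_module("clean_links_mapper")
class CleanLinksMapper:
    pass


def load_ops(process_list):
    """Instantiate operators in list order from {name: params} entries."""
    ops = []
    for entry in process_list:
        for op_name, params in entry.items():
            ops.append(OPERATORS.get(op_name)(**(params or {})))
    return ops


ops = load_ops([
    {"clean_links_mapper": None},
    {"text_length_filter": {"min_len": 50}},
])
print([type(op).__name__ for op in ops])
```

Because lookup happens by name at load time, a custom operator only needs to be imported (and thereby registered) before the process list is resolved.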
Step 4: Optimize Execution
If enabled, operator fusion merges compatible consecutive operators (e.g., multiple filters sharing the same model) into fused operators that process data in a single pass. Adaptive batch sizing probes a small data sample to determine optimal batch sizes per operator based on available resources.
Key considerations:
- Operator fusion can yield 2-10x speedups for compatible operator chains
- The fusion_strategy option can be set to "probe" to reorder operators based on measured speed
- Adaptive batch sizing adjusts per-operator batch sizes based on resource profiling
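To make the single-pass idea concrete, here is a minimal sketch that groups consecutive filters into one fused operator. It is deliberately simpler than real fusion, which targets operators that share intermediate results such as a loaded model: this version fuses any adjacent filters. All class and function names are hypothetical.

```python
class LowercaseMapper:
    def process(self, sample):
        sample["text"] = sample["text"].lower()
        return sample


class WordCountFilter:
    def __init__(self, min_words):
        self.min_words = min_words

    def compute_stats(self, sample):
        sample["stats"]["num_words"] = len(sample["text"].split())
        return sample

    def process(self, sample):
        return sample["stats"]["num_words"] >= self.min_words


class CharLengthFilter:
    def __init__(self, max_len):
        self.max_len = max_len

    def compute_stats(self, sample):
        sample["stats"]["text_len"] = len(sample["text"])
        return sample

    def process(self, sample):
        return sample["stats"]["text_len"] <= self.max_len


class FusedFilter:
    """Runs several filters over each sample in one traversal."""

    def __init__(self, filters):
        self.filters = list(filters)

    def compute_stats(self, sample):
        for flt in self.filters:
            sample = flt.compute_stats(sample)
        return sample

    def process(self, sample):
        return all(flt.process(sample) for flt in self.filters)


def fuse_consecutive_filters(ops):
    """Replace each run of two or more adjacent filters with a FusedFilter."""
    plan, run = [], []
    for op in list(ops) + [None]:  # the None sentinel flushes the final run
        if op is not None and hasattr(op, "compute_stats"):
            run.append(op)
            continue
        if len(run) > 1:
            plan.append(FusedFilter(run))
        else:
            plan.extend(run)
        run = []
        if op is not None:
            plan.append(op)
    return plan


ops = [LowercaseMapper(), WordCountFilter(min_words=3), CharLengthFilter(max_len=40)]
plan = fuse_consecutive_filters(ops)
fused = plan[1]
sample = fused.compute_stats({"text": "short but fine", "stats": {}})
keep = fused.process(sample)
print(len(plan), sample["stats"], keep)
```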
Step 5: Process Data
The executor runs the operator pipeline over the dataset. Each operator processes the data through its compute_stats (for filters) and process methods. Filters compute statistics first, then decide which samples to keep. Mappers transform samples in place. The pipeline supports checkpointing (saving intermediate state after each operator), tracing (recording sample-level changes), and monitoring (CPU/memory/disk usage).
Key considerations:
- Filters operate in two phases: compute_stats then process (keep/reject decision)
- Mappers transform sample content via process_single or process_batched
- Checkpointing enables resumption from the last completed operator on failure
- Tracing records which samples were modified or removed at each step
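The two-phase filter contract, in-place mapping, tracing, and operator-level checkpointing can be illustrated together in a small runner. This is a sketch under simplifying assumptions, not the Data-Juicer executor: run_pipeline, the start_at resume index, and the trace record layout are all hypothetical, and the link regex is a rough approximation.

```python
import re


class CleanLinksMapper:
    name = "clean_links_mapper"

    def process_single(self, sample):
        # Rough URL pattern, for illustration only.
        sample["text"] = re.sub(r"https?://\S+", "", sample["text"]).strip()
        return sample


class TextLengthFilter:
    name = "text_length_filter"

    def __init__(self, min_len):
        self.min_len = min_len

    def compute_stats(self, sample):
        # Phase 1: record the statistic on the sample.
        sample.setdefault("stats", {})["text_len"] = len(sample["text"])
        return sample

    def process(self, sample):
        # Phase 2: keep/reject decision based on the recorded statistic.
        return sample["stats"]["text_len"] >= self.min_len


def run_pipeline(samples, ops, start_at=0):
    """Apply ops in order; return (samples, trace, index of next op to run)."""
    trace, checkpoint = [], start_at
    for index, op in enumerate(ops):
        if index < start_at:
            continue  # already completed before a resume
        modified = removed = 0
        if hasattr(op, "compute_stats"):
            with_stats = [op.compute_stats(dict(s)) for s in samples]
            kept = [s for s in with_stats if op.process(s)]
            removed = len(with_stats) - len(kept)
            samples = kept
        else:
            out = []
            for s in samples:
                new = op.process_single(dict(s))
                if new["text"] != s["text"]:
                    modified += 1
                out.append(new)
            samples = out
        trace.append({"op": op.name, "modified": modified, "removed": removed})
        checkpoint = index + 1  # a real executor would persist state here
    return samples, trace, checkpoint


raw = [
    {"text": "see https://example.com now"},
    {"text": "a long enough sentence to keep"},
]
ops = [CleanLinksMapper(), TextLengthFilter(min_len=10)]
result, trace, checkpoint = run_pipeline(raw, ops)
print([s["text"] for s in result], trace, checkpoint)
```

After a failure, passing the last saved checkpoint as start_at together with the saved intermediate samples would skip the operators that already finished.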
Step 6: Export Results
The Exporter writes the processed dataset to the configured output path. It supports multiple output formats (JSONL, Parquet), optional sharding for large datasets, parallel export, and S3 output destinations. Stats columns can optionally be retained in the output for downstream analysis.
Key considerations:
- Output format is inferred from the export_path suffix or set via export_type
- Shard size controls file splitting for large outputs
- The keep_stats_in_res_ds flag preserves computed statistics in the output
- Cache compression can be applied after export to reduce storage
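A sharded JSONL export with optional stats retention might look like the sketch below. It does not use the Exporter class: export_jsonl, the shard file naming scheme, and counting shard_size in samples rather than bytes are illustrative assumptions, and the stats column name is assumed as well.

```python
import json
import tempfile
from pathlib import Path

STATS_KEY = "__dj__stats__"  # assumed name of the stats column


def export_jsonl(samples, export_path, shard_size=None, keep_stats=False):
    """Write samples as JSONL, split into shards of shard_size samples if set."""
    export_path = Path(export_path)
    export_path.parent.mkdir(parents=True, exist_ok=True)
    if not keep_stats:
        samples = [{k: v for k, v in s.items() if k != STATS_KEY} for s in samples]
    if shard_size:
        chunks = [samples[i:i + shard_size] for i in range(0, len(samples), shard_size)]
    else:
        chunks = [samples]
    written = []
    for index, chunk in enumerate(chunks):
        if shard_size:
            # Hypothetical naming scheme: <stem>-00000.jsonl, <stem>-00001.jsonl, ...
            target = export_path.with_name(
                f"{export_path.stem}-{index:05d}{export_path.suffix}"
            )
        else:
            target = export_path
        with target.open("w", encoding="utf-8") as handle:
            for sample in chunk:
                handle.write(json.dumps(sample, ensure_ascii=False) + "\n")
        written.append(target)
    return written


data = [{"text": f"doc {i}", STATS_KEY: {"text_len": 5}} for i in range(5)]
with tempfile.TemporaryDirectory() as tmp:
    paths = export_jsonl(data, Path(tmp, "out", "cleaned.jsonl"), shard_size=2)
    shard_names = [p.name for p in paths]
    first_record = json.loads(paths[0].read_text(encoding="utf-8").splitlines()[0])

print(shard_names, first_record)
```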