Workflow: Data-Juicer LLM-Powered Data Generation
| Field | Value |
|---|---|
| Domains | Data_Engineering, LLM_Ops, Data_Synthesis |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
End-to-end process for using LLM-based operators and Ray vLLM pipeline integrations to generate, augment, and refine training data at scale within Data-Juicer.
Description
This workflow leverages Data-Juicer's LLM-powered mapper operators and Ray vLLM pipeline infrastructure to synthesize and improve training data. It covers generating question-answer pairs from text, calibrating and optimizing existing QA data, producing image/video captions, and running batch LLM inference through vLLM. The pipeline combines traditional data processing operators (for pre-filtering and post-filtering) with LLM-based operators that call language models via API or local inference. The Ray vLLM pipeline operators enable high-throughput batch inference by managing vLLM engine lifecycle within Ray Actors, supporting both text-only LLM and vision-language model inference.
Usage
Execute this workflow when you need to generate synthetic training data from existing text or multimodal sources, improve the quality of existing QA pairs through calibration, or enrich datasets with model-generated annotations (captions, tags, sentiment labels, intent classifications). Typical scenarios include building instruction-tuning datasets, creating domain-specific QA training data, generating preference pairs for RLHF, and producing captions for image/video datasets.
Execution Steps
Step 1: Prepare Source Data
Organize the source dataset with the fields required by the chosen generation operators. For QA generation from text, ensure a text field with source content. For QA optimization, provide query and response fields. For image/video captioning, include file paths in the appropriate media key fields. Pre-filter the source data using standard operators to ensure input quality.
Key considerations:
- Different generation operators expect different input field schemas
- Pre-filtering removes low-quality inputs that would produce poor generations
- Text length filtering ensures inputs are neither too short nor too long for generation
- Language filtering ensures inputs match the target language
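The pre-filtering considerations above can be sketched as a config fragment. `text_length_filter` and `language_id_score_filter` are Data-Juicer operator names; the parameter names and thresholds shown here are illustrative assumptions to verify against the installed version's operator reference.

```python
import json

# Illustrative pre-filter stage for Step 1, expressed as the Python
# equivalent of a Data-Juicer YAML `process` list. Parameter names and
# thresholds are assumptions, not the library's guaranteed schema.
pre_filters = [
    # Keep samples whose text is neither too short nor too long for generation.
    {"text_length_filter": {"min_len": 50, "max_len": 8000}},
    # Keep samples confidently identified as the target language.
    {"language_id_score_filter": {"lang": "en", "min_score": 0.8}},
]

config_fragment = {"process": pre_filters}
print(json.dumps(config_fragment, indent=2))
```

The same structure serializes directly to the YAML `process:` section used in Step 3.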
Step 2: Configure LLM Backend
Set up the LLM inference backend. Data-Juicer supports API-based models (OpenAI-compatible endpoints) and local HuggingFace models. For API-based operators (generate_qa_from_text_mapper, calibrate_qa_mapper, optimize_qa_mapper), configure the API endpoint and model name. For Ray vLLM pipeline operators, specify the HuggingFace model name, accelerator type, and engine parameters.
Key considerations:
- API-based operators work with any OpenAI-compatible endpoint
- Ray vLLM pipelines require GPU accelerators and Ray cluster setup
- Sampling parameters (temperature, top_p, max_new_tokens) control generation quality
- Engine kwargs (max_model_len, tensor_parallel_size) control vLLM resource allocation
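As a sketch, the two backend styles described above might be configured as follows. The operator names come from this workflow; the key names (`api_model`, `api_endpoint`, `engine_kwargs`, and so on) and the model id are illustrative assumptions, not the library's confirmed schema.

```python
# Step 2 sketch: the two backend styles the text describes, as Python
# dicts mirroring YAML config. All key names are assumptions to check
# against the Data-Juicer operator reference.

# API-based operator: any OpenAI-compatible endpoint.
api_backend = {
    "calibrate_qa_mapper": {
        "api_model": "gpt-4o",                       # assumed key for model name
        "api_endpoint": "https://api.example.com/v1",  # placeholder endpoint
        "sampling_params": {"temperature": 0.7, "top_p": 0.9, "max_new_tokens": 512},
    }
}

# Ray vLLM pipeline: local HuggingFace model on GPU accelerators.
vllm_backend = {
    "llm_ray_vllm_engine_pipeline": {
        "model": "Qwen/Qwen2.5-7B-Instruct",         # example HuggingFace model id
        "accelerator_type": "A100",
        "engine_kwargs": {"max_model_len": 8192, "tensor_parallel_size": 2},
        "sampling_params": {"temperature": 0.2, "max_new_tokens": 1024},
    }
}
```

Note how sampling parameters appear in both styles, while engine kwargs only apply to the vLLM path, where the operator owns the engine lifecycle.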
Step 3: Configure Generation Pipeline
Create a YAML configuration file combining pre-filtering operators, LLM generation operators, and post-filtering operators in sequence. The pre-filters ensure input quality, the LLM operators produce generated content, and post-filters validate the output quality.
Key considerations:
- Pre-filters run before LLM generation to reduce wasted inference
- Multiple generation operators can be chained (e.g., generate then calibrate)
- Post-filters can validate generated content quality (language, length, relevance)
- The text_key or query_key/response_key parameters control field routing
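Putting the three stages together, a minimal Step 3 pipeline might be ordered like this. Operator names are taken from this workflow's Step 4 list; paths, parameter names, and values are illustrative.

```python
# Step 3 sketch: pre-filters, LLM generators, then post-filters, in the
# order the pipeline will execute them. Parameter names are illustrative.
pipeline = {
    "project_name": "qa-synthesis-demo",
    "dataset_path": "data/source.jsonl",
    "export_path": "data/generated.jsonl",
    "process": [
        {"text_length_filter": {"min_len": 50, "max_len": 8000}},        # pre-filter
        {"generate_qa_from_text_mapper": {"api_model": "gpt-4o"}},       # generate
        {"calibrate_qa_mapper": {"api_model": "gpt-4o"}},                # chained refinement
        {"language_id_score_filter": {"lang": "en", "min_score": 0.8}},  # post-filter
    ],
}

# Ordering matters: pre-filters must precede generators so low-quality
# inputs never reach (and pay for) LLM inference.
op_names = [next(iter(op)) for op in pipeline["process"]]
print(op_names)
```

This also shows the chaining pattern from the considerations above: `generate_qa_from_text_mapper` feeds directly into `calibrate_qa_mapper`.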
Step 4: Generate Content
Run the pipeline to execute LLM-based generation. QA generation operators extract questions and answers from source text. Calibration operators refine existing QA pairs for accuracy and clarity. Optimization operators improve prompt or response quality. Captioning operators generate descriptions for images or videos. The Ray vLLM pipeline manages model loading, batching, and inference automatically.
What happens:
- generate_qa_from_text_mapper: Extracts QA pairs from source text using LLM prompts
- generate_qa_from_examples_mapper: Generates new QA pairs following example patterns
- calibrate_qa_mapper: Refines existing QA pairs for accuracy
- optimize_qa_mapper: Improves QA pair quality through LLM rewriting
- llm_ray_vllm_engine_pipeline: Runs batch LLM inference via vLLM on Ray
- vlm_ray_vllm_engine_pipeline: Runs batch VLM inference for multimodal data
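To make the QA-extraction step concrete, here is an illustrative sketch of the kind of parsing a QA-generation mapper must perform: splitting raw LLM output into query/response pairs. The `Human:`/`Assistant:` turn format and the regex are assumptions for illustration; the actual prompt template and parsing logic of `generate_qa_from_text_mapper` may differ.

```python
import re

# Illustrative parsing of raw LLM output into structured QA pairs, the
# kind of post-processing a QA-generation mapper performs. The turn
# format and pattern are assumptions, not the library's exact template.
raw_output = """Human: What does Data-Juicer's pre-filter stage do?
Assistant: It removes low-quality inputs before LLM generation.
Human: Why run post-filters?
Assistant: To validate the quality of generated content."""

pair_pattern = re.compile(r"Human:(.*?)Assistant:(.*?)(?=Human:|\Z)", re.DOTALL)
qa_pairs = [
    {"query": q.strip(), "response": a.strip()}
    for q, a in pair_pattern.findall(raw_output)
]
print(len(qa_pairs))  # 2
```

Each parsed pair lands in the query/response fields that downstream operators like `calibrate_qa_mapper` expect.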
Step 5: Post-Process and Filter
Apply quality filters to the generated content. Filter by language to ensure outputs match the target language. Filter by text length to remove too-short or too-long generations. Apply custom validation operators for domain-specific quality checks. Deduplicate to remove redundant generated samples.
Key considerations:
- Language ID filtering removes off-language generations
- Text length filtering catches degenerate outputs (empty or truncated)
- Deduplication removes near-identical generated content
- Custom validators can check format compliance (e.g., valid JSON, markdown structure)
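A custom validator of the kind the last bullet describes can be sketched as a simple predicate over generated samples. Here it checks that a generated field is valid JSON; the field name `response` and the filtering style are illustrative, not Data-Juicer's built-in validator API.

```python
import json

# Step 5 sketch: a custom format-compliance check that keeps only samples
# whose generated field parses as JSON. Field names are illustrative.
def is_valid_json(sample: dict, key: str = "response") -> bool:
    """Return True if the generated field parses as JSON."""
    try:
        json.loads(sample[key])
        return True
    except (KeyError, TypeError, json.JSONDecodeError):
        return False

samples = [
    {"response": '{"answer": 42}'},  # valid JSON -> kept
    {"response": "answer: 42"},      # plain text -> dropped
    {"response": ""},                # degenerate empty output -> dropped
]
kept = [s for s in samples if is_valid_json(s)]
print(len(kept))  # 1
```

The empty-string case doubles as the degenerate-output check mentioned above: truncated or empty generations fail the same predicate.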
Step 6: Export Generated Dataset
Export the enriched dataset with generated fields to the output path. The generated content is stored in new or existing fields alongside the original data, preserving provenance.
Key considerations:
- New fields (e.g., response, caption) are added to the sample schema
- Export supports JSONL, Parquet, and other formats
- Generated data can be merged back with original datasets for training
- Statistics from post-filtering can be retained for quality auditing
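The export step can be sketched as a JSONL round trip that keeps the original text alongside the generated fields, preserving provenance. Field names (`text`, `query`, `response`) follow this workflow's conventions; the data is a made-up example.

```python
import json
import os
import tempfile

# Step 6 sketch: export enriched samples as JSONL, with original text and
# generated fields side by side so provenance is preserved.
enriched = [
    {"text": "Source passage A.", "query": "Q about A?", "response": "Answer A."},
    {"text": "Source passage B.", "query": "Q about B?", "response": "Answer B."},
]

out_path = os.path.join(tempfile.gettempdir(), "generated.jsonl")
with open(out_path, "w", encoding="utf-8") as f:
    for sample in enriched:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")

# Round-trip check: every exported line is one self-contained JSON sample.
with open(out_path, encoding="utf-8") as f:
    reloaded = [json.loads(line) for line in f]
```

Because each line carries both the source `text` and the generated fields, the exported file can be merged back with the original dataset for training without losing lineage.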