
Workflow:Data-Juicer LLM-Powered Data Generation

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, LLM_Ops, Data_Synthesis
Last Updated 2026-02-14 16:00 GMT

Overview

End-to-end process for using LLM-based operators and Ray vLLM pipeline integrations to generate, augment, and refine training data at scale within Data-Juicer.

Description

This workflow leverages Data-Juicer's LLM-powered mapper operators and Ray vLLM pipeline infrastructure to synthesize and improve training data. It covers generating question-answer pairs from text, calibrating and optimizing existing QA data, producing image/video captions, and running batch LLM inference through vLLM. The pipeline combines traditional data processing operators (for pre-filtering and post-filtering) with LLM-based operators that call language models via API or local inference. The Ray vLLM pipeline operators enable high-throughput batch inference by managing vLLM engine lifecycle within Ray Actors, supporting both text-only LLM and vision-language model inference.

Usage

Execute this workflow when you need to generate synthetic training data from existing text or multimodal sources, improve the quality of existing QA pairs through calibration, or enrich datasets with model-generated annotations (captions, tags, sentiment labels, intent classifications). Typical scenarios include building instruction-tuning datasets, creating domain-specific QA training data, generating preference pairs for RLHF, and producing captions for image/video datasets.

Execution Steps

Step 1: Prepare Source Data

Organize the source dataset with the fields required by the chosen generation operators. For QA generation from text, ensure a text field with source content. For QA optimization, provide query and response fields. For image/video captioning, include file paths in the appropriate media key fields. Pre-filter the source data using standard operators to ensure input quality.

Key considerations:

  • Different generation operators expect different input field schemas
  • Pre-filtering removes low-quality inputs that would produce poor generations
  • Text length filtering ensures inputs are neither too short nor too long for generation
  • Language filtering ensures inputs match the target language
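The field schemas above differ per operator family. A minimal sketch of one sample per input shape, written as YAML for readability (the field names `text`, `query`, `response`, and `images` follow the conventions named in this workflow; the exact media key fields are configurable per operator, so treat these as illustrative):

```yaml
# Illustrative input schemas, one sample per generation operator family
qa_from_text_sample:        # for generate_qa_from_text_mapper
  text: "Source passage from which QA pairs will be extracted."
qa_optimization_sample:     # for calibrate_qa_mapper / optimize_qa_mapper
  query: "What does calibrate_qa_mapper do?"
  response: "It refines an existing QA pair for accuracy and clarity."
captioning_sample:          # for image/video captioning operators
  images:
    - data/images/0001.jpg  # path stored under the operator's media key field
```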

Step 2: Configure LLM Backend

Set up the LLM inference backend. Data-Juicer supports API-based models (OpenAI-compatible endpoints) and local HuggingFace models. For API-based operators (generate_qa_from_text_mapper, calibrate_qa_mapper, optimize_qa_mapper), configure the API endpoint and model name. For Ray vLLM pipeline operators, specify the HuggingFace model name, accelerator type, and engine parameters.

Key considerations:

  • API-based operators work with any OpenAI-compatible endpoint
  • Ray vLLM pipelines require GPU accelerators and Ray cluster setup
  • Sampling parameters (temperature, top_p, max_new_tokens) control generation quality
  • Engine kwargs (max_model_len, tensor_parallel_size) control vLLM resource allocation
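These considerations translate into operator arguments in the pipeline config. A hedged sketch of both backend styles (parameter names such as `api_model`, `api_endpoint`, `sampling_params`, and `engine_kwargs` are illustrative; verify the exact keys against the operator reference for your Data-Juicer version):

```yaml
process:
  # API-based operator against any OpenAI-compatible endpoint
  - calibrate_qa_mapper:
      api_model: my-chat-model            # model name at the endpoint (placeholder)
      api_endpoint: https://llm.example.com/v1
      sampling_params:
        temperature: 0.7
        top_p: 0.9
        max_new_tokens: 512
  # Ray vLLM pipeline operator with a local HuggingFace model
  - llm_ray_vllm_engine_pipeline:
      hf_model: Qwen/Qwen2.5-7B-Instruct  # example model; any vLLM-servable model works
      accelerator_type: A100              # GPU type requested from the Ray cluster
      engine_kwargs:
        max_model_len: 8192
        tensor_parallel_size: 2
```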

Step 3: Configure Generation Pipeline

Create a YAML configuration file combining pre-filtering operators, LLM generation operators, and post-filtering operators in sequence. The pre-filters ensure input quality, the LLM operators produce generated content, and post-filters validate the output quality.

Key considerations:

  • Pre-filters run before LLM generation to reduce wasted inference
  • Multiple generation operators can be chained (e.g., generate then calibrate)
  • Post-filters can validate generated content quality (language, length, relevance)
  • The text_key or query_key/response_key parameters control field routing
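Putting the three stages together, a minimal end-to-end config might look like the following (operator and parameter names are a sketch based on Data-Juicer's standard filters; confirm them against your installed version):

```yaml
project_name: qa-generation
dataset_path: data/source.jsonl
export_path: outputs/generated_qa.jsonl
process:
  # 1. Pre-filters: avoid wasting inference on bad inputs
  - text_length_filter:
      min_len: 200
      max_len: 8000
  # 2. LLM generation (API-backed; backend parameters as in Step 2)
  - generate_qa_from_text_mapper:
      api_model: my-chat-model
  # 3. Post-filter: validate generated output
  - language_id_score_filter:
      lang: en
      min_score: 0.8
```

Such a config is typically executed through Data-Juicer's processing entry point (e.g. `dj-process --config config.yaml`); the exact command depends on how Data-Juicer is installed.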

Step 4: Generate Content

Run the pipeline to execute LLM-based generation. QA generation operators extract questions and answers from source text. Calibration operators refine existing QA pairs for accuracy and clarity. Optimization operators improve prompt or response quality. Captioning operators generate descriptions for images or videos. The Ray vLLM pipeline manages model loading, batching, and inference automatically.

What happens:

  • generate_qa_from_text_mapper: Extracts QA pairs from source text using LLM prompts
  • generate_qa_from_examples_mapper: Generates new QA pairs following example patterns
  • calibrate_qa_mapper: Refines existing QA pairs for accuracy
  • optimize_qa_mapper: Improves QA pair quality through LLM rewriting
  • llm_ray_vllm_engine_pipeline: Runs batch LLM inference via vLLM on Ray
  • vlm_ray_vllm_engine_pipeline: Runs batch VLM inference for multimodal data
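Because each operator reads and writes the same sample fields, the generate and calibrate operators listed above can be chained directly in one process list. A sketch (parameter names illustrative):

```yaml
process:
  - generate_qa_from_text_mapper:   # writes query/response pairs from the text field
      api_model: my-chat-model
  - calibrate_qa_mapper:            # reads those pairs back and refines them
      api_model: my-chat-model
      sampling_params:
        temperature: 0.3            # lower temperature for conservative rewrites
```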

Step 5: Post-Process and Filter

Apply quality filters to the generated content. Filter by language to ensure outputs match the target language. Filter by text length to remove too-short or too-long generations. Apply custom validation operators for domain-specific quality checks. Deduplicate to remove redundant generated samples.

Key considerations:

  • Language ID filtering removes off-language generations
  • Text length filtering catches degenerate outputs (empty or truncated)
  • Deduplication removes near-identical generated content
  • Custom validators can check format compliance (e.g., valid JSON, markdown structure)
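The post-processing stage is itself just another operator chain appended after generation. A sketch using Data-Juicer's standard filter and deduplicator operators (parameter values are illustrative; thresholds should be tuned per dataset):

```yaml
process:
  - language_id_score_filter:       # drop off-language generations
      lang: en
      min_score: 0.8
  - text_length_filter:             # catch empty or truncated outputs
      min_len: 20
      max_len: 4096
  - document_simhash_deduplicator:  # remove near-identical generated samples
      tokenization: space
      window_size: 6
      hamming_distance: 4
```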

Step 6: Export Generated Dataset

Export the enriched dataset with generated fields to the output path. The generated content is stored in new or existing fields alongside the original data, preserving provenance.

Key considerations:

  • New fields (e.g., response, caption) are added to the sample schema
  • Export supports JSONL, Parquet, and other formats
  • Generated data can be merged back with original datasets for training
  • Statistics from post-filtering can be retained for quality auditing
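Export is controlled by top-level config keys rather than by an operator. A sketch (key names such as `export_path`, `export_shard_size`, and `keep_stats_in_res_ds` reflect common Data-Juicer config fields; confirm them against your version):

```yaml
export_path: outputs/generated_qa.jsonl  # file suffix (.jsonl, .parquet, ...) selects the format
export_shard_size: 0                     # 0 = single output file; >0 = shard size in bytes
keep_stats_in_res_ds: true               # retain per-sample filter statistics for quality auditing
```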

Execution Diagram

GitHub URL

Workflow Repository