Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Huggingface Datatrove Synthetic Data Generation

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, LLM_Inference, Synthetic_Data
Last Updated 2026-02-14 17:00 GMT

Overview

End-to-end pipeline for generating synthetic training data at scale using LLM inference with vLLM or SGLang backends, including checkpointing, progress monitoring, and automatic dataset card generation.

Description

This workflow orchestrates large-scale synthetic data generation by running LLM inference across a corpus of input documents. It reads documents from a HuggingFace dataset, constructs prompts using configurable templates and system prompts, sends them to a vLLM or SGLang inference server for generation, and writes the results as Parquet files to a HuggingFace repository. The pipeline supports fault-tolerant execution through chunk-based checkpointing with a sqlite-backed request cache, enabling resumption after failures. It handles multi-node distributed inference, progress monitoring with live dataset card updates, and automatic dataset card generation upon completion.

Usage

Execute this workflow when you need to generate synthetic data from an existing dataset using an LLM (e.g., generating summaries, translations, or augmented training examples). Supports models from 1B to 1T+ parameters, local single-GPU execution, and multi-node Slurm clusters with tensor/pipeline/data parallelism.

Execution Steps

Step 1: Configure Input Dataset

Load the source dataset from HuggingFace Hub, specifying the dataset name, config, split, and the column containing prompts. Optionally apply a prompt template (with the DOCUMENT placeholder) and a system prompt to wrap each document's text into a structured chat message format.

Key considerations:

  • Prompt templates use DOCUMENT as the placeholder for the source text
  • System prompts are prepended as the first message in the chat format
  • Content is automatically truncated if it exceeds the model context budget

Step 2: Launch Inference Server

Start a local vLLM, SGLang, or custom inference server process. The server loads the model with configurable tensor parallelism, pipeline parallelism, quantization settings, and memory utilization. For multi-node setups, the server coordinates across nodes using Ray for distributed tensor parallelism.

Key considerations:

  • Server type selection: vLLM, SGLang, endpoint (external API), or custom
  • Tensor parallelism (TP) and pipeline parallelism (PP) for large models
  • GPU memory utilization, chunked prefill, and speculative decoding are configurable
  • Compilation lock manager prevents concurrent torch.compile cache corruption

Step 3: Execute Rollout Functions

For each input document, execute the rollout function asynchronously. The rollout constructs the API payload and calls the generate callback, which sends the request to the inference server. Multiple documents are processed concurrently, with configurable concurrency limits for both document processing and generation requests. Multiple rollouts per document are supported for generating multiple samples.

Key considerations:

  • Rollout functions are plain async callables with full control over prompt construction
  • Concurrent document processing and generation keep GPU utilization high
  • Shared context injects resources (process pools, temp dirs) into rollout functions

Step 4: Checkpoint and Cache Results

Write completed documents to local chunk files at configurable intervals. A sqlite-backed request cache deduplicates individual generation requests by payload hash, so completed generations are never re-sent during retries. On failure, the pipeline resumes from the last completed chunk.

Key considerations:

  • Chunk size (records_per_chunk) controls checkpoint granularity
  • Request cache requires xxhash and aiosqlite dependencies
  • Checkpoint files use the ${chunk_index} template variable in output filenames

Step 5: Write Output Data

Upload completed chunks as Parquet files to the HuggingFace Hub dataset repository (or local/S3 storage). Each chunk is uploaded incrementally as it completes, providing progressive data availability. Generation results are stored in document metadata under the rollout_results key.

Key considerations:

  • Parquet writer supports incremental uploads with configurable file sizes
  • Metadata expansion flattens nested rollout results into columns
  • Output is organized by rank and chunk index

Step 6: Generate Dataset Card

After inference completes, generate a HuggingFace dataset card (README.md) containing the model configuration, generation parameters, job statistics, and source dataset metadata. Upload the card to the output repository. Optionally, a progress monitor updates the card with a live progress bar and ETA during inference.

Key considerations:

  • Dataset card generation runs as a dependent Slurm job after inference
  • Progress monitoring runs in parallel as a separate lightweight job
  • The card includes generation statistics (tokens/s, total tokens, completion rate)

Execution Diagram

GitHub URL

Workflow Repository