Workflow:Huggingface Datatrove Synthetic Data Generation
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, LLM_Inference, Synthetic_Data |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
End-to-end pipeline for generating synthetic training data at scale using LLM inference with vLLM or SGLang backends, including checkpointing, progress monitoring, and automatic dataset card generation.
Description
This workflow orchestrates large-scale synthetic data generation by running LLM inference across a corpus of input documents. It reads documents from a HuggingFace dataset, constructs prompts using configurable templates and system prompts, sends them to a vLLM or SGLang inference server for generation, and writes the results as Parquet files to a HuggingFace repository. The pipeline supports fault-tolerant execution through chunk-based checkpointing with a sqlite-backed request cache, enabling resumption after failures. It handles multi-node distributed inference, progress monitoring with live dataset card updates, and automatic dataset card generation upon completion.
Usage
Execute this workflow when you need to generate synthetic data from an existing dataset using an LLM (e.g., generating summaries, translations, or augmented training examples). Supports models from 1B to 1T+ parameters, local single-GPU execution, and multi-node Slurm clusters with tensor/pipeline/data parallelism.
Execution Steps
Step 1: Configure Input Dataset
Load the source dataset from HuggingFace Hub, specifying the dataset name, config, split, and the column containing prompts. Optionally apply a prompt template (with the DOCUMENT placeholder) and a system prompt to wrap each document's text into a structured chat message format.
Key considerations:
- Prompt templates use DOCUMENT as the placeholder for the source text
- System prompts are prepended as the first message in the chat format
- Content is automatically truncated if it exceeds the model context budget
Step 2: Launch Inference Server
Start a local vLLM, SGLang, or custom inference server process. The server loads the model with configurable tensor parallelism, pipeline parallelism, quantization settings, and memory utilization. For multi-node setups, the server coordinates across nodes using Ray for distributed tensor parallelism.
Key considerations:
- Server type selection: vLLM, SGLang, endpoint (external API), or custom
- Tensor parallelism (TP) and pipeline parallelism (PP) for large models
- GPU memory utilization, chunked prefill, and speculative decoding are configurable
- Compilation lock manager prevents concurrent torch.compile cache corruption
Step 3: Execute Rollout Functions
For each input document, execute the rollout function asynchronously. The rollout constructs the API payload and calls the generate callback, which sends the request to the inference server. Multiple documents are processed concurrently, with configurable concurrency limits for both document processing and generation requests. Multiple rollouts per document are supported for generating multiple samples.
Key considerations:
- Rollout functions are plain async callables with full control over prompt construction
- Concurrent document processing and generation keep GPU utilization high
- Shared context injects resources (process pools, temp dirs) into rollout functions
Step 4: Checkpoint and Cache Results
Write completed documents to local chunk files at configurable intervals. A sqlite-backed request cache deduplicates individual generation requests by payload hash, so completed generations are never re-sent during retries. On failure, the pipeline resumes from the last completed chunk.
Key considerations:
- Chunk size (records_per_chunk) controls checkpoint granularity
- Request cache requires xxhash and aiosqlite dependencies
- Checkpoint files use the ${chunk_index} template variable in output filenames
Step 5: Write Output Data
Upload completed chunks as Parquet files to the HuggingFace Hub dataset repository (or local/S3 storage). Each chunk is uploaded incrementally as it completes, providing progressive data availability. Generation results are stored in document metadata under the rollout_results key.
Key considerations:
- Parquet writer supports incremental uploads with configurable file sizes
- Metadata expansion flattens nested rollout results into columns
- Output is organized by rank and chunk index
Step 6: Generate Dataset Card
After inference completes, generate a HuggingFace dataset card (README.md) containing the model configuration, generation parameters, job statistics, and source dataset metadata. Upload the card to the output repository. Optionally, a progress monitor updates the card with a live progress bar and ETA during inference.
Key considerations:
- Dataset card generation runs as a dependent Slurm job after inference
- Progress monitoring runs in parallel as a separate lightweight job
- The card includes generation statistics (tokens/s, total tokens, completion rate)