Implementation:Unslothai Unsloth SyntheticDataKit
| Knowledge Sources | |
|---|---|
| Domains | Data_Preparation, NLP |
| Last Updated | 2026-02-07 08:40 GMT |
Overview
Concrete tool for generating synthetic question-answer (QA) training data from documents using a locally served vLLM inference backend.
Description
The SyntheticDataKit class launches a vLLM inference server as a subprocess, loads a specified model, and provides methods for chunking input text files and preparing QA generation configurations. It handles subprocess lifecycle management with non-blocking pipe capture via the PipeCapture helper class, including graceful termination of the server process tree.
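The subprocess handling described above can be illustrated with a minimal, standard-library sketch. It is not the code in unsloth/dataprep/synthetic.py: `LineCapture` and `stop_process` are hypothetical names, the child process is a stand-in for the vLLM server, and only the direct child is stopped. Terminating a whole process tree would need extra machinery, such as process groups.

```python
import subprocess
import sys
import threading
from queue import Empty, Queue


class LineCapture:
    """Hypothetical helper: drains a pipe on a background thread."""

    def __init__(self, pipe):
        self.lines = Queue()
        self._thread = threading.Thread(target=self._drain, args=(pipe,), daemon=True)
        self._thread.start()

    def _drain(self, pipe):
        # Blocking reads happen here, off the caller's thread.
        for line in pipe:
            self.lines.put(line.rstrip("\n"))
        pipe.close()

    def read_available(self):
        """Return every line captured so far without blocking."""
        captured = []
        while True:
            try:
                captured.append(self.lines.get_nowait())
            except Empty:
                return captured

    def join(self, timeout=None):
        """Wait for the reader thread to reach end-of-file on the pipe."""
        self._thread.join(timeout)


def stop_process(proc, grace_seconds=5.0):
    """Ask the process to exit, then force-kill it if it ignores the request."""
    if proc.poll() is None:
        proc.terminate()
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return proc.returncode


if __name__ == "__main__":
    # Stand-in child process; a real server would keep running instead.
    child = subprocess.Popen(
        [sys.executable, "-c", "print('server ready')"],
        stdout=subprocess.PIPE,
        text=True,
    )
    capture = LineCapture(child.stdout)
    child.wait()
    capture.join(timeout=5)
    print(capture.read_available())
    stop_process(child)
```

Reading the pipe on a daemon thread keeps the caller free to poll for output while the server is still starting, which is the general idea behind a non-blocking capture helper.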
Usage
Import this class when you need to generate synthetic question-answer training data from raw text documents without external API access. It manages a local vLLM server for generation.
Code Reference
Source Location
- Repository: Unslothai_Unsloth
- File: unsloth/dataprep/synthetic.py
- Lines: 1-465
Signature
class SyntheticDataKit:
    def __init__(
        self,
        model_name: str = "unsloth/Llama-3.1-8B-Instruct-unsloth-bnb-4bit",
        max_seq_length: int = 2048,
        gpu_memory_utilization: float = 0.98,
        float8_kv_cache: bool = False,
        conservativeness: float = 1.0,
        token: str = None,
        timeout: int = 1200,
        **kwargs,
    ):
        """Launches vLLM server subprocess for synthetic data generation."""

    @staticmethod
    def from_pretrained(**kwargs) -> "SyntheticDataKit":
        """Factory method matching constructor signature."""

    def chunk_data(self, filename: str = None) -> list:
        """Chunks text file into overlapping token-bounded segments."""

    def prepare_qa_generation(self) -> None:
        """Sets up YAML config for QA generation workflow."""

    def cleanup(self) -> None:
        """Terminates vLLM server process and frees resources."""
Import
from unsloth.dataprep import SyntheticDataKit
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name | str | No | HuggingFace model identifier (default: unsloth/Llama-3.1-8B-Instruct-unsloth-bnb-4bit) |
| max_seq_length | int | No | Maximum context length (default: 2048) |
| gpu_memory_utilization | float | No | Fraction of GPU memory for vLLM (default: 0.98) |
| float8_kv_cache | bool | No | Use FP8 KV cache for memory savings (default: False) |
| conservativeness | float | No | Controls chunk overlap ratio (default: 1.0) |
| token | str | No | HuggingFace Hub authentication token |
| timeout | int | No | Server startup timeout in seconds (default: 1200) |
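One way a startup timeout like the one in the table could be enforced is by polling a readiness check until a deadline passes. The sketch below is an assumption about the general pattern, not the library's actual logic: `wait_until` and `port_is_open` are hypothetical helpers, and probing a TCP port is only one possible readiness signal.

```python
import socket
import time


def wait_until(is_ready, timeout, interval=0.5):
    """Poll is_ready() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def port_is_open(host="localhost", port=8000):
    """Hypothetical readiness probe: True if a TCP connection succeeds."""
    try:
        with socket.create_connection((host, port), timeout=1.0):
            return True
    except OSError:
        return False
```

Under these assumptions, `wait_until(port_is_open, timeout=1200)` would mirror the default timeout and return False if nothing accepts connections on localhost:8000 within that window.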
Outputs
| Name | Type | Description |
|---|---|---|
| chunk_data() returns | list | List of chunked text filenames |
| vLLM server | subprocess | Running inference server on localhost:8000 |
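To make "overlapping token-bounded segments" concrete, here is a minimal sketch of the chunking idea. It makes several assumptions: whitespace-separated words stand in for tokenizer tokens, `chunk_tokens` and `chunk_text` are hypothetical names, and the chunks are returned as strings, whereas chunk_data() is documented above as returning a list of chunk filenames.

```python
def chunk_tokens(tokens, max_tokens, overlap):
    """Split a token list into windows of at most max_tokens that share `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    step = max_tokens - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once a window has reached the end of the input.
        if start + max_tokens >= len(tokens):
            break
    return chunks


def chunk_text(text, max_tokens, overlap):
    """Chunk text using whitespace-separated words as stand-in tokens."""
    words = text.split()
    return [" ".join(chunk) for chunk in chunk_tokens(words, max_tokens, overlap)]
```

The overlap means a sentence that straddles a window boundary still appears intact in at least one chunk when it is shorter than the overlap, at the cost of some repeated text.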
Usage Examples
Basic Synthetic Data Generation
from unsloth.dataprep import SyntheticDataKit

# Launch vLLM server with model
kit = SyntheticDataKit(
    model_name="unsloth/Llama-3.1-8B-Instruct-unsloth-bnb-4bit",
    max_seq_length=4096,
    gpu_memory_utilization=0.90,
)

# Chunk input text file
chunks = kit.chunk_data(filename="input_document.txt")

# Prepare QA generation config
kit.prepare_qa_generation()

# Cleanup server
kit.cleanup()
Using Context Manager
from unsloth.dataprep import SyntheticDataKit

with SyntheticDataKit(
    model_name="unsloth/Llama-3.1-8B-Instruct-unsloth-bnb-4bit",
    max_seq_length=4096,
) as kit:
    chunks = kit.chunk_data(filename="document.txt")
    kit.prepare_qa_generation()
# Server automatically cleaned up on exit