Implementation:Unslothai Unsloth SyntheticDataKit

From Leeroopedia


Knowledge Sources
Domains Data_Preparation, NLP
Last Updated 2026-02-07 08:40 GMT

Overview

Concrete tool for generating synthetic QA training data from documents using a locally-served vLLM inference backend.

Description

The SyntheticDataKit class launches a vLLM inference server as a subprocess, loads a specified model, and provides methods for chunking input text files and preparing QA generation configurations. It handles subprocess lifecycle management with non-blocking pipe capture via the PipeCapture helper class, including graceful termination of the server process tree.
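The lifecycle described above (start a child process, drain its output without blocking, then stop it gracefully) can be sketched with the standard library alone. This is a minimal illustration, not the library's PipeCapture code: `LineCapture`, `stop_process`, and `run_demo` are hypothetical names, and a short Python one-liner stands in for the vLLM server command.

```python
import subprocess
import sys
import threading
import time
from collections import deque


class LineCapture:
    """Hypothetical helper: drains a pipe on a daemon thread so the child
    never stalls on a full pipe buffer and the parent never blocks on reads."""

    def __init__(self, pipe, keep_lines=1000):
        self.lines = deque(maxlen=keep_lines)  # keep only the most recent lines
        self._thread = threading.Thread(target=self._drain, args=(pipe,), daemon=True)
        self._thread.start()

    def _drain(self, pipe):
        for line in pipe:  # loop ends when the child closes its end of the pipe
            self.lines.append(line.rstrip("\r\n"))
        pipe.close()

    def join(self, timeout=None):
        self._thread.join(timeout)


def stop_process(proc, grace_seconds=5.0):
    """Ask the process to exit, then force-kill it if it ignores the request."""
    if proc.poll() is None:
        proc.terminate()
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return proc.returncode


def run_demo():
    # Stand-in child process (assumption); a real server command would go here.
    child_code = "import time; print('server ready', flush=True); time.sleep(60)"
    proc = subprocess.Popen(
        [sys.executable, "-c", child_code],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    capture = LineCapture(proc.stdout)

    # Poll the captured lines instead of blocking on the pipe.
    deadline = time.monotonic() + 30
    while not capture.lines and time.monotonic() < deadline:
        time.sleep(0.05)

    return_code = stop_process(proc)
    capture.join(timeout=5)
    return list(capture.lines), return_code


if __name__ == "__main__":
    print(run_demo())
```

This sketch stops only the direct child. Terminating a whole process tree, as the description mentions, would need extra handling such as a process group, which is omitted here.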

Usage

Import this class when you need to generate synthetic question-answer training data from raw text documents without external API access. It manages a local vLLM server for generation.

Code Reference

Source Location

Signature

class SyntheticDataKit:
    def __init__(
        self,
        model_name: str = "unsloth/Llama-3.1-8B-Instruct-unsloth-bnb-4bit",
        max_seq_length: int = 2048,
        gpu_memory_utilization: float = 0.98,
        float8_kv_cache: bool = False,
        conservativeness: float = 1.0,
        token: str = None,
        timeout: int = 1200,
        **kwargs,
    ):
        """Launches vLLM server subprocess for synthetic data generation."""

    @staticmethod
    def from_pretrained(**kwargs) -> "SyntheticDataKit":
        """Factory method matching constructor signature."""

    def chunk_data(self, filename: str = None) -> list:
        """Chunks text file into overlapping token-bounded segments."""

    def prepare_qa_generation(self) -> None:
        """Sets up YAML config for QA generation workflow."""

    def cleanup(self) -> None:
        """Terminates vLLM server process and frees resources."""

Import

from unsloth.dataprep import SyntheticDataKit
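The signature describes chunk_data as producing overlapping, token-bounded segments. The idea can be sketched as a sliding window over a token list. This is an illustration under assumptions, not the library's implementation: `chunk_tokens` and `chunk_text` are hypothetical helpers, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_tokens(tokens, max_tokens, overlap):
    """Split a token list into windows of at most `max_tokens` tokens,
    where consecutive windows share `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the current window already reaches the end
        start += step
    return chunks


def chunk_text(text, max_tokens=8, overlap=2):
    # Whitespace splitting is a stand-in for a model tokenizer (assumption).
    words = text.split()
    return [" ".join(chunk) for chunk in chunk_tokens(words, max_tokens, overlap)]


if __name__ == "__main__":
    for piece in chunk_text("a b c d e", max_tokens=3, overlap=1):
        print(piece)
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.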

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| model_name | str | No | HuggingFace model identifier (default: Llama-3.1-8B-Instruct) |
| max_seq_length | int | No | Maximum context length (default: 2048) |
| gpu_memory_utilization | float | No | Fraction of GPU memory for vLLM (default: 0.98) |
| float8_kv_cache | bool | No | Use FP8 KV cache for memory savings (default: False) |
| conservativeness | float | No | Controls chunk overlap ratio (default: 1.0) |
| token | str | No | HuggingFace Hub authentication token |
| timeout | int | No | Server startup timeout in seconds (default: 1200) |

Outputs

| Name | Type | Description |
|---|---|---|
| chunk_data() returns | list | List of chunked text filenames |
| vLLM server | subprocess | Running inference server on localhost:8000 |
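A startup timeout like the `timeout` input implies some form of readiness check against the server on localhost:8000. One way to express that is a polling loop with a deadline. The sketch below is an assumption about the general technique, not the class's internals: `http_probe` and `wait_until_ready` are hypothetical helpers, and the URL in the usage note assumes the OpenAI-compatible `/v1/models` route that vLLM's API server provides.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_probe(url, request_timeout=2.0):
    """Return True if `url` answers with HTTP 200, False on any connection problem."""
    try:
        with urllib.request.urlopen(url, timeout=request_timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        return False


def wait_until_ready(probe, timeout=1200.0, interval=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call `probe` repeatedly until it returns True or `timeout` seconds pass.

    `sleep` and `clock` are injectable so the loop can be exercised without waiting.
    """
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller might use it as `wait_until_ready(lambda: http_probe("http://localhost:8000/v1/models"), timeout=1200)` and treat a `False` result as a failed startup.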

Usage Examples

Basic Synthetic Data Generation

from unsloth.dataprep import SyntheticDataKit

# Launch vLLM server with model
kit = SyntheticDataKit(
    model_name="unsloth/Llama-3.1-8B-Instruct-unsloth-bnb-4bit",
    max_seq_length=4096,
    gpu_memory_utilization=0.90,
)

try:
    # Prepare the QA generation config before chunking
    kit.prepare_qa_generation()

    # Chunk the input text file
    chunks = kit.chunk_data(filename="input_document.txt")
finally:
    # Stop the vLLM server even if a step above raises
    kit.cleanup()

Using Context Manager

from unsloth.dataprep import SyntheticDataKit

with SyntheticDataKit(
    model_name="unsloth/Llama-3.1-8B-Instruct-unsloth-bnb-4bit",
    max_seq_length=4096,
) as kit:
    kit.prepare_qa_generation()
    chunks = kit.chunk_data(filename="document.txt")
# Server automatically cleaned up on exit
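The automatic cleanup in the example above relies on Python's context manager protocol: `__exit__` runs when the `with` block ends, whether it finishes normally or raises. A minimal sketch of that pattern, using a hypothetical `ManagedResource` class rather than the real SyntheticDataKit:

```python
class ManagedResource:
    """Hypothetical stand-in for an object that owns a background server."""

    def __init__(self):
        self.running = True      # pretend a server was started here
        self.cleanup_calls = 0

    def cleanup(self):
        # Safe to call more than once: a second call finds nothing left to stop.
        self.running = False
        self.cleanup_calls += 1

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.cleanup()
        return False  # do not swallow exceptions raised inside the block


def demo():
    resource = ManagedResource()
    try:
        with resource:
            raise RuntimeError("generation failed")
    except RuntimeError:
        pass
    # cleanup() ran even though the block raised
    return resource.running


if __name__ == "__main__":
    print(demo())
```

Returning `False` from `__exit__` lets the original exception propagate, so a failure inside the block is still visible to the caller after the resource is released.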
