Workflow: Dagster LLM Fine-Tuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, ML_Ops |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
End-to-end process for fine-tuning a large language model on domain-specific data using Dagster for orchestration and OpenAI for model training.
Description
This workflow demonstrates how to orchestrate a complete LLM fine-tuning pipeline with Dagster. The pipeline ingests semi-structured Goodreads book data into DuckDB, performs feature engineering to extract genre categories from nested JSON fields, creates JSONL training and validation files in OpenAI's expected format, submits a fine-tuning job to the OpenAI API, and validates the resulting model's performance against the base model using automated asset checks. The result is a specialized genre-classification model that outperforms the general-purpose base model on domain-specific tasks.
Usage
Execute this workflow when you have a domain-specific dataset and need to fine-tune an LLM (such as GPT-4o-mini) for a specialized classification or generation task. This is appropriate when the base model's accuracy on your domain is insufficient and you have labeled training data available. The workflow requires an OpenAI API key and sufficient API credits for fine-tuning.
Execution Steps
Step 1: Data Ingestion
Load raw datasets into DuckDB for processing. The pipeline ingests Goodreads book metadata (graphic novels and authors) from compressed JSON files using DuckDB's native JSON loading capabilities. Each data source is represented as a separate Dagster asset with a DuckDB resource for connection management.
Key considerations:
- DuckDB handles JSON parsing natively without intermediate conversion
- Each dataset maps to a separate asset for independent materialization
- The DuckDB resource centralizes connection configuration across all assets
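To make the ingestion step concrete, the following sketch builds the kind of statement each asset would hand to DuckDB. The table name, file path, and helper name are illustrative assumptions, not the pipeline's actual identifiers; in the real workflow the statement executes inside a Dagster asset through the shared DuckDB resource.

```python
# Illustrative helper: build the DuckDB statement that loads a (gzipped)
# JSON file straight into a table. DuckDB's read_json_auto infers the
# schema and decompresses .gz files transparently, so no intermediate
# conversion step is needed.

def build_ingest_sql(table: str, path: str) -> str:
    return (
        f"CREATE OR REPLACE TABLE {table} AS "
        f"SELECT * FROM read_json_auto('{path}')"
    )

# Hypothetical table/path; each Dagster asset would call this with its own
# dataset (e.g. one asset for books, one for authors).
sql = build_ingest_sql("graphic_novels", "data/goodreads_books.json.gz")
print(sql)
```

Because each dataset maps to its own asset, a failed or stale dataset can be re-materialized independently without re-running the others.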
Step 2: Feature Engineering
Transform raw book data to extract and categorize genre labels suitable for model training. This step unpacks semi-structured fields (such as popular_shelves JSON arrays), identifies usable genre categories (fantasy, horror, manga, etc.), and creates an enriched dataset with clean genre labels linked to book descriptions.
Key considerations:
- SQL-based transformations leverage DuckDB's JSON functions for nested field extraction
- Genre filtering ensures only categories with sufficient training examples are included
- The output dataset structure directly maps to the training format required downstream
Step 3: Training File Creation and Validation
Convert the prepared dataset into JSONL files formatted for OpenAI's fine-tuning API. Each training example is structured as a chatbot-style conversation with system instructions, user input (book description), and expected assistant output (genre classification). Separate training and validation files are created. Asset checks validate the files against OpenAI's format requirements.
Key considerations:
- The JSONL format must match OpenAI's fine-tuning specification exactly
- Validation checks verify message structure, token counts, and format compliance
- Both training and validation splits are generated as separate assets
- Reusable validation functions follow OpenAI's published format cookbook
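A minimal sketch of both halves of this step, using only the standard library: serializing one example as a chat-format JSONL line, and a structural check in the spirit of OpenAI's format cookbook. The prompt wording and function names are assumptions, not the pipeline's actual code.

```python
import json

# Illustrative system prompt; the real pipeline's instructions may differ.
SYSTEM_PROMPT = (
    "You are a book genre classifier. Given a book description, "
    "respond with a single genre label."
)

def to_training_example(description: str, genre: str) -> str:
    """Serialize one (description, genre) pair as an OpenAI chat-format JSONL line."""
    record = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": description},
            {"role": "assistant", "content": genre},
        ]
    }
    return json.dumps(record)

def validate_line(line: str) -> list[str]:
    """Return a list of format errors for one JSONL line (empty means valid)."""
    errors = []
    messages = json.loads(line).get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing messages list"]
    for msg in messages:
        if msg.get("role") not in {"system", "user", "assistant"}:
            errors.append(f"unrecognized role: {msg.get('role')}")
        if not isinstance(msg.get("content"), str):
            errors.append("content must be a string")
    if messages[-1].get("role") != "assistant":
        errors.append("last message must be the assistant's label")
    return errors

line = to_training_example("A lone swordsman wanders a cursed kingdom...", "fantasy")
print(validate_line(line))  # []
```

Running the same validator over both the training and validation files as an asset check surfaces format problems before any API credits are spent.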
Step 4: Fine-Tuning Job Execution
Upload training files to the OpenAI API and submit a fine-tuning job. The asset orchestrates the complete lifecycle: file upload, job creation, polling for completion, and recording the resulting model identifier. The fine-tuning uses the gpt-4o-mini base model and produces a specialized variant.
Key considerations:
- File upload and job submission are sequential API operations
- The asset polls for job completion, which may take minutes to hours
- The resulting model name is recorded as asset metadata for downstream reference
- Model versioning is tracked through Dagster's metadata system
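The heart of this asset is a poll-until-done loop. In the sketch below, `fetch_status` is an injectable stand-in for `client.fine_tuning.jobs.retrieve(job_id).status` (the upload and submission steps would use `client.files.create(..., purpose="fine-tune")` and `client.fine_tuning.jobs.create(training_file=..., model="gpt-4o-mini")`), so the control flow can be shown without a live OpenAI client.

```python
import time

# Terminal states for an OpenAI fine-tuning job.
TERMINAL = {"succeeded", "failed", "cancelled"}

def wait_for_job(fetch_status, poll_seconds: float = 0.0, max_polls: int = 1000) -> str:
    """Poll a fine-tuning job until it reaches a terminal state."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL:
            return status
        time.sleep(poll_seconds)  # real runs wait tens of seconds between polls
    raise TimeoutError("fine-tuning job did not finish within max_polls")

# Simulated job that takes two polls before succeeding.
statuses = iter(["validating_files", "running", "succeeded"])
result = wait_for_job(lambda: next(statuses))
print(result)  # succeeded
```

On success, the asset would read the job's `fine_tuned_model` identifier and attach it as Dagster asset metadata so downstream assets and checks can reference the exact model version.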
Step 5: Model Validation
Compare the fine-tuned model's classification accuracy against the base model using a representative sample. An asset check runs both models on 100 test records, measures genre classification accuracy for each, and reports comparative metrics. The validation results are stored as metadata for historical tracking in the Dagster UI.
Key considerations:
- Validation compares fine-tuned vs. base model on identical test data
- The asset check uses additional_ins to access both the model asset and test data
- Metrics are recorded as metadata for trend monitoring across training runs
- Failed validation does not automatically roll back the model but provides visibility
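The comparison logic behind the asset check can be sketched as a small scoring function. The predictors here are fakes standing in for chat-completion calls against the base and fine-tuned models; names and data are illustrative.

```python
def compare_accuracy(records, predict_base, predict_tuned):
    """Score two classifiers on the same labeled records; return both accuracies."""
    base_hits = tuned_hits = 0
    for description, label in records:
        base_hits += predict_base(description) == label
        tuned_hits += predict_tuned(description) == label
    n = len(records)
    return {"base_accuracy": base_hits / n, "tuned_accuracy": tuned_hits / n}

# Fake test sample and predictors; the real check runs 100 records through
# each model via the OpenAI API.
records = [("desc a", "fantasy"), ("desc b", "horror"), ("desc c", "manga")]
predict_base = lambda d: "fantasy"   # base model always guesses one genre
predict_tuned = dict(records).get    # fine-tuned model is right on all three
metrics = compare_accuracy(records, predict_base, predict_tuned)
print(metrics)
```

Recording the returned dictionary as asset-check metadata is what enables accuracy trends to be compared across training runs in the Dagster UI, even though a failing check only reports rather than rolls back.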