Workflow: Dagster LLM Fine-Tuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, ML_Ops |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
End-to-end process for fine-tuning a large language model on domain-specific data using Dagster for orchestration and OpenAI for model training.
Description
This workflow demonstrates how to orchestrate a complete LLM fine-tuning pipeline with Dagster. The pipeline ingests semi-structured Goodreads book data into DuckDB, performs feature engineering to extract genre categories from nested JSON fields, creates JSONL training and validation files in OpenAI's expected format, submits a fine-tuning job to the OpenAI API, and validates the resulting model's performance against the base model using automated asset checks. The result is a specialized genre-classification model that outperforms the general-purpose base model on domain-specific tasks.
Usage
Execute this workflow when you have a domain-specific dataset and need to fine-tune an LLM (such as GPT-4o-mini) for a specialized classification or generation task. This is appropriate when the base model's accuracy on your domain is insufficient and you have labeled training data available. The workflow requires an OpenAI API key and sufficient API credits for fine-tuning.
Execution Steps
Step 1: Data Ingestion
Load raw datasets into DuckDB for processing. The pipeline ingests Goodreads book metadata (graphic novels and authors) from compressed JSON files using DuckDB's native JSON loading capabilities. Each data source is represented as a separate Dagster asset with a DuckDB resource for connection management.
Key considerations:
- DuckDB handles JSON parsing natively without intermediate conversion
- Each dataset maps to a separate asset for independent materialization
- The DuckDB resource centralizes connection configuration across all assets
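To make the ingestion step concrete, the following sketch builds the kind of statement each asset would hand to DuckDB. The table name, file path, and helper name are illustrative assumptions, not the pipeline's actual identifiers; in the real workflow the statement executes inside a Dagster asset through the shared DuckDB resource.

```python
# Illustrative helper: build the DuckDB statement that loads a (gzipped)
# JSON file straight into a table. DuckDB's read_json_auto infers the
# schema and decompresses .gz files transparently, so no intermediate
# conversion step is needed.

def build_ingest_sql(table: str, path: str) -> str:
    return (
        f"CREATE OR REPLACE TABLE {table} AS "
        f"SELECT * FROM read_json_auto('{path}')"
    )

# Hypothetical table/path; each Dagster asset would call this with its own
# dataset (e.g. one asset for books, one for authors).
sql = build_ingest_sql("graphic_novels", "data/goodreads_books.json.gz")
print(sql)
```

Because each dataset maps to its own asset, a failed or stale dataset can be re-materialized independently without re-running the others.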
Step 2: Feature Engineering
Transform raw book data to extract and categorize genre labels suitable for model training. This step unpacks semi-structured fields (such as popular_shelves JSON arrays), identifies usable genre categories (fantasy, horror, manga, etc.), and creates an enriched dataset with clean genre labels linked to book descriptions.
Key considerations:
- SQL-based transformations leverage DuckDB's JSON functions for nested field extraction
- Genre filtering ensures only categories with sufficient training examples are included
- The output dataset structure directly maps to the training format required downstream
Step 3: Training File Creation and Validation
Convert the prepared dataset into JSONL files formatted for OpenAI's fine-tuning API. Each training example is structured as a chatbot-style conversation with system instructions, user input (book description), and expected assistant output (genre classification). Separate training and validation files are created. Asset checks validate the files against OpenAI's format requirements.
Key considerations:
- The JSONL format must match OpenAI's fine-tuning specification exactly
- Validation checks verify message structure, token counts, and format compliance
- Both training and validation splits are generated as separate assets
- Reusable validation functions follow OpenAI's published format cookbook
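A minimal sketch of both halves of this step, using only the standard library: serializing one example as a chat-format JSONL line, and a structural check in the spirit of OpenAI's format cookbook. The prompt wording and function names are assumptions, not the pipeline's actual code.

```python
import json

# Illustrative system prompt; the real pipeline's instructions may differ.
SYSTEM_PROMPT = (
    "You are a book genre classifier. Given a book description, "
    "respond with a single genre label."
)

def to_training_example(description: str, genre: str) -> str:
    """Serialize one (description, genre) pair as an OpenAI chat-format JSONL line."""
    record = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": description},
            {"role": "assistant", "content": genre},
        ]
    }
    return json.dumps(record)

def validate_line(line: str) -> list[str]:
    """Return a list of format errors for one JSONL line (empty means valid)."""
    errors = []
    messages = json.loads(line).get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing messages list"]
    for msg in messages:
        if msg.get("role") not in {"system", "user", "assistant"}:
            errors.append(f"unrecognized role: {msg.get('role')}")
        if not isinstance(msg.get("content"), str):
            errors.append("content must be a string")
    if messages[-1].get("role") != "assistant":
        errors.append("last message must be the assistant's label")
    return errors

line = to_training_example("A lone swordsman wanders a cursed kingdom...", "fantasy")
print(validate_line(line))  # []
```

Running the same validator over both the training and validation files as an asset check surfaces format problems before any API credits are spent.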
Step 4: Fine-Tuning Job Execution
Upload training files to the OpenAI API and submit a fine-tuning job. The asset orchestrates the complete lifecycle: file upload, job creation, polling for completion, and recording the resulting model identifier. The fine-tuning uses the gpt-4o-mini base model and produces a specialized variant.
Key considerations:
- File upload and job submission are sequential API operations
- The asset polls for job completion, which may take minutes to hours
- The resulting model name is recorded as asset metadata for downstream reference
- Model versioning is tracked through Dagster's metadata system
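The heart of this asset is a poll-until-done loop. In the sketch below, `fetch_status` is an injectable stand-in for `client.fine_tuning.jobs.retrieve(job_id).status` (the upload and submission steps would use `client.files.create(..., purpose="fine-tune")` and `client.fine_tuning.jobs.create(training_file=..., model="gpt-4o-mini")`), so the control flow can be shown without a live OpenAI client.

```python
import time

# Terminal states for an OpenAI fine-tuning job.
TERMINAL = {"succeeded", "failed", "cancelled"}

def wait_for_job(fetch_status, poll_seconds: float = 0.0, max_polls: int = 1000) -> str:
    """Poll a fine-tuning job until it reaches a terminal state."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL:
            return status
        time.sleep(poll_seconds)  # real runs wait tens of seconds between polls
    raise TimeoutError("fine-tuning job did not finish within max_polls")

# Simulated job that takes two polls before succeeding.
statuses = iter(["validating_files", "running", "succeeded"])
result = wait_for_job(lambda: next(statuses))
print(result)  # succeeded
```

On success, the asset would read the job's `fine_tuned_model` identifier and attach it as Dagster asset metadata so downstream assets and checks can reference the exact model version.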
Step 5: Model Validation
Compare the fine-tuned model's classification accuracy against the base model using a representative sample. An asset check runs both models on 100 test records, measures genre classification accuracy for each, and reports comparative metrics. The validation results are stored as metadata for historical tracking in the Dagster UI.
Key considerations:
- Validation compares fine-tuned vs. base model on identical test data
- The asset check uses additional_ins to access both the model asset and test data
- Metrics are recorded as metadata for trend monitoring across training runs
- Failed validation does not automatically roll back the model but provides visibility
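The comparison logic behind the asset check can be sketched as a small scoring function. The predictors here are fakes standing in for chat-completion calls against the base and fine-tuned models; names and data are illustrative.

```python
def compare_accuracy(records, predict_base, predict_tuned):
    """Score two classifiers on the same labeled records; return both accuracies."""
    base_hits = tuned_hits = 0
    for description, label in records:
        base_hits += predict_base(description) == label
        tuned_hits += predict_tuned(description) == label
    n = len(records)
    return {"base_accuracy": base_hits / n, "tuned_accuracy": tuned_hits / n}

# Fake test sample and predictors; the real check runs 100 records through
# each model via the OpenAI API.
records = [("desc a", "fantasy"), ("desc b", "horror"), ("desc c", "manga")]
predict_base = lambda d: "fantasy"   # base model always guesses one genre
predict_tuned = dict(records).get    # fine-tuned model is right on all three
metrics = compare_accuracy(records, predict_base, predict_tuned)
print(metrics)
```

Recording the returned dictionary as asset-check metadata is what enables accuracy trends to be compared across training runs in the Dagster UI, even though a failing check only reports rather than rolls back.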