

Workflow: Dagster DSPy Optimization

From Leeroopedia



Knowledge Sources
Domains: LLMs, Prompt_Optimization, ML_Ops
Last Updated: 2026-02-10 12:00 GMT

Overview

End-to-end process for building an AI reasoning system using DSPy with automated prompt optimization, custom evaluation metrics, and production monitoring, orchestrated by Dagster.

Description

This workflow demonstrates how to orchestrate a DSPy-based AI system lifecycle with Dagster. It ingests and validates puzzle data (NYT Connections), then builds a Chain-of-Thought solver using DSPy's structured reasoning framework. MIPROv2 automatic optimization improves solver performance through instruction tuning and few-shot example curation; the optimized model is then evaluated against custom metrics with quality gates, and production monitoring is established with alert thresholds. The pipeline uses Dagster Components with YAML configuration for declarative definition management.

Usage

Execute this workflow when you need to build an LLM-powered reasoning system where prompt engineering alone is insufficient and you want automated, data-driven optimization of prompts and examples. This is appropriate for classification, reasoning, or structured output tasks where you have training data for evaluation. Requires DSPy and an LLM API backend (e.g., OpenAI).

Execution Steps

Step 1: Data Ingestion and Validation

Load and validate puzzle data from structured sources. Domain models (Puzzle, GameState) define the expected data schema. The pipeline loads puzzles from CSV, validates structural constraints (e.g., 16 words in 4 groups of 4), and splits data into training and evaluation sets with configurable ratios. Dagster Components with YAML configuration drive the data loading.

Key considerations:

  • Domain model classes enforce data schema validation at ingestion time
  • Configurable train/eval split ratios enable experimentation with data allocation
  • Dagster Components with YAML configuration simplify asset definition
  • Data validation failures surface as materialization errors in the Dagster UI
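The validation and split logic can be sketched in plain Python. The `Puzzle` class and function names here are illustrative stand-ins for the workflow's domain models; in the real pipeline this code runs inside Dagster asset bodies:

```python
import random
from dataclasses import dataclass

@dataclass
class Puzzle:
    """Illustrative stand-in for the workflow's Puzzle domain model."""
    puzzle_id: str
    groups: list[list[str]]  # expected: 4 groups of 4 words

    def validate(self) -> "Puzzle":
        # Structural constraints: 4 groups of 4, 16 distinct words total.
        if len(self.groups) != 4 or any(len(g) != 4 for g in self.groups):
            raise ValueError(f"{self.puzzle_id}: expected 4 groups of 4 words")
        words = [w for g in self.groups for w in g]
        if len(set(words)) != 16:
            raise ValueError(f"{self.puzzle_id}: expected 16 distinct words")
        return self

def train_eval_split(puzzles, eval_ratio=0.2, seed=0):
    """Deterministic shuffle-and-split into (train, eval) lists."""
    shuffled = list(puzzles)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - eval_ratio))
    return shuffled[:cut], shuffled[cut:]
```

A `ValueError` raised by `validate()` inside an asset body is what Dagster surfaces as a materialization error in the UI.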

Step 2: DSPy Model Definition

Build a ConnectionsSolver using DSPy's Chain-of-Thought framework for structured multi-step reasoning. The solver takes game rules, available words, and guess history as structured inputs and produces categorized word groups as outputs. An iterative solving loop incorporates feedback from previous guesses to refine subsequent attempts.

Key considerations:

  • Chain-of-Thought (CoT) prompting enables the LLM to show reasoning steps
  • Structured input/output signatures define the solver's interface contract
  • The iterative loop with feedback mimics human problem-solving strategy
  • The DSPyResource manages LLM backend configuration centrally

Step 3: Automatic Optimization

Apply MIPROv2 automatic optimization to improve the solver's performance through systematic instruction tuning and few-shot example curation. The optimizer tests multiple instruction variants, selects the best-performing examples from training data, and compiles an optimized model. Quality gates prevent unnecessary computation when performance thresholds are already met.

Key considerations:

  • MIPROv2 optimizes both instructions and few-shot examples jointly
  • Quality gates check current performance before investing in optimization
  • The optimization asset depends on both the model and training data assets
  • Optimized model state is persisted as a Dagster asset for versioned tracking

Step 4: Evaluation and Monitoring

Evaluate the optimized solver against the held-out evaluation set using custom success metrics. The evaluation captures rich prediction data (puzzle ID, success/failure, attempt count, timing, groups solved). Accuracy below an alert threshold (e.g., 65%) triggers a production monitoring alert, and deployment criteria gate whether the optimized model replaces the production version.

Key considerations:

  • Custom success metrics measure domain-specific performance (groups solved, attempt efficiency)
  • Rich prediction metadata enables detailed post-hoc analysis of failure modes
  • Alert thresholds provide early warning of model quality degradation
  • Deployment criteria enforce minimum quality standards before production promotion
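The prediction record and the two gates can be sketched as follows; field names and the example deployment criteria are illustrative, with only the 65% accuracy threshold taken from the workflow description:

```python
from dataclasses import dataclass

@dataclass
class PredictionRecord:
    """Illustrative per-puzzle evaluation record."""
    puzzle_id: str
    solved: bool
    attempts: int
    groups_solved: int  # 0-4
    latency_s: float

def accuracy(records):
    return sum(r.solved for r in records) / len(records)

def should_alert(records, threshold=0.65):
    """Monitoring gate: alert when accuracy falls below the threshold."""
    return accuracy(records) < threshold

def meets_deployment_criteria(records, min_accuracy=0.65, max_avg_attempts=5.0):
    """Deployment gate: hypothetical minimum standards before promotion."""
    avg_attempts = sum(r.attempts for r in records) / len(records)
    return accuracy(records) >= min_accuracy and avg_attempts <= max_avg_attempts
```

Storing one record per puzzle, rather than only an aggregate score, is what enables the post-hoc failure-mode analysis mentioned above.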

Execution Diagram

GitHub URL

Workflow Repository