Workflow: Dagster RAG Pipeline
| Field | Value |
|---|---|
| Domains | LLMs, RAG, Data_Engineering |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
End-to-end process for building a Retrieval-Augmented Generation (RAG) system that ingests data from multiple sources, generates embeddings, stores them in a vector database, and answers questions using retrieved context.
Description
This workflow constructs a complete RAG pipeline orchestrated by Dagster. It ingests knowledge from GitHub issues/discussions via GraphQL and documentation websites via sitemap scraping, converts text into LangChain Document format, generates vector embeddings using OpenAI's embedding models, stores them in a Pinecone vector database, and provides a retrieval-based question-answering interface powered by GPT-4. The pipeline uses weekly partitions with automation conditions for incremental updates and custom I/O managers for LangChain Document serialization.
Usage
Execute this workflow when you need to build a knowledge-based question-answering system that draws on multiple data sources (code repositories, documentation sites, discussion forums). This is appropriate when you want to enable natural language queries over a large, evolving knowledge base with automatic incremental updates. Requires OpenAI API access and a Pinecone vector database account.
Execution Steps
Step 1: Data Source Ingestion
Extract knowledge content from multiple sources into a unified document format. GitHub issues and discussions are fetched using GraphQL API queries over configurable date ranges. Documentation websites are scraped by parsing sitemap XML files and extracting page content. All sources are converted to LangChain Document objects with rich metadata (source URL, timestamps, content type).
Key considerations:
- A custom GithubResource wraps GraphQL API interactions for rate-limited access
- SitemapScraper resource uses BeautifulSoup for HTML-to-text conversion
- Weekly partitions enable incremental ingestion of new content
- LangChain Document format provides a unified representation across source types
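As a minimal sketch of the ingestion step, the snippet below parses a sitemap XML string and wraps extracted page text in LangChain-style documents with source metadata. It uses only the standard library; the URLs, field names, and the `Document` stand-in are illustrative assumptions, and the real pipeline would fetch each page over HTTP and run BeautifulSoup for HTML-to-text conversion.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class Document:
    """Minimal stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

# Sitemaps live in this XML namespace; ElementTree needs it spelled out.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Extract page URLs (<loc> entries) from a sitemap XML string."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.iter(f"{SITEMAP_NS}loc")]

def docs_from_pages(pages: dict[str, str]) -> list[Document]:
    """Wrap extracted page text in Documents with source metadata."""
    return [Document(page_content=text, metadata={"source": url, "type": "docs"})
            for url, text in pages.items()]

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/intro</loc></url>
  <url><loc>https://docs.example.com/assets</loc></url>
</urlset>"""

urls = urls_from_sitemap(sitemap)
# HTML-to-text happens upstream; here we fake the extracted text.
docs = docs_from_pages({u: f"text extracted from {u}" for u in urls})
```

The same unified `Document` shape would hold GitHub issues and discussions, with metadata distinguishing the source type for later namespace routing.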
Step 2: Embedding Generation
Convert ingested documents into vector embeddings suitable for similarity search. Each document's text content is processed through OpenAI's text-embedding-3-small model to produce dense vector representations. A custom I/O manager handles serialization of LangChain Document objects between asset materializations.
Key considerations:
- WeeklyPartitionsDefinition aligns embedding generation with source ingestion cadence
- AutomationCondition controls when embedding regeneration triggers
- The OpenAI embedding model (text-embedding-3-small) balances quality and cost
- Custom I/O manager handles non-standard LangChain Document serialization
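The serialization that a custom Dagster I/O manager performs for LangChain Documents can be sketched as a JSON-lines round trip. This is an illustration of the concept, not the pipeline's actual I/O manager: the function names and the `Document` stand-in are assumptions, and a real implementation would live in `handle_output`/`load_input` methods keyed by partition.

```python
import io
import json
from dataclasses import dataclass, field

@dataclass
class Document:
    """Minimal stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def dump_documents(docs, fp) -> None:
    # What an I/O manager's handle_output might do: one JSON object per line.
    for d in docs:
        fp.write(json.dumps({"page_content": d.page_content,
                             "metadata": d.metadata}) + "\n")

def load_documents(fp) -> list[Document]:
    # What load_input might do: rebuild Document objects from JSON lines.
    return [Document(**json.loads(line)) for line in fp if line.strip()]

docs = [Document("Dagster assets are declarative.",
                 {"source": "github-issue-42"})]
buf = io.StringIO()  # a file-backed I/O manager would use a partition path
dump_documents(docs, buf)
buf.seek(0)
restored = load_documents(buf)
```

Because `Document` is not picklable in a stable, inspectable way across library versions, an explicit JSON representation like this keeps materializations portable between asset runs.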
Step 3: Vector Database Storage
Upload generated embeddings to a Pinecone vector database configured for similarity search. The database index is created with dimensions matching the embedding model output and configured with the appropriate distance metric. Embeddings are organized by namespaces corresponding to data sources for filtered retrieval.
Key considerations:
- Pinecone index dimensions must match the embedding model's output dimensionality
- Namespace separation enables source-specific or cross-source queries
- A PineconeResource wraps database client initialization and configuration
- Index creation is idempotent; existing indexes are reused
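The upsert pattern can be illustrated with a toy in-memory index that mimics Pinecone's namespace and dimension semantics (the real pipeline calls the Pinecone client through a `PineconeResource`; everything here is a stand-in). The dimension check mirrors the requirement that the index match the embedding model's output, 1536 dimensions for text-embedding-3-small.

```python
class ToyIndex:
    """In-memory sketch of a Pinecone-style index: vectors grouped by
    namespace, with a fixed dimensionality enforced on upsert."""

    def __init__(self, dimension: int):
        self.dimension = dimension
        self.namespaces: dict[str, dict[str, list[float]]] = {}

    def upsert(self, vectors: list[tuple[str, list[float]]],
               namespace: str) -> None:
        ns = self.namespaces.setdefault(namespace, {})
        for vec_id, values in vectors:
            if len(values) != self.dimension:
                raise ValueError("vector dimension != index dimension")
            ns[vec_id] = values  # upsert semantics: overwrite on repeated id

# A real index for text-embedding-3-small would use dimension=1536.
index = ToyIndex(dimension=3)
index.upsert([("issue-1", [0.1, 0.2, 0.3])], namespace="github")
index.upsert([("page-1", [0.0, 1.0, 0.0])], namespace="docs")
index.upsert([("issue-1", [0.3, 0.2, 0.1])], namespace="github")  # overwrite
```

Keeping one namespace per data source is what later allows retrieval to filter to GitHub content only, documentation only, or search across all sources at once.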
Step 4: Retrieval and Question Answering
Accept natural language queries, retrieve relevant context from the vector database, and generate grounded answers using a large language model. The query is first embedded using the same model as the ingestion pipeline, then used for similarity search in Pinecone. Retrieved context documents are assembled into a prompt for GPT-4-turbo-preview, which generates a contextually grounded answer.
Key considerations:
- Query embedding uses the same model (text-embedding-3-small) as the ingestion pipeline
- Similarity search supports namespace filtering for source-specific retrieval
- The retrieval asset uses Dagster Config for runtime query parameterization
- MaterializeResult records the answer and source documents as metadata
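The retrieval step above can be sketched with toy vectors: rank stored embeddings by cosine similarity against the query embedding, then assemble the top matches into a grounded prompt. In the real pipeline both the query and the stored vectors come from OpenAI's text-embedding-3-small and the prompt goes to gpt-4-turbo-preview; the two-dimensional vectors and prompt template here are illustrative assumptions.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, store, k=2):
    # store: (text, vector) pairs as returned from the vector database
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question: str, contexts: list[str]) -> str:
    """Assemble retrieved context into a grounded-answer prompt."""
    joined = "\n---\n".join(contexts)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"

store = [
    ("Dagster assets are declarative.",  [1.0, 0.0]),
    ("Pinecone stores vectors.",         [0.0, 1.0]),
    ("Partitions slice assets by week.", [0.9, 0.1]),
]
query_vec = [1.0, 0.0]  # would be the OpenAI embedding of the question
contexts = top_k(query_vec, store, k=2)
prompt = build_prompt("What are Dagster assets?", contexts)
```

In the Dagster asset, the question would arrive via a `Config` object at runtime, and the resulting answer plus the retrieved source texts would be attached to a `MaterializeResult` as metadata for lineage and debugging.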