Workflow: Dagster RAG Pipeline
| Field | Value |
|---|---|
| Domains | LLMs, RAG, Data_Engineering |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
End-to-end process for building a Retrieval-Augmented Generation (RAG) system that ingests data from multiple sources, generates embeddings, stores them in a vector database, and answers questions using retrieved context.
Description
This workflow constructs a complete RAG pipeline orchestrated by Dagster. It ingests knowledge from GitHub issues/discussions via GraphQL and documentation websites via sitemap scraping, converts text into LangChain Document format, generates vector embeddings using OpenAI's embedding models, stores them in a Pinecone vector database, and provides a retrieval-based question-answering interface powered by GPT-4. The pipeline uses weekly partitions with automation conditions for incremental updates and custom I/O managers for LangChain Document serialization.
Usage
Execute this workflow when you need to build a knowledge-based question-answering system that draws on multiple data sources (code repositories, documentation sites, discussion forums). This is appropriate when you want to enable natural language queries over a large, evolving knowledge base with automatic incremental updates. Requires OpenAI API access and a Pinecone vector database account.
Execution Steps
Step 1: Data Source Ingestion
Extract knowledge content from multiple sources into a unified document format. GitHub issues and discussions are fetched using GraphQL API queries over configurable date ranges. Documentation websites are scraped by parsing sitemap XML files and extracting page content. All sources are converted to LangChain Document objects with rich metadata (source URL, timestamps, content type).
Key considerations:
- A custom GithubResource wraps GraphQL API interactions for rate-limited access
- SitemapScraper resource uses BeautifulSoup for HTML-to-text conversion
- Weekly partitions enable incremental ingestion of new content
- LangChain Document format provides a unified representation across source types
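As a minimal sketch of the ingestion step, the snippet below parses a sitemap XML string and wraps extracted page text in LangChain-style documents with source metadata. It uses only the standard library; the URLs, field names, and the `Document` stand-in are illustrative assumptions, and the real pipeline would fetch each page over HTTP and run BeautifulSoup for HTML-to-text conversion.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class Document:
    """Minimal stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

# Sitemaps live in this XML namespace; ElementTree needs it spelled out.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Extract page URLs (<loc> entries) from a sitemap XML string."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.iter(f"{SITEMAP_NS}loc")]

def docs_from_pages(pages: dict[str, str]) -> list[Document]:
    """Wrap extracted page text in Documents with source metadata."""
    return [Document(page_content=text, metadata={"source": url, "type": "docs"})
            for url, text in pages.items()]

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/intro</loc></url>
  <url><loc>https://docs.example.com/assets</loc></url>
</urlset>"""

urls = urls_from_sitemap(sitemap)
# HTML-to-text happens upstream; here we fake the extracted text.
docs = docs_from_pages({u: f"text extracted from {u}" for u in urls})
```

The same unified `Document` shape would hold GitHub issues and discussions, with metadata distinguishing the source type for later namespace routing.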
Step 2: Embedding Generation
Convert ingested documents into vector embeddings suitable for similarity search. Each document's text content is processed through OpenAI's text-embedding-3-small model to produce dense vector representations. A custom I/O manager handles serialization of LangChain Document objects between asset materializations.
Key considerations:
- WeeklyPartitionsDefinition aligns embedding generation with source ingestion cadence
- AutomationCondition controls when embedding regeneration triggers
- The OpenAI embedding model (text-embedding-3-small) balances quality and cost
- Custom I/O manager handles non-standard LangChain Document serialization
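The serialization that a custom Dagster I/O manager performs for LangChain Documents can be sketched as a JSON-lines round trip. This is an illustration of the concept, not the pipeline's actual I/O manager: the function names and the `Document` stand-in are assumptions, and a real implementation would live in `handle_output`/`load_input` methods keyed by partition.

```python
import io
import json
from dataclasses import dataclass, field

@dataclass
class Document:
    """Minimal stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def dump_documents(docs, fp) -> None:
    # What an I/O manager's handle_output might do: one JSON object per line.
    for d in docs:
        fp.write(json.dumps({"page_content": d.page_content,
                             "metadata": d.metadata}) + "\n")

def load_documents(fp) -> list[Document]:
    # What load_input might do: rebuild Document objects from JSON lines.
    return [Document(**json.loads(line)) for line in fp if line.strip()]

docs = [Document("Dagster assets are declarative.",
                 {"source": "github-issue-42"})]
buf = io.StringIO()  # a file-backed I/O manager would use a partition path
dump_documents(docs, buf)
buf.seek(0)
restored = load_documents(buf)
```

Because `Document` is not picklable in a stable, inspectable way across library versions, an explicit JSON representation like this keeps materializations portable between asset runs.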
Step 3: Vector Database Storage
Upload generated embeddings to a Pinecone vector database configured for similarity search. The database index is created with dimensions matching the embedding model output and configured with the appropriate distance metric. Embeddings are organized by namespaces corresponding to data sources for filtered retrieval.
Key considerations:
- Pinecone index dimensions must match the embedding model's output dimensionality
- Namespace separation enables source-specific or cross-source queries
- A PineconeResource wraps database client initialization and configuration
- Index creation is idempotent; existing indexes are reused
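The upsert pattern can be illustrated with a toy in-memory index that mimics Pinecone's namespace and dimension semantics (the real pipeline calls the Pinecone client through a `PineconeResource`; everything here is a stand-in). The dimension check mirrors the requirement that the index match the embedding model's output, 1536 dimensions for text-embedding-3-small.

```python
class ToyIndex:
    """In-memory sketch of a Pinecone-style index: vectors grouped by
    namespace, with a fixed dimensionality enforced on upsert."""

    def __init__(self, dimension: int):
        self.dimension = dimension
        self.namespaces: dict[str, dict[str, list[float]]] = {}

    def upsert(self, vectors: list[tuple[str, list[float]]],
               namespace: str) -> None:
        ns = self.namespaces.setdefault(namespace, {})
        for vec_id, values in vectors:
            if len(values) != self.dimension:
                raise ValueError("vector dimension != index dimension")
            ns[vec_id] = values  # upsert semantics: overwrite on repeated id

# A real index for text-embedding-3-small would use dimension=1536.
index = ToyIndex(dimension=3)
index.upsert([("issue-1", [0.1, 0.2, 0.3])], namespace="github")
index.upsert([("page-1", [0.0, 1.0, 0.0])], namespace="docs")
index.upsert([("issue-1", [0.3, 0.2, 0.1])], namespace="github")  # overwrite
```

Keeping one namespace per data source is what later allows retrieval to filter to GitHub content only, documentation only, or search across all sources at once.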
Step 4: Retrieval and Question Answering
Accept natural language queries, retrieve relevant context from the vector database, and generate grounded answers using a large language model. The query is first embedded using the same model as the ingestion pipeline, then used for similarity search in Pinecone. Retrieved context documents are assembled into a prompt for GPT-4-turbo-preview, which generates a contextually grounded answer.
Key considerations:
- Query embedding uses the same model (text-embedding-3-small) as the ingestion pipeline
- Similarity search supports namespace filtering for source-specific retrieval
- The retrieval asset uses Dagster Config for runtime query parameterization
- MaterializeResult records the answer and source documents as metadata
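The retrieval step above can be sketched with toy vectors: rank stored embeddings by cosine similarity against the query embedding, then assemble the top matches into a grounded prompt. In the real pipeline both the query and the stored vectors come from OpenAI's text-embedding-3-small and the prompt goes to gpt-4-turbo-preview; the two-dimensional vectors and prompt template here are illustrative assumptions.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, store, k=2):
    # store: (text, vector) pairs as returned from the vector database
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question: str, contexts: list[str]) -> str:
    """Assemble retrieved context into a grounded-answer prompt."""
    joined = "\n---\n".join(contexts)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"

store = [
    ("Dagster assets are declarative.",  [1.0, 0.0]),
    ("Pinecone stores vectors.",         [0.0, 1.0]),
    ("Partitions slice assets by week.", [0.9, 0.1]),
]
query_vec = [1.0, 0.0]  # would be the OpenAI embedding of the question
contexts = top_k(query_vec, store, k=2)
prompt = build_prompt("What are Dagster assets?", contexts)
```

In the Dagster asset, the question would arrive via a `Config` object at runtime, and the resulting answer plus the retrieved source texts would be attached to a `MaterializeResult` as metadata for lineage and debugging.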