Principle:Langgenius Dify Data Source Selection
| Knowledge Sources | |
|---|---|
| Domains | RAG, Data Ingestion, Knowledge Base |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Data source selection is the process of choosing and configuring how raw data enters a knowledge base, abstracting different ingestion origins behind a uniform dataset initialization interface.
Description
In any Retrieval-Augmented Generation (RAG) pipeline, the very first decision is where the data comes from. Data source selection defines the abstraction layer that allows a knowledge base to ingest content from heterogeneous origins -- local file uploads, third-party integrations such as Notion, web scraping endpoints, or programmatic API calls -- while presenting a consistent dataset object to downstream processing stages.
Dify models this as a two-phase operation:
- Dataset creation -- an empty dataset container is initialized with a name and optional metadata before any documents are added.
- Document attachment -- one or more documents from a chosen source type are attached to the dataset, each carrying source-specific metadata (file ID, Notion page ID, URL, etc.).
This separation is important because it decouples the identity of the knowledge base from the content that populates it. A single dataset can later receive documents from multiple source types, and the creation step can happen independently of document ingestion (for example, when a user wants to reserve a dataset name and configure settings before uploading files).
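The two phases can be illustrated with a small in-memory model. This is a minimal sketch rather than Dify's actual data model: the `Dataset` and `Document` classes, their field names, and the sample IDs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any
import uuid


@dataclass
class Document:
    # Field names are illustrative, not taken from Dify.
    source_type: str                 # e.g. "file", "notion", "web"
    source_metadata: dict[str, Any]  # e.g. {"file_id": ...} or {"url": ...}


@dataclass
class Dataset:
    name: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    documents: list[Document] = field(default_factory=list)

    def attach(self, document: Document) -> None:
        """Phase 2: attach a document from any source type."""
        self.documents.append(document)


# Phase 1: the dataset exists, and has an ID, before any content does.
dataset = Dataset(name="product-docs")

# Phase 2: documents from different source types share one container.
dataset.attach(Document("file", {"file_id": "f-123"}))
dataset.attach(Document("web", {"url": "https://example.com/faq"}))
```

Because the dataset's identity does not depend on its contents, the second phase can run later, repeatedly, and with a different source type each time.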
Usage
Use data source selection when:
- Initializing a new knowledge base -- the user begins the creation wizard and must first choose whether to upload files, connect Notion, or scrape a website.
- Adding documents to an existing knowledge base -- the same source abstraction applies when appending new content to a dataset that already contains documents.
- Programmatically provisioning datasets -- API consumers may create empty datasets first and populate them in a separate step, enabling batch or asynchronous ingestion workflows.
Theoretical Basis
Source Abstraction Model
The data source selection pattern follows the Strategy design pattern. Each source type implements a common ingestion interface while encapsulating its own authentication, pagination, and format-handling logic:
DataSourceStrategy
+-- FileUploadStrategy (accepts multipart file uploads)
+-- NotionImportStrategy (reads pages via Notion API)
+-- WebScrapingStrategy (crawls URLs and extracts text)
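One way to express this hierarchy in code is an abstract base class with one concrete class per source. The sketch below is hypothetical: the `validate` and `fetch` method names, the parameter shapes, and the `ingest` dispatcher are assumptions, and the Notion strategy is omitted because it would require real API credentials.

```python
from abc import ABC, abstractmethod


class DataSourceStrategy(ABC):
    """Common ingestion interface; method names here are hypothetical."""

    source_type: str

    @abstractmethod
    def validate(self, params: dict) -> None:
        """Raise ValueError if the source-specific parameters are unusable."""

    @abstractmethod
    def fetch(self, params: dict) -> list[dict]:
        """Return documents as dicts carrying source-specific metadata."""


class FileUploadStrategy(DataSourceStrategy):
    source_type = "file"

    def validate(self, params: dict) -> None:
        if not params.get("file_ids"):
            raise ValueError("at least one file ID is required")

    def fetch(self, params: dict) -> list[dict]:
        return [
            {"source_type": self.source_type, "file_id": file_id}
            for file_id in params["file_ids"]
        ]


class WebScrapingStrategy(DataSourceStrategy):
    source_type = "web"

    def validate(self, params: dict) -> None:
        urls = params.get("urls") or []
        if not urls:
            raise ValueError("at least one URL is required")
        for url in urls:
            if not url.startswith(("http://", "https://")):
                raise ValueError(f"unsupported URL: {url}")

    def fetch(self, params: dict) -> list[dict]:
        # A real implementation would crawl here; this stub only records the URL.
        return [{"source_type": self.source_type, "url": url} for url in params["urls"]]


# Registry keyed by source type, so callers never branch on the concrete class.
STRATEGIES = {s.source_type: s for s in (FileUploadStrategy(), WebScrapingStrategy())}


def ingest(source_type: str, params: dict) -> list[dict]:
    strategy = STRATEGIES[source_type]
    strategy.validate(params)
    return strategy.fetch(params)
```

Calling `ingest("file", {"file_ids": ["f-1"]})` and `ingest("web", {"urls": [...]})` goes through the same code path, which is the point of the pattern: adding a new source means adding a class and a registry entry, not editing the caller.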
At the platform level, the workflow proceeds as follows:
1. User selects source type (file | notion | web)
2. Platform creates an empty Dataset record
   POST /datasets -> { id, name, created_at }
3. User configures source-specific parameters
   (file list, Notion workspace + pages, target URLs)
4. Platform attaches documents to the Dataset
   POST /datasets/{id}/documents -> Document[]
5. Documents enter the processing pipeline
   (chunking, embedding, indexing)
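Steps 2 through 4 can be sketched as a function that issues the two POST calls in order. To keep the example runnable without a server, the HTTP layer is replaced by an injected `post` callable backed by an in-memory fake. The request payload shapes, the `ds-N` ID format, and the fields on the returned documents are assumptions; only the two paths and the `{ id, name, created_at }` response come from the workflow above. Step 5 (chunking, embedding, indexing) is outside the scope of this sketch.

```python
from datetime import datetime, timezone
from typing import Any, Callable
import itertools

Post = Callable[[str, dict], Any]


def provision(post: Post, name: str, source_type: str, source_params: dict) -> tuple[dict, list]:
    """Steps 2-4: create an empty dataset, then attach documents to it."""
    dataset = post("/datasets", {"name": name})
    documents = post(
        f"/datasets/{dataset['id']}/documents",
        {"source_type": source_type, **source_params},
    )
    return dataset, documents


def make_fake_post() -> Post:
    """In-memory stand-in for the platform, for illustration only."""
    ids = itertools.count(1)
    datasets: dict[str, dict] = {}

    def post(path: str, payload: dict) -> Any:
        parts = path.strip("/").split("/")
        if parts == ["datasets"]:
            dataset = {
                "id": f"ds-{next(ids)}",
                "name": payload["name"],
                "created_at": datetime.now(timezone.utc).isoformat(),
            }
            datasets[dataset["id"]] = dataset
            return dataset
        if len(parts) == 3 and parts[0] == "datasets" and parts[2] == "documents":
            if parts[1] not in datasets:
                raise KeyError(f"unknown dataset: {parts[1]}")
            return [
                {"dataset_id": parts[1], "source_type": payload["source_type"], "url": url}
                for url in payload.get("urls", [])
            ]
        raise ValueError(f"unsupported path: {path}")

    return post


dataset, documents = provision(
    make_fake_post(), "faq", "web", {"urls": ["https://example.com/faq"]}
)
```

Passing the transport in as a parameter also makes the ordering constraint explicit: the second call cannot be built until the first has returned a dataset ID.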
Why Create an Empty Dataset First?
Creating the dataset before attaching documents provides several advantages:
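- Atomicity -- dataset creation succeeds or fails as a single small step, independent of ingestion, so the dataset ID is available immediately for client-side routing, progress tracking, and error recovery.
- Idempotency -- if document attachment fails (network error, quota exceeded), the dataset still exists, so the user can retry attachment against the same dataset ID without recreating it or losing configuration.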
- Flexibility -- settings such as indexing technique and retrieval configuration can be applied to the empty dataset before any documents are processed.
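The retry benefit can be shown with a toy example in which attachment fails on a quota check. Everything here is hypothetical: the `QuotaExceeded` exception, the `attach_documents` helper, the dict-based dataset, and the setting value are stand-ins chosen for illustration.

```python
class QuotaExceeded(Exception):
    """Hypothetical error raised when an attachment would exceed a quota."""


def attach_documents(dataset: dict, files: list[str], quota: int) -> None:
    # Check before mutating, so a failed call leaves the dataset untouched.
    if len(dataset["documents"]) + len(files) > quota:
        raise QuotaExceeded(f"quota of {quota} documents exceeded")
    dataset["documents"].extend(files)


# The dataset and its settings exist before any document is attached.
# The setting value below is illustrative.
dataset = {"id": "ds-1", "indexing_technique": "high_quality", "documents": []}

try:
    attach_documents(dataset, ["a.pdf", "b.pdf", "c.pdf"], quota=2)
except QuotaExceeded:
    # The dataset ID and configuration survived the failure, so the retry
    # targets the same dataset with a smaller batch.
    attach_documents(dataset, ["a.pdf", "b.pdf"], quota=2)
```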
Considerations
- Source validation -- each source type should validate inputs early (file size limits, Notion permissions, URL reachability) before dataset creation to avoid orphaned empty datasets.
- Metadata propagation -- source-specific metadata (file name, Notion page title, URL) should flow through to the document and segment levels so that retrieval results can cite their origin.
- Rate limiting -- external sources (Notion API, web scraping) are subject to rate limits; the ingestion layer must handle retries and back-off gracefully.
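For the rate-limiting consideration, a common approach is to retry with exponential back-off and jitter. The sketch below is one possible shape, not Dify's implementation: the `RateLimited` exception, the `with_backoff` helper, and its default values are assumptions. The `sleep` parameter is injectable so the example runs without actually waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised when an external source throttles a request."""


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on RateLimited, doubling the delay cap after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The cap grows as base_delay * 2**attempt; a random factor between
            # 0.5 and 1.0 spreads out clients that were throttled together.
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
    raise AssertionError("unreachable")


# Demonstration: a fetch that is throttled twice, then succeeds.
attempts = {"count": 0}


def flaky_fetch() -> list[str]:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimited("slow down")
    return ["page-1"]


delays: list[float] = []
result = with_backoff(flaky_fetch, sleep=delays.append)
```

In this run the helper records two delays before the third attempt succeeds. A production version would likely also honor any retry hint the external service returns.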