

Principle:Langgenius Dify Data Source Selection

From Leeroopedia


Knowledge Sources
Domains: RAG, Data Ingestion, Knowledge Base
Last Updated: 2026-02-08 00:00 GMT

Overview

Data source selection is the process of choosing and configuring how raw data enters a knowledge base, abstracting different ingestion origins behind a uniform dataset initialization interface.

Description

In any Retrieval-Augmented Generation (RAG) pipeline, the first decision is where the data comes from. Data source selection defines the abstraction layer that lets a knowledge base ingest content from heterogeneous origins -- local file uploads, third-party integrations such as Notion, web scraping, or programmatic API calls -- while presenting a consistent dataset object to downstream processing stages.

Dify models this as a two-phase operation:

  1. Dataset creation -- an empty dataset container is initialized with a name and optional metadata before any documents are added.
  2. Document attachment -- one or more documents from a chosen source type are attached to the dataset, each carrying source-specific metadata (file ID, Notion page ID, URL, etc.).

This separation is important because it decouples the identity of the knowledge base from the content that populates it. A single dataset can later receive documents from multiple source types, and the creation step can happen independently of document ingestion (for example, when a user wants to reserve a dataset name and configure settings before uploading files).
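
The two phases can be sketched as a pair of functions over a small in-memory model. This is a minimal illustration, not Dify's implementation: the `Dataset` and `Document` classes, the function names, and the metadata keys (`file_id`, `url`) are hypothetical and chosen only to mirror the description above.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Document:
    source_type: str   # e.g. "file", "notion", "web"
    source_meta: dict  # source-specific metadata (file ID, page ID, URL, ...)


@dataclass
class Dataset:
    name: str
    id: str = field(default_factory=lambda: uuid4().hex)
    documents: list = field(default_factory=list)


def create_dataset(name: str) -> Dataset:
    """Phase 1: initialize an empty container; no documents yet."""
    return Dataset(name=name)


def attach_documents(dataset: Dataset, docs: list) -> Dataset:
    """Phase 2: attach documents from any mix of source types."""
    dataset.documents.extend(docs)
    return dataset


# The dataset exists (and has an ID) before any content arrives,
# and later accepts documents from more than one source type.
kb = create_dataset("product-docs")
attach_documents(kb, [Document("file", {"file_id": "f-1"})])
attach_documents(kb, [Document("web", {"url": "https://example.com/faq"})])
print(kb.name, [d.source_type for d in kb.documents])
```

Because the dataset's identity is fixed in phase 1, the two `attach_documents` calls can happen at different times and from different sources without affecting each other.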

Usage

Use data source selection when:

  • Initializing a new knowledge base -- the user begins the creation wizard and must first choose whether to upload files, connect Notion, or scrape a website.
  • Adding documents to an existing knowledge base -- the same source abstraction applies when appending new content to a dataset that already contains documents.
  • Programmatically provisioning datasets -- API consumers may create empty datasets first and populate them in a separate step, enabling batch or asynchronous ingestion workflows.

Theoretical Basis

Source Abstraction Model

The data source selection pattern follows the Strategy design pattern. Each source type implements a common ingestion interface while encapsulating its own authentication, pagination, and format-handling logic:

DataSourceStrategy
  +-- FileUploadStrategy        (accepts multipart file uploads)
  +-- NotionImportStrategy      (reads pages via Notion API)
  +-- WebScrapingStrategy       (crawls URLs and extracts text)
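
A minimal Python sketch of this hierarchy follows. The class names come from the diagram above; the `load` method, the `params` keys, and the `STRATEGIES` registry are assumptions made for illustration, and the bodies only echo their inputs rather than performing real uploads, Notion API calls, or crawling.

```python
from abc import ABC, abstractmethod


class DataSourceStrategy(ABC):
    """Common ingestion interface; each subclass hides its own details."""

    source_type: str

    @abstractmethod
    def load(self, params: dict) -> list:
        """Return one record per document, tagged with its origin."""


class FileUploadStrategy(DataSourceStrategy):
    source_type = "file"

    def load(self, params: dict) -> list:
        # A real strategy would handle multipart uploads here.
        return [{"source_type": self.source_type, "file_id": fid}
                for fid in params["file_ids"]]


class NotionImportStrategy(DataSourceStrategy):
    source_type = "notion"

    def load(self, params: dict) -> list:
        # A real strategy would authenticate and page through the Notion API.
        return [{"source_type": self.source_type,
                 "workspace_id": params["workspace_id"],
                 "page_id": pid}
                for pid in params["page_ids"]]


class WebScrapingStrategy(DataSourceStrategy):
    source_type = "web"

    def load(self, params: dict) -> list:
        # A real strategy would crawl each URL and extract text.
        return [{"source_type": self.source_type, "url": url}
                for url in params["urls"]]


# Hypothetical registry: callers pick a strategy by source type only.
STRATEGIES = {
    cls.source_type: cls()
    for cls in (FileUploadStrategy, NotionImportStrategy, WebScrapingStrategy)
}


def ingest(source_type: str, params: dict) -> list:
    return STRATEGIES[source_type].load(params)


print(ingest("web", {"urls": ["https://example.com"]}))
```

The caller of `ingest` never branches on the source type; adding a new origin means adding one subclass and one registry entry.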

At the platform level, the workflow proceeds as follows:

1.  User selects source type (file | notion | web)
2.  Platform creates an empty Dataset record
       POST /datasets  ->  { id, name, created_at }
3.  User configures source-specific parameters
       (file list, Notion workspace + pages, target URLs)
4.  Platform attaches documents to the Dataset
       POST /datasets/{id}/documents  ->  Document[]
5.  Documents enter the processing pipeline
       (chunking, embedding, indexing)
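
The five steps can be traced end to end with an in-memory stand-in for the platform. This is a sketch under stated assumptions: `InMemoryPlatform`, its method names, the `ds-N` ID format, and the document fields are hypothetical, and no HTTP requests are made. Only the two endpoint shapes in the comments are taken from the workflow above.

```python
from datetime import datetime, timezone
from itertools import count


class InMemoryPlatform:
    """Hypothetical stand-in for the platform; no network calls are made."""

    def __init__(self):
        self._ids = count(1)
        self.datasets = {}
        self.queue = []  # documents waiting for chunking, embedding, indexing

    def post_dataset(self, name):
        # Step 2: POST /datasets -> { id, name, created_at }
        dataset = {
            "id": f"ds-{next(self._ids)}",
            "name": name,
            "created_at": datetime.now(timezone.utc).isoformat(),
        }
        self.datasets[dataset["id"]] = {**dataset, "documents": []}
        return dataset

    def post_documents(self, dataset_id, source_type, items):
        # Step 4: POST /datasets/{id}/documents -> Document[]
        docs = [
            {"dataset_id": dataset_id, "source_type": source_type, "source": item}
            for item in items
        ]
        self.datasets[dataset_id]["documents"].extend(docs)
        self.queue.extend(docs)  # Step 5: hand off to the processing pipeline
        return docs


platform = InMemoryPlatform()
source_type = "web"                                   # step 1: choose source
dataset = platform.post_dataset("support-site")       # step 2: empty dataset
targets = ["https://example.com/docs",
           "https://example.com/faq"]                 # step 3: source params
docs = platform.post_documents(dataset["id"], source_type, targets)  # step 4
print(dataset["id"], len(docs), len(platform.queue))
```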

Why Create an Empty Dataset First?

Creating the dataset before attaching documents provides several advantages:

  • Early identity -- the dataset ID is available immediately for client-side routing, progress tracking, and error recovery.
  • Recoverability -- if document attachment fails (network error, quota exceeded), the dataset still exists and the user can retry against the same ID without losing configuration.
  • Flexibility -- settings such as indexing technique and retrieval configuration can be applied to the empty dataset before any documents are processed.
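
The retry behavior described above can be illustrated with a simulated failure. Everything here is hypothetical: the in-memory `datasets` store, the function names, the `AttachmentError` exception, and the placeholder setting value `"example-mode"` stand in for whatever the platform actually uses.

```python
class AttachmentError(Exception):
    """Simulated failure, e.g. a network error or an exceeded quota."""


datasets = {}  # in-memory stand-in for the platform's dataset store


def create_dataset(dataset_id, name, **settings):
    # Settings are stored on the empty dataset, before any documents exist.
    datasets[dataset_id] = {"name": name, "settings": settings, "documents": []}
    return dataset_id


def attach_documents(dataset_id, docs, fail=False):
    if fail:
        raise AttachmentError("simulated quota exceeded")
    datasets[dataset_id]["documents"].extend(docs)


ds_id = create_dataset("ds-1", "handbook", indexing_technique="example-mode")

try:
    attach_documents(ds_id, ["handbook.pdf"], fail=True)
except AttachmentError:
    pass  # the dataset and its settings are untouched by the failure

attach_documents(ds_id, ["handbook.pdf"])  # retry against the same ID
print(datasets[ds_id])
```

The failed attempt leaves nothing to clean up: the ID, name, and settings from the creation step are all still in place when the retry runs.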

Considerations

  • Source validation -- each source type should validate inputs early (file size limits, Notion permissions, URL reachability) before dataset creation to avoid orphaned empty datasets.
  • Metadata propagation -- source-specific metadata (file name, Notion page title, URL) should flow through to the document and segment levels so that retrieval results can cite their origin.
  • Rate limiting -- external sources (Notion API, web scraping) are subject to rate limits; the ingestion layer must handle retries and back-off gracefully.
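
Two of these considerations, early validation and back-off, lend themselves to a short sketch. The names `validate_source`, `with_backoff`, and `RateLimited`, the `params` layout, and the 15 MiB limit are assumptions for illustration. The URL check is syntactic only; real reachability and Notion permission checks need live requests and are omitted.

```python
import time
from urllib.parse import urlparse

MAX_FILE_BYTES = 15 * 1024 * 1024  # assumed limit, for illustration only


def validate_source(source_type, params):
    """Return a list of problems; an empty list means inputs look usable."""
    errors = []
    if source_type == "file":
        for name, size in params["files"]:
            if size > MAX_FILE_BYTES:
                errors.append(f"{name}: exceeds size limit")
    elif source_type == "notion":
        if not params.get("page_ids"):
            errors.append("no Notion pages selected")
    elif source_type == "web":
        for url in params["urls"]:
            parts = urlparse(url)
            if parts.scheme not in ("http", "https") or not parts.netloc:
                errors.append(f"{url}: not an absolute http(s) URL")
    else:
        errors.append(f"unknown source type: {source_type}")
    return errors


class RateLimited(Exception):
    """Raised by a fetcher when the external source asks us to slow down."""


def with_backoff(call, retries=4, base_delay=0.5, sleep=time.sleep):
    """Retry `call` on RateLimited, doubling the delay after each attempt."""
    for attempt in range(retries):
        try:
            return call()
        except RateLimited:
            if attempt == retries - 1:
                raise  # out of attempts; surface the error to the caller
            sleep(base_delay * 2 ** attempt)


problems = validate_source("web", {"urls": ["ftp://x", "https://example.com"]})
print(problems)
```

Running `validate_source` before the dataset is created keeps bad inputs from leaving an empty dataset behind, while `with_backoff` wraps only the calls that reach an external service.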

Related Pages

Implemented By
