Principle:Langgenius Dify Document Upload

Knowledge Sources	Domains	Last Updated
Dify	RAG, Knowledge_Management, Frontend	2026-02-12 00:00 GMT

Overview

Description

Document Upload is the process of ingesting source content into a Dify knowledge base (dataset) so that it can be chunked, embedded, and made available for retrieval. Documents are the primary unit of content within a dataset and can originate from multiple data sources: local file uploads, Notion workspace imports, or web crawling.

The upload process is tightly coupled with the document processing pipeline. When a document is created, the caller specifies not just the data source but also the full processing configuration: chunking mode, process rules, embedding model, and retrieval model. This declarative submission pattern means the entire processing intent is captured in a single request, allowing the backend to execute the ingestion pipeline asynchronously without further interaction.
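As a rough illustration of the single-request pattern described above, the sketch below assembles the data source and the processing configuration into one object. The top-level keys (process_rule, doc_form, embedding_model, retrieval_model) are the ones this page names. The nested fields, the placeholder values, and the helper name buildCreateDocumentRequest are assumptions made for the example, not Dify's actual type definitions.

```typescript
// Hypothetical request shape. The top-level configuration keys are the ones
// this page names; every nested field and value below is an assumption.
type CreateDocumentReqSketch = {
  data_source: { type: string; info_list: unknown }
  doc_form: string // chunking mode
  process_rule: { mode: string; rules?: Record<string, unknown> }
  embedding_model: string
  retrieval_model: { search_method: string; top_k: number }
}

// Capture the whole processing intent up front so it travels in one request.
// All string values are placeholders, not real Dify option names.
function buildCreateDocumentRequest(fileIds: string[]): CreateDocumentReqSketch {
  return {
    data_source: { type: 'upload_file', info_list: { file_ids: fileIds } },
    doc_form: 'example-chunking-mode',
    process_rule: { mode: 'example-rule-mode' },
    embedding_model: 'example-embedding-model',
    retrieval_model: { search_method: 'example-search-method', top_k: 3 },
  }
}

const request = buildCreateDocumentRequest(['file-123'])
console.log(JSON.stringify(request, null, 2))
```

Because nothing in the object depends on a later call, a backend receiving it would have everything it needs to run the pipeline without asking the caller for more input.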

Dify provides two distinct creation paths:

createDocument -- Adds a document to an existing dataset, identified by datasetId.
createFirstDocument -- Creates a dataset and its first document atomically via the /datasets/init endpoint, streamlining the common case where users upload a file as part of initial knowledge base setup.
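The choice between the two paths can be sketched as a small routing helper. Only the /datasets/init path comes from this page. The path used for an existing dataset and the helper name resolveCreateEndpoint are assumptions for illustration.

```typescript
// Pick the endpoint for a document-creation request.
// '/datasets/init' is taken from this page; the per-dataset path below is an
// assumed shape and may not match the real route.
function resolveCreateEndpoint(datasetId?: string): string {
  if (datasetId) {
    // Existing dataset: the document is added to it (createDocument path).
    return `/datasets/${encodeURIComponent(datasetId)}/documents`
  }
  // No dataset yet: create the dataset and its first document together
  // (createFirstDocument path).
  return '/datasets/init'
}

console.log(resolveCreateEndpoint())
console.log(resolveCreateEndpoint('dataset-42'))
```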

Usage

File-based ingestion -- Upload PDFs, text files, Word documents, or other supported formats from the local filesystem.
Notion integration -- Import pages from a connected Notion workspace, with workspace and page identifiers specified in the data source configuration.
Web crawling -- Ingest content from websites using providers like Firecrawl, Jina Reader, or WaterCrawl, passing crawl job results as the data source.
Initial setup flow -- Use createFirstDocument to combine dataset creation and first document upload into a single step, reducing round-trips.
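The three ingestion sources listed above could be modeled as a tagged union, where the type field selects the shape of info_list. The type literals are the ones this page gives for the DataSource type. The fields inside each info_list variant and the helper describeSource are hypothetical, chosen only to make the example concrete.

```typescript
// Each variant pairs a literal `type` tag with its own info_list shape.
// The info_list field names are assumptions, not Dify's real definitions.
type UploadFileSource = {
  type: 'upload_file'
  info_list: { file_ids: string[] }
}
type NotionImportSource = {
  type: 'notion_import'
  info_list: { workspace_id: string; page_ids: string[] }
}
type WebsiteCrawlSource = {
  type: 'website_crawl'
  info_list: { provider: string; job_id: string; urls: string[] }
}
type DataSourceSketch = UploadFileSource | NotionImportSource | WebsiteCrawlSource

// Switching on `type` narrows info_list to the matching variant.
function describeSource(source: DataSourceSketch): string {
  switch (source.type) {
    case 'upload_file':
      return `upload_file: ${source.info_list.file_ids.length} file(s)`
    case 'notion_import':
      return `notion_import: ${source.info_list.page_ids.length} page(s) from workspace ${source.info_list.workspace_id}`
    case 'website_crawl':
      return `website_crawl: ${source.info_list.urls.length} URL(s) via ${source.info_list.provider}`
    default: {
      // Compile-time exhaustiveness check: adding a variant without a case
      // makes this assignment a type error.
      const unhandled: never = source
      throw new Error(`Unhandled data source: ${JSON.stringify(unhandled)}`)
    }
  }
}

console.log(describeSource({ type: 'upload_file', info_list: { file_ids: ['f1', 'f2'] } }))
```

The never assignment in the default branch is what gives the pattern its safety: a new source type cannot be added to the union without the compiler flagging every switch that fails to handle it.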

Theoretical Basis

Multi-Source Abstraction -- The DataSource type uses a discriminated union pattern (type: 'upload_file' | 'notion_import' | 'website_crawl') with corresponding info_list variants. This allows a single API surface to handle heterogeneous ingestion sources while maintaining type safety.
Declarative Pipeline Configuration -- By bundling process_rule, doc_form, embedding_model, and retrieval_model into the creation request (CreateDocumentReq), the system adopts a desired-state model. The backend interprets the full specification and orchestrates the pipeline stages (parsing, cleaning, splitting, indexing) without requiring incremental instructions.
Batch Processing -- The response includes a batch identifier, enabling callers to track the progress of multiple documents submitted together. This batch abstraction supports both single-document and bulk-import workflows.
Atomicity of First-Document Creation -- The /datasets/init endpoint wraps dataset creation and document upload in a single transaction, so the initial setup flow cannot leave behind a partially created dataset that contains no documents.
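To make the batch idea concrete, the sketch below aggregates per-document statuses that share a batch identifier into one progress summary. The record shape, the status names, and the helper summarizeBatch are all hypothetical. This page only states that the response carries a batch identifier.

```typescript
// Hypothetical per-document status record; field and status names are
// assumptions for illustration.
type DocumentStatusSketch = {
  id: string
  batch: string
  status: 'waiting' | 'indexing' | 'completed' | 'error'
}

type BatchSummary = {
  total: number
  completed: number
  failed: number
  done: boolean
}

// Summarize every document that was submitted under the given batch id.
function summarizeBatch(batch: string, docs: DocumentStatusSketch[]): BatchSummary {
  const inBatch = docs.filter(doc => doc.batch === batch)
  const completed = inBatch.filter(doc => doc.status === 'completed').length
  const failed = inBatch.filter(doc => doc.status === 'error').length
  return {
    total: inBatch.length,
    completed,
    failed,
    // A batch is finished once every document has either completed or failed.
    done: inBatch.length > 0 && completed + failed === inBatch.length,
  }
}

const statuses: DocumentStatusSketch[] = [
  { id: 'doc-1', batch: 'batch-a', status: 'completed' },
  { id: 'doc-2', batch: 'batch-a', status: 'indexing' },
  { id: 'doc-3', batch: 'batch-b', status: 'completed' },
]
console.log(summarizeBatch('batch-a', statuses))
```

The same function covers a single-document upload (a batch of one) and a bulk import, which is the point of tracking by batch rather than by individual document.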

Related Pages

Implementation:Langgenius_Dify_CreateDocument
