Principle:Langgenius Dify Document Upload
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| Dify | RAG, Knowledge_Management, Frontend | 2026-02-12 00:00 GMT |
Overview
Description
Document Upload is the process of ingesting source content into a Dify knowledge base (dataset) so that it can be chunked, embedded, and made available for retrieval. Documents are the primary unit of content within a dataset and can originate from multiple data sources: local file uploads, Notion workspace imports, or web crawling.
The upload process is tightly coupled with the document processing pipeline. When a document is created, the caller specifies not just the data source but also the full processing configuration: chunking mode, process rules, embedding model, and retrieval model. This declarative submission pattern means the entire processing intent is captured in a single request, allowing the backend to execute the ingestion pipeline asynchronously without further interaction.
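A minimal sketch of what such a declarative request might look like. Only the top-level field names (process_rule, doc_form, embedding_model, retrieval_model) and the data source type values come from this page. The data_source key, every nested field, and every placeholder value below are assumptions made for illustration, not the actual CreateDocumentReq definition.

```typescript
// Hypothetical shape: nested fields are illustrative assumptions,
// not the real Dify type definitions.
type DataSourceType = 'upload_file' | 'notion_import' | 'website_crawl'

interface CreateDocumentRequestSketch {
  data_source: { type: DataSourceType; info_list: unknown }
  doc_form: string
  process_rule: { mode: string; rules?: Record<string, unknown> }
  embedding_model: string
  retrieval_model: Record<string, unknown>
}

// Bundle the data source and the full processing configuration into one
// object, so a single submission captures the whole processing intent.
function buildUploadRequest(fileIds: string[]): CreateDocumentRequestSketch {
  return {
    data_source: {
      type: 'upload_file',
      info_list: { file_ids: fileIds }, // assumed field name
    },
    doc_form: 'text_model', // placeholder chunking mode
    process_rule: { mode: 'automatic' }, // placeholder rule mode
    embedding_model: 'example-embedding-model', // placeholder model name
    retrieval_model: { search_method: 'semantic_search', top_k: 3 }, // placeholder settings
  }
}

const exampleRequest = buildUploadRequest(['file-123'])
console.log(JSON.stringify(exampleRequest, null, 2))
```

Because the configuration travels with the data source, the caller has nothing further to send once the request is accepted.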
Dify provides two distinct creation paths:
- createDocument -- Adds a document to an existing dataset, identified by datasetId.
- createFirstDocument -- Creates a dataset and its first document atomically via the /datasets/init endpoint, streamlining the common case where users upload a file as part of initial knowledge base setup.
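The choice between the two paths can be sketched as a small routing helper. The /datasets/init path is taken from the description above. The per-dataset documents path, the POST method, and the helper name resolveCreateEndpoint are assumptions for illustration and may not match the actual service layer.

```typescript
interface Endpoint {
  method: 'POST' // assumed HTTP method
  path: string
}

// Pick the creation path: without a dataset id, the dataset and its first
// document are created together; with one, the document joins that dataset.
function resolveCreateEndpoint(datasetId?: string): Endpoint {
  if (datasetId === undefined || datasetId === '')
    return { method: 'POST', path: '/datasets/init' }

  // Assumed URL layout for adding a document to an existing dataset.
  return {
    method: 'POST',
    path: `/datasets/${encodeURIComponent(datasetId)}/documents`,
  }
}

console.log(resolveCreateEndpoint().path)
console.log(resolveCreateEndpoint('abc').path)
```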
Usage
- File-based ingestion -- Upload PDFs, text files, Word documents, or other supported formats from the local filesystem.
- Notion integration -- Import pages from a connected Notion workspace, with workspace and page identifiers specified in the data source configuration.
- Web crawling -- Ingest content from websites using providers like Firecrawl, Jina Reader, or WaterCrawl, passing crawl job results as the data source.
- Initial setup flow -- Use createFirstDocument to combine dataset creation and first document upload into a single step, reducing round-trips.
Theoretical Basis
- Multi-Source Abstraction -- The DataSource type uses a discriminated union pattern (type: 'upload_file' | 'notion_import' | 'website_crawl') with corresponding info_list variants. This allows a single API surface to handle heterogeneous ingestion sources while maintaining type safety.
- Declarative Pipeline Configuration -- By bundling process_rule, doc_form, embedding_model, and retrieval_model into the creation request (CreateDocumentReq), the system adopts a desired-state model. The backend interprets the full specification and orchestrates the pipeline stages (parsing, cleaning, splitting, indexing) without requiring incremental instructions.
- Batch Processing -- The response includes a batch identifier, enabling callers to track the progress of multiple documents submitted together. This batch abstraction supports both single-document and bulk-import workflows.
- Atomicity of First-Document Creation -- The /datasets/init endpoint wraps dataset creation and document upload in a single transaction, ensuring that a partially-created dataset without any documents cannot exist in the system when using the initial setup flow.
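The discriminated union behind the multi-source abstraction can be illustrated with a short sketch. The three type values come from the description above. The fields inside each info_list variant and the helper countSourceItems are hypothetical, chosen only to show how narrowing on the type field keeps each branch type-safe.

```typescript
// Hypothetical variants: the `type` tags are from the text, the info_list
// fields are illustrative assumptions.
type DataSource =
  | { type: 'upload_file'; info_list: { file_ids: string[] } }
  | { type: 'notion_import'; info_list: { workspace_id: string; page_ids: string[] } }
  | { type: 'website_crawl'; info_list: { provider: string; job_id: string; urls: string[] } }

// Count the items a source would ingest. Switching on `type` narrows
// `info_list` to the matching variant in each branch.
function countSourceItems(source: DataSource): number {
  switch (source.type) {
    case 'upload_file':
      return source.info_list.file_ids.length
    case 'notion_import':
      return source.info_list.page_ids.length
    case 'website_crawl':
      return source.info_list.urls.length
    default: {
      // Compile-time exhaustiveness check: adding a new variant without a
      // matching case makes this assignment fail to type-check.
      const unreachable: never = source
      throw new Error(`Unhandled data source: ${JSON.stringify(unreachable)}`)
    }
  }
}

console.log(
  countSourceItems({
    type: 'notion_import',
    info_list: { workspace_id: 'ws-1', page_ids: ['p1', 'p2'] },
  }),
)
```

The never assignment in the default branch is a common TypeScript idiom: a source type added to the union later is flagged at compile time in every handler that forgets it.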