
Principle:Langgenius Dify Document Upload

From Leeroopedia
Revision as of 17:37, 16 February 2026 by Admin (Auto-imported from principles/Langgenius_Dify_Document_Upload.md)
Knowledge Sources: Dify
Domains: RAG, Knowledge_Management, Frontend
Last Updated: 2026-02-12 00:00 GMT

Overview

Description

Document Upload is the process of ingesting source content into a Dify knowledge base (dataset) so that it can be chunked, embedded, and made available for retrieval. Documents are the primary unit of content within a dataset and can originate from multiple data sources: local file uploads, Notion workspace imports, or web crawling.

The upload process is tightly coupled with the document processing pipeline. When a document is created, the caller specifies not just the data source but also the full processing configuration: chunking mode, process rules, embedding model, and retrieval model. This declarative submission pattern means the entire processing intent is captured in a single request, allowing the backend to execute the ingestion pipeline asynchronously without further interaction.
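
As a rough illustration of this single-request pattern, the TypeScript sketch below models a creation payload that carries the data source together with the processing configuration. The interface name CreateDocumentSketch, the helper missingConfig, every nested field shape, and all example values are assumptions made for illustration; only the four configuration field names (process_rule, doc_form, embedding_model, retrieval_model) appear on this page, and Dify's real request type may differ.

```typescript
// Hypothetical, simplified payload shape; the real request type likely has more fields.
interface CreateDocumentSketch {
  data_source: { type: string };                // where the content comes from
  doc_form?: string;                            // chunking mode (example value is assumed)
  process_rule?: { mode: string };              // cleaning and splitting rules (shape assumed)
  embedding_model?: string;                     // model used to embed chunks (placeholder name)
  retrieval_model?: { search_method: string };  // retrieval settings (shape assumed)
}

// The four configuration parts that the description says travel with the request.
const REQUIRED_CONFIG = [
  "doc_form",
  "process_rule",
  "embedding_model",
  "retrieval_model",
] as const;

// Return the names of configuration parts that are still missing, so a caller
// can refuse to submit a request that does not capture the full processing intent.
function missingConfig(req: CreateDocumentSketch): string[] {
  return REQUIRED_CONFIG.filter((key) => req[key] === undefined);
}

// A request that states the whole processing intent up front.
const completeRequest: CreateDocumentSketch = {
  data_source: { type: "upload_file" },
  doc_form: "text_model",
  process_rule: { mode: "automatic" },
  embedding_model: "example-embedding-model",
  retrieval_model: { search_method: "semantic_search" },
};

// A request that would force the backend to ask for more information.
const partialRequest: CreateDocumentSketch = {
  data_source: { type: "upload_file" },
  doc_form: "text_model",
};

console.log(missingConfig(completeRequest));
console.log(missingConfig(partialRequest));
```

Checking completeness on the client side is one possible design choice; the point of the sketch is only that nothing about the processing intent is left for a later call.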

Dify provides two distinct creation paths:

  • createDocument -- Adds a document to an existing dataset, identified by datasetId.
  • createFirstDocument -- Creates a dataset and its first document atomically via the /datasets/init endpoint, streamlining the common case where users upload a file as part of initial knowledge base setup.
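
The choice between the two paths can be sketched as follows. This is a minimal illustration, not Dify's client code: the RequestDescriptor type and the three helper functions are hypothetical, and the /datasets/{datasetId}/documents path is an assumption. Only the /datasets/init path and the datasetId parameter come from the description above. The helpers build request descriptors rather than performing network calls, so the sketch runs without a server.

```typescript
// Hypothetical request descriptor used only for this illustration.
interface RequestDescriptor {
  method: "POST";
  path: string;
  body: Record<string, unknown>;
}

// Add a document to an existing dataset.
// The path shape is an assumption; the text above only names datasetId.
function createDocumentRequest(
  datasetId: string,
  body: Record<string, unknown>,
): RequestDescriptor {
  return {
    method: "POST",
    path: `/datasets/${encodeURIComponent(datasetId)}/documents`,
    body,
  };
}

// Create the dataset and its first document together via /datasets/init.
function createFirstDocumentRequest(
  body: Record<string, unknown>,
): RequestDescriptor {
  return { method: "POST", path: "/datasets/init", body };
}

// Pick the creation path depending on whether a dataset already exists.
function planUpload(
  body: Record<string, unknown>,
  datasetId?: string,
): RequestDescriptor {
  return datasetId === undefined
    ? createFirstDocumentRequest(body)
    : createDocumentRequest(datasetId, body);
}

const payload = { data_source: { type: "upload_file" } };
console.log(planUpload(payload).path);
console.log(planUpload(payload, "my-dataset").path);
```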

Usage

  • File-based ingestion -- Upload PDFs, text files, Word documents, or other supported formats from the local filesystem.
  • Notion integration -- Import pages from a connected Notion workspace, with workspace and page identifiers specified in the data source configuration.
  • Web crawling -- Ingest content from websites using providers like Firecrawl, Jina Reader, or WaterCrawl, passing crawl job results as the data source.
  • Initial setup flow -- Use createFirstDocument to combine dataset creation and first document upload into a single step, reducing round-trips.

Theoretical Basis

  • Multi-Source Abstraction -- The DataSource type uses a discriminated union pattern (type: 'upload_file' | 'notion_import' | 'website_crawl') with corresponding info_list variants. This allows a single API surface to handle heterogeneous ingestion sources while maintaining type safety.
  • Declarative Pipeline Configuration -- By bundling process_rule, doc_form, embedding_model, and retrieval_model into the creation request (CreateDocumentReq), the system adopts a desired-state model. The backend interprets the full specification and orchestrates the pipeline stages (parsing, cleaning, splitting, indexing) without requiring incremental instructions.
  • Batch Processing -- The response includes a batch identifier, enabling callers to track the progress of multiple documents submitted together. This batch abstraction supports both single-document and bulk-import workflows.
  • Atomicity of First-Document Creation -- The /datasets/init endpoint wraps dataset creation and document upload in a single transaction, so the initial setup flow cannot leave behind a dataset that has no documents.
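
The discriminated-union and batch ideas above can be sketched in TypeScript as follows. This is an illustration under stated assumptions: the three type literals come from the text, while the contents of each info_list variant, the DocumentStatusSketch record, and both helper functions are hypothetical and are not taken from Dify's codebase.

```typescript
// Hypothetical variant shapes; only the three `type` literals come from the text above.
type DataSourceSketch =
  | { type: "upload_file"; info_list: { file_ids: string[] } }
  | { type: "notion_import"; info_list: { workspace_id: string; page_ids: string[] } }
  | { type: "website_crawl"; info_list: { provider: string; job_id: string; urls: string[] } };

// Count how many items a data source will ingest. The switch narrows on `type`,
// so each branch only sees the info_list fields of its own variant.
function countItems(source: DataSourceSketch): number {
  switch (source.type) {
    case "upload_file":
      return source.info_list.file_ids.length;
    case "notion_import":
      return source.info_list.page_ids.length;
    case "website_crawl":
      return source.info_list.urls.length;
    default: {
      // Exhaustiveness check: this assignment stops type-checking if a new
      // variant is added to the union without a matching case.
      const unreachable: never = source;
      throw new Error(`unhandled data source: ${JSON.stringify(unreachable)}`);
    }
  }
}

// Hypothetical status record for a document created as part of a batch.
interface DocumentStatusSketch {
  id: string;
  batch: string;
  done: boolean;
}

// Group documents by batch identifier and report, per batch, whether every
// document in that batch has finished processing.
function batchCompletion(docs: DocumentStatusSketch[]): Map<string, boolean> {
  const result = new Map<string, boolean>();
  for (const doc of docs) {
    result.set(doc.batch, (result.get(doc.batch) ?? true) && doc.done);
  }
  return result;
}

const source: DataSourceSketch = {
  type: "notion_import",
  info_list: { workspace_id: "ws-1", page_ids: ["p1", "p2", "p3"] },
};
console.log(countItems(source));
```

Keying progress on the batch identifier rather than on individual document ids is what lets the same tracking code serve a single upload and a bulk import.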
