Implementation:Langgenius Dify CreateDocument

From Leeroopedia
Knowledge Sources  Domains                              Last Updated
Dify               RAG, Knowledge_Management, Frontend  2026-02-12 00:00 GMT

Overview

Description

createDocument and createFirstDocument are frontend service functions that submit documents to the Dify backend for ingestion into a knowledge base. Both functions accept a CreateDocumentReq body that describes the data source, processing rules, chunking mode, embedding model, and retrieval model configuration. The backend then asynchronously executes the full document processing pipeline (parsing, cleaning, splitting, embedding, indexing).

createDocument targets an existing dataset, while createFirstDocument atomically creates a new dataset and its first document through the /datasets/init endpoint.
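
The routing difference between the two functions can be sketched as a small helper. This is an illustration only: `resolveCreateEndpoint` is a hypothetical name that does not appear in web/service/datasets.ts, and the two paths are the ones shown in the Signature section of this page.

```typescript
// Hypothetical helper (not part of Dify): returns the path each service
// function posts to, based on whether a dataset already exists.
function resolveCreateEndpoint(datasetId?: string): string {
  // No dataset yet: /datasets/init creates the dataset and its first document.
  if (datasetId === undefined || datasetId === '')
    return '/datasets/init'
  // Existing dataset: documents are added under that dataset's ID.
  return `/datasets/${datasetId}/documents`
}

console.log(resolveCreateEndpoint('ds-abc123')) // /datasets/ds-abc123/documents
console.log(resolveCreateEndpoint()) // /datasets/init
```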

Usage

  • Use createDocument when adding documents to a dataset that already exists.
  • Use createFirstDocument during the initial knowledge base setup wizard where dataset creation and first document upload happen in one step.
  • The returned batch identifier can be used with fetchIndexingStatusBatch to monitor the progress of document processing.
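
A minimal sketch of how the batch identifier might drive progress monitoring. The status shape and the status names treated as terminal (`completed`, `error`, `paused`) are assumptions for illustration, not the confirmed return type of fetchIndexingStatusBatch. `fetchStatuses` is a hypothetical callback standing in for that call.

```typescript
// Assumed minimal shape of one document's status entry (illustrative only).
interface DocStatus {
  id: string
  indexing_status: string
}

// Assumption: these status values mean no further progress will occur.
const TERMINAL_STATUSES = new Set(['completed', 'error', 'paused'])

// True once every document in the batch has reached a terminal status.
function isBatchSettled(statuses: DocStatus[]): boolean {
  return statuses.length > 0
    && statuses.every(s => TERMINAL_STATUSES.has(s.indexing_status))
}

// Polls a caller-supplied fetcher until the batch settles or attempts run out.
// `fetchStatuses` is a hypothetical stand-in for a call such as
// fetchIndexingStatusBatch made with the dataset ID and batch identifier.
async function waitForBatch(
  fetchStatuses: () => Promise<DocStatus[]>,
  intervalMs = 2000,
  maxAttempts = 30,
): Promise<DocStatus[]> {
  let latest: DocStatus[] = []
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    latest = await fetchStatuses()
    if (isBatchSettled(latest))
      return latest
    // Wait before the next poll so the backend is not hammered.
    await new Promise<void>(resolve => setTimeout(resolve, intervalMs))
  }
  return latest // still unsettled; the caller decides how to handle the timeout
}
```

Passing the fetcher in as a parameter keeps the loop independent of the HTTP layer, which also makes it straightforward to exercise with a stub.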

Code Reference

Source Location

web/service/datasets.ts, lines 133-139.

Signature

export const createDocument = (
  { datasetId, body }: { datasetId: string, body: CreateDocumentReq }
): Promise<createDocumentResponse> => {
  return post<createDocumentResponse>(`/datasets/${datasetId}/documents`, { body })
}

export const createFirstDocument = (
  { body }: { body: CreateDocumentReq }
): Promise<createDocumentResponse> => {
  return post<createDocumentResponse>('/datasets/init', { body })
}

Import

import { createDocument, createFirstDocument } from '@/service/datasets'

I/O Contract

Inputs

Parameter  Type               Required                  Description
datasetId  string             Yes (for createDocument)  The ID of the target dataset.
body       CreateDocumentReq  Yes                       Full document creation specification.

CreateDocumentReq fields:

Field                     Type             Description
data_source               DataSource       Source configuration with type (upload_file, notion_import, website_crawl) and corresponding info_list.
doc_form                  ChunkingMode     Chunking strategy: text_model, qa_model, or hierarchical_model.
doc_language              string           Language of the document content (e.g., 'English').
process_rule              ProcessRule      Segmentation and pre-processing rules (separator, max_tokens, chunk_overlap, pre-processing toggles).
retrieval_model           RetrievalConfig  Retrieval configuration (search method, top_k, score threshold, reranking settings).
embedding_model           string           Name of the embedding model to use.
embedding_model_provider  string           Provider of the embedding model.
indexing_technique        IndexingType     Optional. Indexing technique override.
original_document_id      string           Optional. ID of an existing document being re-uploaded.
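
To make the field list concrete, the following sketch builds a request body for uploaded files. The interfaces are simplified local stand-ins inferred from the table above and the usage examples on this page, not the real CreateDocumentReq definition. They omit retrieval_model and indexing_technique for brevity, and `buildUploadRequest` and its defaults are hypothetical.

```typescript
// Simplified stand-ins for the real types (assumptions for illustration).
interface SketchUploadDataSource {
  type: 'upload_file'
  info_list: {
    data_source_type: 'upload_file'
    file_info_list: { file_ids: string[] }
  }
}

interface SketchProcessRule {
  mode: string
  rules: {
    pre_processing_rules: { id: string, enabled: boolean }[]
    segmentation: { separator: string, max_tokens: number, chunk_overlap?: number }
  }
}

interface SketchCreateDocumentReq {
  data_source: SketchUploadDataSource
  doc_form: 'text_model' | 'qa_model' | 'hierarchical_model'
  doc_language: string
  process_rule: SketchProcessRule
  embedding_model: string
  embedding_model_provider: string
  original_document_id?: string
}

// Hypothetical builder: fills in defaults copied from this page's examples
// and lets the caller override any top-level field.
function buildUploadRequest(
  fileIds: string[],
  overrides: Partial<SketchCreateDocumentReq> = {},
): SketchCreateDocumentReq {
  if (fileIds.length === 0)
    throw new Error('At least one uploaded file ID is required')
  return {
    data_source: {
      type: 'upload_file',
      info_list: {
        data_source_type: 'upload_file',
        file_info_list: { file_ids: fileIds },
      },
    },
    doc_form: 'text_model',
    doc_language: 'English',
    process_rule: {
      mode: 'custom',
      rules: {
        pre_processing_rules: [{ id: 'remove_extra_spaces', enabled: true }],
        segmentation: { separator: '\n\n', max_tokens: 500, chunk_overlap: 50 },
      },
    },
    embedding_model: 'text-embedding-ada-002',
    embedding_model_provider: 'openai',
    ...overrides, // caller-supplied fields replace the defaults above
  }
}
```

Centralizing defaults in one builder keeps call sites short, though a real caller would still need to supply retrieval_model as described in the table.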

Outputs

Returns Promise<createDocumentResponse>:

Field      Type                     Description
dataset    DataSet | undefined      The dataset object (present when using createFirstDocument).
batch      string                   Batch identifier for tracking processing progress.
documents  InitialDocumentDetail[]  Array of created document records with their initial indexing status.
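
As a rough sketch of consuming this response, the helper below gathers the identifiers a caller usually needs next. The types are simplified stand-ins, and the assumption that each document record carries an `id` field is not confirmed by this page. `extractTrackingInfo` is a hypothetical name.

```typescript
// Simplified stand-ins for the response types (assumptions for illustration).
interface SketchDocument {
  id: string
}

interface SketchCreateResponse {
  dataset?: { id: string }
  batch: string
  documents: SketchDocument[]
}

// Collects the dataset ID, batch identifier, and document IDs.
// `fallbackDatasetId` covers createDocument, where `dataset` may be absent
// and the caller already knows which dataset it targeted.
function extractTrackingInfo(res: SketchCreateResponse, fallbackDatasetId?: string) {
  const datasetId = res.dataset?.id ?? fallbackDatasetId
  if (datasetId === undefined)
    throw new Error('No dataset ID in response and no fallback provided')
  return {
    datasetId,
    batchId: res.batch,
    documentIds: res.documents.map(d => d.id),
  }
}
```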

Usage Examples

Uploading a file to an existing dataset

import { createDocument } from '@/service/datasets'

const response = await createDocument({
  datasetId: 'ds-abc123',
  body: {
    data_source: {
      type: 'upload_file',
      info_list: {
        data_source_type: 'upload_file',
        file_info_list: { file_ids: ['file-xyz789'] },
      },
    },
    doc_form: 'text_model',
    doc_language: 'English',
    process_rule: {
      mode: 'custom',
      rules: {
        pre_processing_rules: [{ id: 'remove_extra_spaces', enabled: true }],
        segmentation: { separator: '\n\n', max_tokens: 500, chunk_overlap: 50 },
        parent_mode: 'full-doc',
        subchunk_segmentation: { separator: '\n', max_tokens: 200 },
      },
    },
    retrieval_model: {
      search_method: 'semantic_search',
      top_k: 3,
      score_threshold_enabled: true,
      score_threshold: 0.5,
      reranking_enable: false,
      reranking_model: { reranking_provider_name: '', reranking_model_name: '' },
    },
    embedding_model: 'text-embedding-ada-002',
    embedding_model_provider: 'openai',
  },
})

console.log(response.batch) // Use to track indexing progress

Creating a dataset with its first document

import { createFirstDocument } from '@/service/datasets'

const response = await createFirstDocument({
  body: {
    data_source: {
      type: 'upload_file',
      info_list: {
        data_source_type: 'upload_file',
        file_info_list: { file_ids: ['file-first001'] },
      },
    },
    doc_form: 'qa_model',
    doc_language: 'English',
    process_rule: {
      mode: 'custom',
      rules: {
        pre_processing_rules: [],
        segmentation: { separator: '\n', max_tokens: 1000 },
        parent_mode: 'full-doc',
        subchunk_segmentation: { separator: '\n', max_tokens: 300 },
      },
    },
    retrieval_model: {
      search_method: 'hybrid_search',
      top_k: 5,
      score_threshold_enabled: false,
      score_threshold: 0,
      reranking_enable: true,
      reranking_model: { reranking_provider_name: 'cohere', reranking_model_name: 'rerank-english-v2.0' },
    },
    embedding_model: 'text-embedding-3-small',
    embedding_model_provider: 'openai',
  },
})

const newDatasetId = response.dataset?.id
