Implementation:Langgenius Dify CreateDocument

Knowledge Sources	Domains	Last Updated
Dify	RAG, Knowledge_Management, Frontend	2026-02-12 00:00 GMT

Overview

Description

createDocument and createFirstDocument are frontend service functions that submit documents to the Dify backend for ingestion into a knowledge base. Both functions accept a CreateDocumentReq body that describes the data source, processing rules, chunking mode, embedding model, and retrieval model configuration. The backend then asynchronously executes the full document processing pipeline (parsing, cleaning, splitting, embedding, indexing).

createDocument targets an existing dataset, while createFirstDocument atomically creates a new dataset and its first document through the /datasets/init endpoint.
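The endpoint choice described above can be sketched as a small pure function. This is an illustrative sketch, not code from the Dify repository: `resolveCreateEndpoint` is a hypothetical helper name, and only the two paths themselves are taken from the service functions documented on this page.

```typescript
// Hypothetical helper: picks the path that createDocument or
// createFirstDocument would post to, based on whether a dataset exists.
function resolveCreateEndpoint(datasetId?: string): string {
  // With a dataset ID, the document is added to that dataset.
  if (datasetId)
    return `/datasets/${datasetId}/documents`
  // Without one, the dataset and its first document are created together.
  return '/datasets/init'
}

console.log(resolveCreateEndpoint('ds-abc123')) // /datasets/ds-abc123/documents
console.log(resolveCreateEndpoint()) // /datasets/init
```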

Usage

Use createDocument when adding documents to a dataset that already exists.
Use createFirstDocument during the initial knowledge base setup wizard where dataset creation and first document upload happen in one step.
The returned batch identifier can be used with fetchIndexingStatusBatch to monitor the progress of document processing.
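Monitoring a batch could look roughly like the polling loop below. It is a minimal sketch under stated assumptions: the status names, the `BatchItem` shape, and the choice to treat `paused` as a stopping point are assumptions made for illustration, and the status fetcher is injected as a parameter so the sketch does not depend on the exact signature of fetchIndexingStatusBatch.

```typescript
// Assumed status values; the authoritative set is defined by the Dify backend.
type IndexingStatus =
  | 'waiting' | 'parsing' | 'cleaning' | 'splitting' | 'indexing'
  | 'completed' | 'error' | 'paused'

// Assumed minimal shape of one document's status entry.
type BatchItem = { id: string, indexing_status: IndexingStatus }

// Statuses after which a document no longer advances on its own.
const STOPPED: IndexingStatus[] = ['completed', 'error', 'paused']

// A batch is settled when it is non-empty and every document has stopped.
function isSettled(items: BatchItem[]): boolean {
  return items.length > 0
    && items.every(item => STOPPED.indexOf(item.indexing_status) !== -1)
}

// Calls the injected fetcher until the batch settles or attempts run out.
async function pollBatch(
  fetchStatus: () => Promise<BatchItem[]>,
  { intervalMs = 2500, maxAttempts = 120 } = {},
): Promise<BatchItem[]> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const items = await fetchStatus()
    if (isSettled(items))
      return items
    // Wait before asking again to avoid hammering the backend.
    await new Promise(resolve => setTimeout(resolve, intervalMs))
  }
  throw new Error('Indexing did not settle within the polling budget')
}
```

In practice, `fetchStatus` would be a thin wrapper that calls fetchIndexingStatusBatch with the dataset ID and the `batch` value and maps the result to the list of status entries.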

Code Reference

Source Location

web/service/datasets.ts, lines 133-139.

Signature

export const createDocument = (
  { datasetId, body }: { datasetId: string, body: CreateDocumentReq }
): Promise<createDocumentResponse> => {
  return post<createDocumentResponse>(`/datasets/${datasetId}/documents`, { body })
}

export const createFirstDocument = (
  { body }: { body: CreateDocumentReq }
): Promise<createDocumentResponse> => {
  return post<createDocumentResponse>('/datasets/init', { body })
}

Import

import { createDocument, createFirstDocument } from '@/service/datasets'

I/O Contract

Inputs

Parameter	Type	Required	Description
`datasetId`	`string`	Yes (for `createDocument`)	The ID of the target dataset.
`body`	`CreateDocumentReq`	Yes	Full document creation specification.

CreateDocumentReq fields:

Field	Type	Description
`data_source`	`DataSource`	Source configuration with `type` (`upload_file`, `notion_import`, `website_crawl`) and corresponding `info_list`.
`doc_form`	`ChunkingMode`	Chunking strategy: `text_model`, `qa_model`, or `hierarchical_model`.
`doc_language`	`string`	Language of the document content (e.g., `'English'`).
`process_rule`	`ProcessRule`	Segmentation and pre-processing rules (separator, max_tokens, chunk_overlap, pre-processing toggles).
`retrieval_model`	`RetrievalConfig`	Retrieval configuration (search method, top_k, score threshold, reranking settings).
`embedding_model`	`string`	Name of the embedding model to use.
`embedding_model_provider`	`string`	Provider of the embedding model.
`indexing_technique`	`IndexingType`	Optional. Indexing technique override.
`original_document_id`	`string`	Optional. ID of an existing document being re-uploaded.
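For file uploads, the nested `data_source` value can be assembled with a small builder. The sketch below is illustrative only: `UploadFileSource` is a simplified local type that mirrors the shape used in this page's usage examples, `uploadFileSource` is a hypothetical helper, and the rule that at least one file ID is required is an assumption, not a documented backend constraint.

```typescript
// Simplified local type mirroring the shape used in this page's examples;
// the real DataSource type in the Dify codebase may carry more fields.
type UploadFileSource = {
  type: 'upload_file'
  info_list: {
    data_source_type: 'upload_file'
    file_info_list: { file_ids: string[] }
  }
}

// Hypothetical helper: wraps uploaded file IDs in the nested structure.
function uploadFileSource(fileIds: string[]): UploadFileSource {
  // Assumption: an upload request without files is treated as a caller error.
  if (fileIds.length === 0)
    throw new Error('At least one file ID is required')
  return {
    type: 'upload_file',
    info_list: {
      data_source_type: 'upload_file',
      // Copy the array so later changes by the caller do not leak in.
      file_info_list: { file_ids: fileIds.slice() },
    },
  }
}

const source = uploadFileSource(['file-xyz789'])
console.log(source.info_list.file_info_list.file_ids.length) // 1
```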

Outputs

Returns Promise<createDocumentResponse>:

Field	Type	Description
`dataset`	`DataSet | undefined`	The dataset object (present when using `createFirstDocument`).
`batch`	`string`	Batch identifier for tracking processing progress.
`documents`	`InitialDocumentDetail[]`	Array of created document records with their initial indexing status.
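Because `dataset` may be undefined, callers that need a dataset ID for later requests have to handle both cases. The following is a hedged sketch: the `...Like` types are minimal local stand-ins for the real response types, and `trackingTarget` is a hypothetical helper name.

```typescript
// Minimal local stand-ins for the response types; only the fields used here.
type DataSetLike = { id: string }
type CreateDocumentResponseLike = {
  dataset?: DataSetLike
  batch: string
  documents: { id: string }[]
}

// Hypothetical helper: returns the dataset ID and batch needed for tracking.
function trackingTarget(
  response: CreateDocumentResponseLike,
  knownDatasetId?: string,
): { datasetId: string, batch: string } {
  // Prefer the dataset from the response, else the ID the caller already had.
  const datasetId = response.dataset?.id ?? knownDatasetId
  if (!datasetId)
    throw new Error('No dataset ID in the response and none supplied')
  return { datasetId, batch: response.batch }
}
```

With createFirstDocument the ID would come from the response, while with createDocument the caller would pass the `datasetId` it already used for the request.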

Usage Examples

Uploading a file to an existing dataset

import { createDocument } from '@/service/datasets'

const response = await createDocument({
  datasetId: 'ds-abc123',
  body: {
    data_source: {
      type: 'upload_file',
      info_list: {
        data_source_type: 'upload_file',
        file_info_list: { file_ids: ['file-xyz789'] },
      },
    },
    doc_form: 'text_model',
    doc_language: 'English',
    process_rule: {
      mode: 'custom',
      rules: {
        pre_processing_rules: [{ id: 'remove_extra_spaces', enabled: true }],
        segmentation: { separator: '\n\n', max_tokens: 500, chunk_overlap: 50 },
        parent_mode: 'full-doc',
        subchunk_segmentation: { separator: '\n', max_tokens: 200 },
      },
    },
    retrieval_model: {
      search_method: 'semantic_search',
      top_k: 3,
      score_threshold_enabled: true,
      score_threshold: 0.5,
      reranking_enable: false,
      reranking_model: { reranking_provider_name: '', reranking_model_name: '' },
    },
    embedding_model: 'text-embedding-ada-002',
    embedding_model_provider: 'openai',
  },
})

console.log(response.batch) // Use to track indexing progress

Creating a dataset with its first document

import { createFirstDocument } from '@/service/datasets'

const response = await createFirstDocument({
  body: {
    data_source: {
      type: 'upload_file',
      info_list: {
        data_source_type: 'upload_file',
        file_info_list: { file_ids: ['file-first001'] },
      },
    },
    doc_form: 'qa_model',
    doc_language: 'English',
    process_rule: {
      mode: 'custom',
      rules: {
        pre_processing_rules: [],
        segmentation: { separator: '\n', max_tokens: 1000 },
        parent_mode: 'full-doc',
        subchunk_segmentation: { separator: '\n', max_tokens: 300 },
      },
    },
    retrieval_model: {
      search_method: 'hybrid_search',
      top_k: 5,
      score_threshold_enabled: false,
      score_threshold: 0,
      reranking_enable: true,
      reranking_model: { reranking_provider_name: 'cohere', reranking_model_name: 'rerank-english-v2.0' },
    },
    embedding_model: 'text-embedding-3-small',
    embedding_model_provider: 'openai',
  },
})

const newDatasetId = response.dataset?.id

Related Pages

Principle:Langgenius_Dify_Document_Upload
