Principle:Langgenius Dify Indexing Progress Monitoring

From Leeroopedia
Knowledge Sources: Dify
Domains: RAG, Knowledge_Management, Frontend
Last Updated: 2026-02-12 00:00 GMT

Overview

Description

Indexing Progress Monitoring provides real-time visibility into the asynchronous document processing pipeline within Dify. After a document is uploaded, it passes through a multi-stage pipeline before becoming available for retrieval. Each stage represents a distinct transformation:

  1. waiting -- The document is queued for processing.
  2. parsing -- Raw content is extracted from the source file (PDF, DOCX, etc.).
  3. cleaning -- Pre-processing rules are applied (removing extra whitespace, URLs, etc.).
  4. splitting -- The cleaned text is segmented into chunks according to the configured process rules.
  5. indexing -- Chunks are embedded and written to the vector store.
  6. completed -- All chunks are indexed and the document is available for retrieval.

Two exceptional states can also occur:

  • error -- An unrecoverable failure occurred during any pipeline stage.
  • paused -- The user or system explicitly paused processing.
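
The statuses listed above can be written down as a small type, which keeps UI logic from comparing against misspelled strings. The following is a minimal sketch: the status names come from the lists above, while the type name IndexingStatus, the constant PIPELINE_ORDER, and the helpers stageIndex and isExceptional are hypothetical names chosen for illustration and are not taken from the Dify codebase.

```typescript
// Status names are taken from the lists above; all identifiers are illustrative.
type IndexingStatus =
  | "waiting"
  | "parsing"
  | "cleaning"
  | "splitting"
  | "indexing"
  | "completed"
  | "error"
  | "paused";

// Linear pipeline order. "error" and "paused" sit outside this progression.
const PIPELINE_ORDER: readonly IndexingStatus[] = [
  "waiting",
  "parsing",
  "cleaning",
  "splitting",
  "indexing",
  "completed",
];

// Position of a status in the linear pipeline, or -1 for the exceptional states.
function stageIndex(status: IndexingStatus): number {
  return PIPELINE_ORDER.indexOf(status);
}

// True for the two states that deviate from the linear progression.
function isExceptional(status: IndexingStatus): boolean {
  return status === "error" || status === "paused";
}
```

A stage indicator could then highlight every step whose index is less than or equal to stageIndex(current), and switch to a separate error or paused view when isExceptional(current) is true.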

Monitoring is available at two granularities: per-document (tracking a single document through its pipeline) and per-batch (tracking all documents submitted in a single upload operation).

Usage

  • Progress indicators -- The UI polls the indexing status endpoint to display a progress bar showing completed segments vs. total segments.
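  • Stage-level timestamps -- The parsing, cleaning, and splitting stages each record a completion timestamp (parsing_completed_at, cleaning_completed_at, splitting_completed_at), and completed_at marks the end of the whole pipeline. Differences between consecutive timestamps give per-stage durations for performance analysis and bottleneck detection.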
  • Error handling -- When the status transitions to error, the error field contains diagnostic information that can be surfaced to the user.
  • Pause/resume control -- Users can pause and resume indexing through companion endpoints (pauseDocIndexing, resumeDocIndexing); the status endpoint reports the resulting state, so the UI picks up the transition on its next poll.
  • Batch monitoring -- When multiple documents are uploaded together, fetchIndexingStatusBatch retrieves the status of all documents in the batch with a single request.
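
The progress bar and the timestamp analysis described above both reduce to simple arithmetic on one status response. The sketch below assumes the field names quoted in this section; the interface name IndexingStatusSnapshot, the helper names, and the choice of nullable epoch-second numbers for the timestamps are assumptions made for illustration, not a description of the actual response schema.

```typescript
// Field names follow the text above. The numeric epoch-second type and the
// use of null for "not reached yet" are assumptions of this sketch.
interface IndexingStatusSnapshot {
  completed_segments: number;
  total_segments: number;
  parsing_completed_at: number | null;
  cleaning_completed_at: number | null;
  splitting_completed_at: number | null;
  completed_at: number | null;
}

// Percentage of segments indexed, rounded and clamped to [0, 100].
// Returns 0 while the total is still unknown (before splitting finishes).
function progressPercent(s: IndexingStatusSnapshot): number {
  if (s.total_segments <= 0) return 0;
  const pct = (s.completed_segments / s.total_segments) * 100;
  return Math.min(100, Math.max(0, Math.round(pct)));
}

// Elapsed time between consecutive stage timestamps, or null when either
// endpoint is missing. Parsing is omitted because no start timestamp is
// named in this section.
function stageDurations(
  s: IndexingStatusSnapshot,
): Record<string, number | null> {
  const diff = (start: number | null, end: number | null): number | null =>
    start !== null && end !== null ? end - start : null;
  return {
    cleaning: diff(s.parsing_completed_at, s.cleaning_completed_at),
    splitting: diff(s.cleaning_completed_at, s.splitting_completed_at),
    indexing: diff(s.splitting_completed_at, s.completed_at),
  };
}
```

A stage whose duration is consistently much larger than the others is the natural first place to look for a bottleneck.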

Theoretical Basis

  • Finite State Machine -- The indexing pipeline is modeled as a deterministic state machine with a linear progression from waiting through completed, plus two deviation states (error and paused). This makes status transitions predictable and easy to reason about in UI logic.
  • Polling-Based Observability -- Since document processing is asynchronous (executed via Celery workers with Redis as the broker), the frontend cannot rely on synchronous responses. Instead, it adopts a polling pattern, periodically fetching the status until a terminal state (completed, error) is reached.
  • Segment-Level Granularity -- The completed_segments and total_segments fields provide fractional progress information within the indexing stage itself, enabling smooth progress bar updates rather than coarse stage-level jumps.
  • Batch Abstraction -- Grouping documents by batch ID decouples the upload action from individual document tracking, allowing the system to efficiently report on bulk operations without requiring per-document polling loops.
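
The polling and batch ideas above can be combined into one loop that fetches the whole batch and stops once every document is in a terminal state. This is a sketch under stated assumptions rather than Dify's frontend code: DocStatus, its id and indexing_status fields, allTerminal, and pollBatch are hypothetical names, and the fetchBatch parameter stands in for whatever call actually retrieves the batch status (for example a wrapper around fetchIndexingStatusBatch, whose exact signature is not given here).

```typescript
// Hypothetical shape of one entry in a batch status response.
interface DocStatus {
  id: string;
  indexing_status: string;
}

// Terminal states as named in the text above.
const TERMINAL_STATES = new Set<string>(["completed", "error"]);

// True once every document in a non-empty batch has reached a terminal state.
function allTerminal(docs: DocStatus[]): boolean {
  return docs.length > 0 && docs.every((d) => TERMINAL_STATES.has(d.indexing_status));
}

// Polls the injected fetcher until the batch is terminal or maxAttempts is
// reached, waiting intervalMs between attempts. Returns the last response.
async function pollBatch(
  fetchBatch: () => Promise<DocStatus[]>,
  intervalMs: number,
  maxAttempts: number,
): Promise<DocStatus[]> {
  let latest: DocStatus[] = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    latest = await fetchBatch();
    if (allTerminal(latest)) return latest;
    if (attempt < maxAttempts - 1) {
      await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  return latest;
}
```

The attempt cap matters because a paused document is not terminal under this definition, so an uncapped loop would keep polling until the user resumes or cancels. Injecting the fetcher also keeps the loop easy to exercise with canned responses.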
