Principle:Langgenius Dify Indexing Progress Monitoring
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| Dify | RAG, Knowledge_Management, Frontend | 2026-02-12 00:00 GMT |
Overview
Description
Indexing Progress Monitoring provides real-time visibility into the asynchronous document processing pipeline within Dify. After a document is uploaded, it passes through a multi-stage pipeline before becoming available for retrieval. Each stage represents a distinct transformation:
- waiting -- The document is queued for processing.
- parsing -- Raw content is extracted from the source file (PDF, DOCX, etc.).
- cleaning -- Pre-processing rules are applied (removing extra whitespace, URLs, etc.).
- splitting -- The cleaned text is segmented into chunks according to the configured process rules.
- indexing -- Chunks are embedded and written to the vector store.
- completed -- All chunks are indexed and the document is available for retrieval.
Two exceptional states can also occur:
- error -- An unrecoverable failure occurred during any pipeline stage.
- paused -- The user or system explicitly paused processing.
Monitoring is available at two granularities: per-document (tracking a single document through its pipeline) and per-batch (tracking all documents submitted in a single upload operation).
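The stages and exceptional states listed above can be sketched as a small TypeScript model. This is only an illustration: the type name `IndexingStatus` and the helpers `stageIndex`, `isExceptional`, and `isTerminal` are hypothetical and are not taken from the Dify codebase. Only the status strings come from the description.

```typescript
// Status strings as listed in the description. The type and helper names
// below are hypothetical and exist only for this sketch.
type IndexingStatus =
  | 'waiting'
  | 'parsing'
  | 'cleaning'
  | 'splitting'
  | 'indexing'
  | 'completed'
  | 'error'
  | 'paused';

// Linear order of the normal pipeline, from upload to availability.
const PIPELINE_ORDER: readonly IndexingStatus[] = [
  'waiting',
  'parsing',
  'cleaning',
  'splitting',
  'indexing',
  'completed',
];

// Position of a status in the normal pipeline, or -1 for error/paused.
function stageIndex(status: IndexingStatus): number {
  return PIPELINE_ORDER.indexOf(status);
}

// The two states that deviate from the linear progression.
function isExceptional(status: IndexingStatus): boolean {
  return status === 'error' || status === 'paused';
}

// States after which no further processing is expected.
// 'paused' is treated as non-terminal because indexing can be resumed.
function isTerminal(status: IndexingStatus): boolean {
  return status === 'completed' || status === 'error';
}
```

A UI could use `stageIndex` to highlight the current step in a stepper and `isExceptional` to switch to an error or paused banner.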
Usage
- Progress indicators -- The UI polls the indexing status endpoint to display a progress bar showing completed segments vs. total segments.
- Stage-level timestamps -- Each stage records a completion timestamp (`parsing_completed_at`, `cleaning_completed_at`, `splitting_completed_at`, `completed_at`), enabling performance analysis and bottleneck detection.
- Error handling -- When the status transitions to `error`, the `error` field contains diagnostic information that can be surfaced to the user.
- Pause/resume control -- Users can pause and resume indexing using companion endpoints (`pauseDocIndexing`, `resumeDocIndexing`), and the status endpoint reflects these state transitions in real time.
- Batch monitoring -- When multiple documents are uploaded together, `fetchIndexingStatusBatch` retrieves the status of all documents in the batch with a single request.
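As a rough illustration of the progress-bar and timestamp use cases above, the following sketch derives a percentage from the segment counters and per-stage durations from the completion timestamps. The field names come from this page. Everything else is an assumption: the interface name `IndexingStatusPayload`, the helper names, and the use of numeric timestamps (for example Unix seconds) are not confirmed by the source.

```typescript
// Hypothetical shape of one status response. Field names follow this page;
// numeric timestamps are an assumption made for the sketch.
interface IndexingStatusPayload {
  completed_segments: number;
  total_segments: number;
  parsing_completed_at: number | null;
  cleaning_completed_at: number | null;
  splitting_completed_at: number | null;
  completed_at: number | null;
  error: string | null;
}

type SegmentCounts = Pick<IndexingStatusPayload, 'completed_segments' | 'total_segments'>;

// Percentage of segments indexed so far, clamped to 0..100.
// Returns 0 while the total is still unknown (zero or negative).
function progressPercent(counts: SegmentCounts): number {
  if (counts.total_segments <= 0) return 0;
  const ratio = counts.completed_segments / counts.total_segments;
  return Math.min(100, Math.max(0, Math.round(ratio * 100)));
}

// Aggregate progress for a batch: sum the counters across all documents.
function batchProgressPercent(docs: SegmentCounts[]): number {
  let completed = 0;
  let total = 0;
  for (const doc of docs) {
    completed += doc.completed_segments;
    total += doc.total_segments;
  }
  return progressPercent({ completed_segments: completed, total_segments: total });
}

// Difference between two timestamps, or null if either is missing.
function elapsed(start: number | null, end: number | null): number | null {
  return start !== null && end !== null ? end - start : null;
}

// Time spent in each stage that has a recorded predecessor timestamp.
function stageDurations(p: IndexingStatusPayload) {
  return {
    cleaning: elapsed(p.parsing_completed_at, p.cleaning_completed_at),
    splitting: elapsed(p.cleaning_completed_at, p.splitting_completed_at),
    indexing: elapsed(p.splitting_completed_at, p.completed_at),
  };
}
```

Summing counters across a batch weights each document by its segment count, so one large document does not appear finished just because several small ones are.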
Theoretical Basis
- Finite State Machine -- The indexing pipeline is modeled as a deterministic state machine with a linear progression from `waiting` through `completed`, plus two deviation states (`error` and `paused`). This makes status transitions predictable and easy to reason about in UI logic.
- Polling-Based Observability -- Since document processing is asynchronous (executed via Celery workers with Redis as the broker), the frontend cannot rely on synchronous responses. Instead, it adopts a polling pattern, periodically fetching the status until a terminal state (`completed`, `error`) is reached.
- Segment-Level Granularity -- The `completed_segments` and `total_segments` fields provide fractional progress information within the indexing stage itself, enabling smooth progress bar updates rather than coarse stage-level jumps.
- Batch Abstraction -- Grouping documents by batch ID decouples the upload action from individual document tracking, allowing the system to efficiently report on bulk operations without requiring per-document polling loops.
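The polling pattern described above can be sketched as a loop that stops on a terminal state. This is a minimal sketch under stated assumptions, not Dify's actual frontend code: `pollUntilTerminal`, `StatusSnapshot`, the `indexing_status` field name, the 2500 ms interval, and the attempt cap are all hypothetical. The status fetcher is injected, so a real caller would wrap whichever status request it uses (for example the batch request mentioned earlier) without this sketch assuming its signature.

```typescript
type IndexingStatus =
  | 'waiting' | 'parsing' | 'cleaning' | 'splitting'
  | 'indexing' | 'completed' | 'error' | 'paused';

// Hypothetical snapshot shape; `indexing_status` is an assumed field name.
interface StatusSnapshot {
  indexing_status: IndexingStatus;
  completed_segments: number;
  total_segments: number;
}

// Terminal states as named in the text: polling stops on these.
function isTerminal(status: IndexingStatus): boolean {
  return status === 'completed' || status === 'error';
}

// Default delay between polls, based on setTimeout.
function defaultSleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Repeatedly fetch the status, report each snapshot to the caller, and
// resolve with the first terminal snapshot. Rejects after `maxAttempts`
// polls so that a paused or stuck document cannot keep the loop alive forever.
async function pollUntilTerminal(
  fetchStatus: () => Promise<StatusSnapshot>,
  onUpdate: (snapshot: StatusSnapshot) => void,
  intervalMs: number = 2500,
  maxAttempts: number = 240,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<StatusSnapshot> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const snapshot = await fetchStatus();
    onUpdate(snapshot); // e.g. update a progress bar
    if (isTerminal(snapshot.indexing_status)) return snapshot;
    await sleep(intervalMs);
  }
  throw new Error('Polling stopped before a terminal state was reached');
}
```

Injecting `fetchStatus` and `sleep` keeps the loop testable without a network or real timers. A production version would likely also stop or slow down while the status is `paused`.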