

Principle:Langgenius Dify Indexing Method Selection

From Leeroopedia


Knowledge Sources
Domains: RAG, Indexing, Information Retrieval
Last Updated: 2026-02-08 00:00 GMT

Overview

Indexing method selection is the process of choosing between embedding-based (high-quality) and keyword-based (economy) indexing strategies, along with the associated search method, to balance retrieval quality against cost and latency.

Description

Once documents have been chunked into segments, those segments must be indexed so they can be efficiently retrieved at query time. The indexing method determines how segments are represented in the search index and, by extension, how queries are matched against them.

Dify exposes two indexing techniques:

  • High-quality (embedding-based) -- each segment is transformed into a dense vector using an embedding model. Queries are similarly embedded, and retrieval uses vector similarity (cosine or dot product). This approach captures semantic meaning and handles paraphrases, synonyms, and conceptual similarity.
  • Economy (keyword-based) -- segments are indexed using an inverted index built on token frequencies (TF-IDF or BM25). Retrieval matches on exact or stemmed keywords. This approach is fast, inexpensive (no embedding model required), and effective when queries use domain-specific terminology that appears verbatim in the source text.

Within the high-quality technique, Dify further offers three search methods:

Search Method    | Mechanism                              | Best For
Semantic search  | Pure vector similarity                 | Natural-language questions, conceptual queries
Full-text search | Keyword-based ranking (inverted index) | Exact term matching, technical identifiers
Hybrid search    | Combines vector and keyword scores     | General-purpose retrieval where both meaning and terminology matter

The economy technique always uses an inverted index, so its search method is implicitly keyword-based (full-text) rather than one of the three choices above.

Usage

Select an indexing method when:

  • Creating a knowledge base -- the user chooses between high-quality and economy during the initial setup wizard.
  • Evaluating cost vs. quality trade-offs -- high-quality indexing requires an embedding model (and therefore API credits or GPU resources), while economy indexing runs locally with no model dependency.
  • Configuring retrieval behavior -- after choosing high-quality indexing, the user selects a search method that determines how queries are matched.

Theoretical Basis

Embedding-Based Indexing

Embedding-based indexing works by projecting both documents and queries into a shared high-dimensional vector space:

Segment text  --[Embedding Model]-->  Vector (d dimensions)
Query text    --[Embedding Model]-->  Vector (d dimensions)

Similarity = cosine(query_vector, segment_vector)
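
The comparison step above can be sketched in a few lines of Python. This is a minimal illustration, not Dify's implementation: the vectors are hand-written stand-ins for embedding model output, and the helper names `cosine` and `rank_segments` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as no similarity
    return dot / (norm_a * norm_b)


def rank_segments(query_vector, segment_vectors, top_k=3):
    """Return the top_k (segment_id, similarity) pairs, best first."""
    scored = [(seg_id, cosine(query_vector, vec))
              for seg_id, vec in segment_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors (assumption: a real embedding model would
# produce these, usually with hundreds or thousands of dimensions).
segment_vectors = {
    "Segment_3": [0.9, 0.1, 0.0],
    "Segment_8": [0.0, 1.0, 0.0],
    "Segment_17": [0.7, 0.7, 0.0],
}
query_vector = [1.0, 0.0, 0.0]

ranking = rank_segments(query_vector, segment_vectors, top_k=2)
```

A production system would normally delegate this search to a vector database with an approximate nearest-neighbor index instead of scoring every segment.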

Advantages:

  • Captures semantic similarity -- "automobile" matches "car"
  • Robust to paraphrasing and word-order variation
  • Supports cross-lingual retrieval when multilingual models are used

Disadvantages:

  • Requires an embedding model (cost, latency)
  • May miss exact keyword matches that a lexical index would catch
  • A vector index typically consumes more memory than an inverted index, because every segment stores a d-dimensional vector

Keyword-Based Indexing

Keyword-based indexing builds an inverted index mapping tokens to the segments that contain them:

Token "retrieval"  -->  [Segment_3, Segment_17, Segment_42]
Token "augmented"  -->  [Segment_3, Segment_8]

Score(query, segment) = BM25(query_tokens, segment_tokens)
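
As a rough sketch of the mapping and scoring above, the following Python builds an inverted index and ranks segments with one common BM25 variant. The whitespace tokenizer, the parameter defaults (`k1=1.5`, `b=0.75`), the sample segments, and all function names are assumptions made for illustration; none of this is taken from Dify's code.

```python
import math
from collections import defaultdict


def tokenize(text):
    """Naive lowercase whitespace tokenizer (no stemming or stop words)."""
    return text.lower().split()


def build_inverted_index(segments):
    """Map each token to the set of segment ids that contain it."""
    index = defaultdict(set)
    for seg_id, text in segments.items():
        for token in tokenize(text):
            index[token].add(seg_id)
    return index


def bm25_scores(query, segments, index, k1=1.5, b=0.75):
    """Score segments against the query; return (segment_id, score), best first."""
    n = len(segments)
    if n == 0:
        return []
    tokens = {sid: tokenize(text) for sid, text in segments.items()}
    avg_len = sum(len(t) for t in tokens.values()) / n
    scores = defaultdict(float)
    for term in set(tokenize(query)):
        postings = index.get(term, set())
        if not postings:
            continue  # term appears in no segment
        df = len(postings)
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        for sid in postings:
            tf = tokens[sid].count(term)
            norm = k1 * (1 - b + b * len(tokens[sid]) / avg_len)
            scores[sid] += idf * tf * (k1 + 1) / (tf + norm)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical segment texts, chosen to mirror the token mapping above.
segments = {
    "Segment_3": "retrieval augmented generation",
    "Segment_8": "augmented reality headset",
    "Segment_17": "vector retrieval with a car dataset",
}
index = build_inverted_index(segments)
results = bm25_scores("retrieval augmented", segments, index)
no_match = bm25_scores("automobile", segments, index)
```

Here `results` ranks Segment_3 first because it contains both query tokens, while `no_match` is empty: the query "automobile" shares no token with the segment that mentions "car", which is the vocabulary-mismatch weakness of lexical indexing.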

Advantages:

  • No model dependency -- fast, cheap, and deterministic
  • Excellent for exact-match scenarios (error codes, product IDs, proper nouns)
  • Well-understood ranking algorithms (BM25)

Disadvantages:

  • No semantic understanding -- "car" does not match "automobile"
  • Sensitive to vocabulary mismatch between query and document

Hybrid Search

Hybrid search fuses the results of both vector and keyword retrieval:

1. Run semantic search   -->  ranked list S
2. Run full-text search  -->  ranked list K
3. Merge S and K using weighted scoring or reciprocal rank fusion
4. Return top-k results from the merged list

This approach mitigates the weaknesses of each individual method: semantic search covers paraphrases, while keyword search catches exact matches. The relative weighting can be tuned to favor one signal over the other depending on the domain.
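
The merge in step 3 can be illustrated with reciprocal rank fusion, one of the two options named above. The sketch below is an assumption-laden example rather than Dify's merging logic: the function name is hypothetical, the smoothing constant `k=60` is a commonly used default, and the two input lists are made-up rankings.

```python
def reciprocal_rank_fusion(ranked_lists, k=60, top_k=5):
    """Fuse several ranked lists of segment ids into a single ranking.

    Each segment earns 1 / (k + rank) from every list it appears in,
    so items ranked well by several retrievers rise to the top.
    """
    scores = {}
    for ranked in ranked_lists:
        for rank, seg_id in enumerate(ranked, start=1):
            scores[seg_id] = scores.get(seg_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for determinism.
    merged = sorted(scores, key=lambda seg_id: (-scores[seg_id], seg_id))
    return merged[:top_k]


# Made-up outputs of steps 1 and 2.
semantic = ["Segment_17", "Segment_3", "Segment_42"]  # ranked list S
keyword = ["Segment_3", "Segment_8", "Segment_17"]    # ranked list K

fused = reciprocal_rank_fusion([semantic, keyword], top_k=3)
```

Because rank fusion uses only positions, it avoids having to normalize cosine similarities and BM25 scores onto a common scale. A weighted-score merge would need that normalization but allows the relative weighting described above.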

Display and Localization

Because indexing technique and method names are user-facing, they must be translated and formatted for display. The formatting logic maps internal enum values to localized labels:

"high_quality"     -->  "High Quality"  (or locale equivalent)
"economy"          -->  "Economy"
"semantic_search"  -->  "Semantic Search"
"full_text_search" -->  "Full-Text Search"
"hybrid_search"    -->  "Hybrid Search"

When the technique is economy, the method is always displayed as Inverted Index regardless of the stored method value.
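
A minimal sketch of this formatting rule follows. The enum strings come from the mapping above; the function name, the dictionary names, and the English-only labels are assumptions, and a real implementation would resolve labels through the application's translation layer.

```python
# English labels only; a localized build would look these up per locale.
TECHNIQUE_LABELS = {
    "high_quality": "High Quality",
    "economy": "Economy",
}
METHOD_LABELS = {
    "semantic_search": "Semantic Search",
    "full_text_search": "Full-Text Search",
    "hybrid_search": "Hybrid Search",
}
ECONOMY_METHOD_LABEL = "Inverted Index"


def format_indexing_labels(technique, method=None):
    """Return (technique_label, method_label) for display.

    Unknown values fall back to the raw string so nothing is hidden.
    """
    technique_label = TECHNIQUE_LABELS.get(technique, technique)
    if technique == "economy":
        # The stored method is ignored for economy indexing.
        return technique_label, ECONOMY_METHOD_LABEL
    return technique_label, METHOD_LABELS.get(method, method or "")


high_quality_labels = format_indexing_labels("high_quality", "hybrid_search")
economy_labels = format_indexing_labels("economy", "semantic_search")
```

In the second call the stored method is `semantic_search`, yet the economy rule overrides it and the method label comes back as "Inverted Index".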
