Principle: Langgenius Dify Retrieval Testing
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| Dify | RAG, Knowledge_Management, Frontend | 2026-02-12 00:00 GMT |
Overview
Description
Retrieval Testing (also called hit testing) is the evaluation mechanism that allows users to verify how well a knowledge base responds to natural language queries before deploying it in a production application. By submitting a query against a specific dataset and observing which segments are returned, along with their relevance scores and positions, users can iteratively tune their chunking strategy, retrieval configuration, and content quality.
Retrieval testing operates as a sandboxed query environment: it uses the same retrieval pipeline that production applications use, but results are presented in a diagnostic view with detailed scoring, segment content, positional data (t-SNE coordinates for visualization), and child chunk information for hierarchical datasets.
The test results are also persisted as hit testing records, enabling users to review past queries and track retrieval quality over time.
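As a concrete illustration, a hit-testing call can be sketched as a small HTTP client. This is a hedged sketch, not Dify's documented client: the base URL, the `datasets/{id}/hit-testing` path, and the `retrieval_model` field names are assumptions that should be checked against your deployment's API reference.

```python
import json
import os
import urllib.request

# Assumed API base; adjust for self-hosted deployments.
API_BASE = os.environ.get("DIFY_API_BASE", "https://api.dify.ai/v1")


def build_hit_testing_request(dataset_id: str, query: str,
                              search_method: str = "semantic_search",
                              top_k: int = 3):
    """Build the (url, body) pair for a hit-testing call.

    The payload shape mirrors the retrieval_model settings described
    in this document; exact field names are assumptions.
    """
    url = f"{API_BASE}/datasets/{dataset_id}/hit-testing"
    body = json.dumps({
        "query": query,
        "retrieval_model": {
            "search_method": search_method,
            "reranking_enable": False,
            "top_k": top_k,
            "score_threshold_enabled": False,
        },
    }).encode("utf-8")
    return url, body


def run_hit_testing(dataset_id: str, query: str, api_key: str) -> dict:
    """POST the query and return the parsed diagnostic response."""
    url, body = build_hit_testing_request(dataset_id, query)
    req = urllib.request.Request(
        url, data=body, method="POST",
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Because the same payload is what production applications send through the retrieval pipeline, iterating on it here is a safe way to tune settings before deployment.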
Usage
- Retrieval validation -- After indexing a document, run test queries to confirm that the expected segments are returned with high relevance scores.
- Configuration tuning -- Experiment with different `retrieval_model` settings (search method, top_k, score threshold, reranking) and compare results across configurations without modifying the dataset's default settings.
- Search method comparison -- Compare results from `semantic_search`, `full_text_search`, `hybrid_search`, and `keyword_search` to determine the optimal method for the content type.
- Chunk quality assessment -- Identify segments with low scores or irrelevant content, then use segment management operations to update, disable, or remove them.
- Visualization -- Use the t-SNE coordinates (`tsne_position`) in the response to render a 2D scatter plot of query and segment embeddings, providing an intuitive view of semantic similarity.
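The visualization step above can be sketched as follows. This assumes each hit-testing record carries a two-dimensional `tsne_position` dict shaped like `{"x": 0.1, "y": -0.3}`; the actual response shape may differ across Dify versions, so treat the field access as an assumption.

```python
from typing import Iterable, List, Tuple


def extract_points(records: Iterable[dict]) -> List[Tuple[float, float]]:
    """Collect (x, y) t-SNE coordinates from hit-testing records,
    skipping records that lack positional data."""
    return [(r["tsne_position"]["x"], r["tsne_position"]["y"])
            for r in records if r.get("tsne_position")]


def plot_query_and_segments(query_pos: dict, records: list) -> None:
    """Render query vs. segment embeddings on a 2D scatter plot."""
    import matplotlib.pyplot as plt  # optional plotting dependency

    xs, ys = zip(*extract_points(records))
    plt.scatter(xs, ys, label="segments")
    plt.scatter([query_pos["x"]], [query_pos["y"]],
                marker="*", s=200, label="query")
    plt.legend()
    plt.title("Query vs. retrieved segments (t-SNE projection)")
    plt.show()
```

Segments plotted close to the query star are semantically similar; outliers with high scores are worth inspecting for chunking problems.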
Theoretical Basis
- Search Method Taxonomy -- Dify supports four retrieval methods, each with distinct trade-offs:
- Semantic search -- Computes cosine similarity between the query embedding and segment embeddings. Best for capturing meaning-based relevance.
- Full-text search -- Uses inverted index matching (BM25 or similar). Best for exact term matching and keyword-rich content.
- Hybrid search -- Combines semantic and full-text search with configurable weights, leveraging the strengths of both approaches.
- Keyword search -- Matches against extracted keywords stored with each segment.
- Score Thresholding -- The `score_threshold_enabled` and `score_threshold` parameters implement a precision gate: only segments scoring above the threshold are returned. This filters out low-confidence matches that could degrade downstream LLM answer quality.
- Top-K Truncation -- The `top_k` parameter limits the number of returned segments, controlling the trade-off between recall (more segments) and context-window efficiency (fewer, more relevant segments).
- Reranking -- When enabled, a secondary model re-scores the initially retrieved segments, improving precision by applying a more powerful cross-encoder or weighted scoring model. Dify supports both model-based reranking and weighted score reranking with configurable vector/keyword weights.
- t-SNE Projection -- The response includes two-dimensional t-SNE positions for both the query and each retrieved segment. This dimensionality reduction enables visual inspection of how semantically close the results are to the query in embedding space.
- Child Chunk Granularity -- For hierarchical datasets, hit testing returns both parent segments and their relevant child chunks with individual scores, supporting the coarse-to-fine retrieval pattern.
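The weighted reranking, score thresholding, and top-k truncation described above compose into a simple post-processing pipeline. The sketch below is illustrative only: the weights, field names, and ordering of steps are assumptions, not Dify's internal implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    """A retrieved segment with its two raw relevance signals."""
    segment_id: str
    vector_score: float   # e.g. cosine similarity from semantic search
    keyword_score: float  # e.g. normalized full-text/keyword match score


def weighted_rerank(cands: List[Candidate],
                    vector_weight: float = 0.7,
                    keyword_weight: float = 0.3,
                    top_k: int = 3,
                    score_threshold: Optional[float] = None
                    ) -> List[Tuple[str, float]]:
    """Combine both signals with configurable weights, apply the
    precision gate, then truncate to top_k."""
    scored = [(c.segment_id,
               vector_weight * c.vector_score
               + keyword_weight * c.keyword_score)
              for c in cands]
    if score_threshold is not None:
        # Precision gate: drop low-confidence matches.
        scored = [s for s in scored if s[1] >= score_threshold]
    scored.sort(key=lambda s: s[1], reverse=True)
    # Recall vs. context-window trade-off: keep only the best top_k.
    return scored[:top_k]
```

Adjusting `vector_weight` toward 1.0 approximates pure semantic search, while shifting weight to `keyword_score` favors exact term matching, which mirrors the hybrid-search trade-off described in the taxonomy above.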