Principle:Langgenius Dify Knowledge Base Verification

Knowledge Sources	Dify; Dify Knowledge Base Docs; RAGAS: Automated Evaluation of Retrieval Augmented Generation
Domains	RAG, Quality Assurance, Retrieval Testing
Last Updated	2026-02-08 00:00 GMT

Overview

Knowledge base verification is the practice of validating retrieval quality through test queries, ensuring that a knowledge base returns relevant, accurately scored results before it is used in production applications.

Description

Building a knowledge base involves many configuration decisions: data source selection, chunking parameters, indexing method, and retrieval settings. Each of these decisions affects the quality of results that the knowledge base returns. Knowledge base verification (also called hit testing or retrieval debugging) is the final quality gate in the creation workflow -- it allows an operator to submit test queries and inspect the retrieved chunks before connecting the knowledge base to a live application.

Verification serves several purposes:

Relevance validation -- confirming that expected chunks appear in the result set for representative queries.
Scoring analysis -- examining relevance scores to understand how confidently the retriever matches queries to segments.
Configuration feedback -- providing empirical evidence for tuning decisions (adjusting top-k, threshold, chunk size, or search method).
Regression detection -- after re-processing or configuration changes, verifying that retrieval quality has not degraded.

Without verification, operators deploy knowledge bases blindly, discovering quality issues only when end users report incorrect or irrelevant responses from the LLM application.

Usage

Perform knowledge base verification when:

A new knowledge base has finished indexing -- before connecting it to any application, run a suite of representative queries to validate retrieval quality.
Processing configuration has changed -- after adjusting chunk size, overlap, or separators, re-run verification to measure the impact.
Retrieval configuration has changed -- after modifying search method, reranking settings, or score thresholds, re-run the same queries to confirm that results have improved rather than regressed.
New documents have been added -- confirm that the new content is retrievable and does not displace previously relevant results.
Users report quality issues -- use hit testing to diagnose whether the problem is in retrieval (wrong chunks returned) or generation (LLM misinterpreting correct chunks).

Theoretical Basis

Hit Testing Model

Hit testing submits a query against a specific dataset and returns the matched segments with their scores:

Input:
  - dataset_id: identifies the knowledge base to test
  - query_text: the test query string
  - retrieval_model: the retrieval configuration to use

Process:
  1. Encode query_text using the dataset's embedding model (if semantic/hybrid)
  2. Search the dataset's index using the specified retrieval configuration
  3. Apply reranking if configured
  4. Apply score threshold filtering
  5. Return top-k results

Output:
  - records[]: array of matched segments, each containing:
      - segment text (the actual chunk content)
      - relevance score (0.0 to 1.0)
      - word count
      - hit count (how many times this segment has been matched)
      - segment metadata (document name, position, etc.)
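
The last two process steps (threshold filtering and top-k selection) can be illustrated with a minimal sketch. It does not call Dify; the `Record` class, its field names, and the `post_process` helper are hypothetical stand-ins for whatever the retriever returns, and the sample scores are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One hit testing result; the field names are illustrative only."""
    segment_text: str
    score: float
    document_name: str


def post_process(candidates, score_threshold=None, top_k=3):
    """Sort by score, drop records below the threshold, keep the top-k."""
    ranked = sorted(candidates, key=lambda r: r.score, reverse=True)
    if score_threshold is not None:
        ranked = [r for r in ranked if r.score >= score_threshold]
    return ranked[:top_k]


# Made-up candidates, as if returned by the search and reranking steps.
candidates = [
    Record("Refunds are issued within 14 days.", 0.91, "policy.md"),
    Record("Our office is closed on public holidays.", 0.22, "hours.md"),
    Record("Refund requests require an order number.", 0.78, "policy.md"),
    Record("Shipping takes 3 to 5 business days.", 0.41, "shipping.md"),
]

hits = post_process(candidates, score_threshold=0.5, top_k=3)
for r in hits:
    print(f"{r.score:.2f}  {r.document_name}  {r.segment_text}")
```

With a threshold of 0.5, only two of the four candidates survive even though top-k is 3, which is the behavior behind the "No results returned" symptom when the threshold is set too high.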

Interpreting Results

Score Distribution

A healthy knowledge base produces a clear separation between relevant and irrelevant segments:

Good distribution:          Poor distribution:
Score                       Score
1.0 |  ██                   1.0 |
0.8 |  ██ ██                0.8 |
0.6 |  ██ ██ ██             0.6 |  ██ ██ ██ ██ ██
0.4 |                       0.4 |  ██ ██ ██ ██ ██
0.2 |  ██ ██                0.2 |  ██ ██ ██ ██ ██
    +----------              +----------
     Segments                 Segments

In the good distribution, highly relevant segments score well above the threshold and irrelevant ones score well below. In the poor distribution, scores are clustered in the mid-range, making it hard to distinguish relevant from irrelevant results.
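
One rough way to quantify this separation is to look for the widest gap between adjacent scores in a result set. The sketch below is a heuristic chosen for illustration, not something Dify computes; the `largest_gap` helper and the sample scores are assumptions.

```python
def largest_gap(scores):
    """Return (gap, lower, upper) for the widest gap between adjacent sorted scores."""
    ordered = sorted(scores)
    if len(ordered) < 2:
        return (0.0, None, None)
    gaps = [(hi - lo, lo, hi) for lo, hi in zip(ordered, ordered[1:])]
    return max(gaps, key=lambda g: g[0])


# Made-up score sets resembling the two distributions described above.
good = [0.95, 0.80, 0.62, 0.20, 0.18]
poor = [0.60, 0.55, 0.50, 0.45, 0.40]

for name, scores in (("good", good), ("poor", poor)):
    gap, lower, upper = largest_gap(scores)
    print(f"{name}: widest gap {gap:.2f} between {lower} and {upper}")
```

A wide gap suggests a natural place for the score threshold (somewhere between `lower` and `upper`); a result set with no clear gap suggests that the chunking or search method needs attention before a threshold can be useful.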

Common Issues and Remedies

Symptom	Likely Cause	Remedy
No results returned	Score threshold too high, or content not indexed	Lower threshold; verify indexing completed
Irrelevant results ranked highly	Chunks too large, mixing topics	Reduce chunk size; add separators
Relevant content missing	Chunks too small, splitting key passages	Increase chunk size; add overlap
Duplicate near-identical results	High overlap between segments	Reduce overlap; enable deduplication
Good semantic but poor keyword matches	Using only semantic search	Switch to hybrid search
Good keyword but poor conceptual matches	Using only full-text search	Switch to semantic or hybrid search

Verification Workflow

A systematic verification process follows these steps:

1. Define test queries
   - Representative queries that users are expected to ask
   - Edge cases (ambiguous terms, multi-topic queries)
   - Known-answer queries (where the expected segment is known)

2. Run hit testing for each query

3. Evaluate results
   - Are expected segments in the top-k?
   - Are relevance scores appropriately distributed?
   - Are any irrelevant segments ranked too highly?

4. Iterate on configuration
   - Adjust chunking, indexing, or retrieval parameters
   - Re-run verification
   - Compare before/after results

5. Approve for production
   - When all test queries return satisfactory results
   - Document the final configuration and test results
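
Steps 1 to 3 can be automated with a small harness for known-answer queries. The sketch below assumes a `retrieve` callable that returns ranked `(segment_id, score)` pairs; `fake_retrieve` is a stand-in with canned data, and in practice it would be replaced by a call to whatever hit testing interface is available. All names and segment IDs here are hypothetical.

```python
def run_suite(retrieve, test_cases, top_k=3):
    """Run each known-answer query and record the rank of the expected segment."""
    report = []
    for query, expected_id in test_cases:
        hit_ids = [seg_id for seg_id, _score in retrieve(query)[:top_k]]
        rank = hit_ids.index(expected_id) + 1 if expected_id in hit_ids else None
        report.append({"query": query, "expected": expected_id, "rank": rank})
    return report


def fake_retrieve(query):
    """Stand-in retriever returning canned (segment_id, score) pairs."""
    canned = {
        "refund window": [("seg-refund", 0.91), ("seg-order", 0.78)],
        "holiday hours": [("seg-shipping", 0.41), ("seg-order", 0.30)],
    }
    return canned.get(query, [])


cases = [("refund window", "seg-refund"), ("holiday hours", "seg-hours")]
report = run_suite(fake_retrieve, cases)
failures = [row for row in report if row["rank"] is None]

for row in report:
    status = "PASS" if row["rank"] else "FAIL"
    print(f"{status}  {row['query']!r} -> rank {row['rank']}")
```

Saving the report after each configuration change gives the before/after comparison called for in step 4, and a run with no failures is a reasonable precondition for step 5.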

Quantitative Metrics

For rigorous verification, operators can compute standard information retrieval metrics from hit testing results:

Metric	Definition	Interpretation
Precision@k	Fraction of top-k results that are relevant	Higher is better; measures result quality
Recall@k	Fraction of all relevant segments that appear in top-k	Higher is better; measures coverage
MRR (Mean Reciprocal Rank)	Average, across all test queries, of 1/rank of the first relevant result (0 if none is returned)	Higher is better; measures how quickly a relevant result appears
NDCG@k (Normalized Discounted Cumulative Gain)	Discounted cumulative gain of the top-k results, divided by the gain of an ideal ranking	Higher is better; measures ranking quality by rewarding relevant results placed nearer the top

While Dify's hit testing UI does not compute these metrics automatically, operators can derive them by comparing hit testing results against a set of known-relevant segments.
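
As an illustration, the four metrics can be computed from a ranked list of segment IDs and a set of known-relevant IDs. The sketch below uses binary relevance (a segment is either relevant or not); the function names and sample IDs are assumptions, and the ranked list would come from hit testing output.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for s in ranked[:k] if s in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant segments that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for s in ranked[:k] if s in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant result, or 0.0 if none is found."""
    for position, s in enumerate(ranked, start=1):
        if s in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the top-k divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, s in enumerate(ranked[:k], start=1)
        if s in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(p + 1) for p in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Made-up example: segments "a" and "b" are retrieved, "c" is missed.
ranked = ["a", "x", "b", "y"]
relevant = {"a", "b", "c"}

print(f"Precision@3: {precision_at_k(ranked, relevant, 3):.3f}")
print(f"Recall@3:    {recall_at_k(ranked, relevant, 3):.3f}")
print(f"MRR:         {mean_reciprocal_rank([(ranked, relevant)]):.3f}")
print(f"NDCG@3:      {ndcg_at_k(ranked, relevant, 3):.3f}")
```

Tracking these numbers across configuration changes turns the before/after comparison into a measurable regression check rather than a visual inspection.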

Related Pages

Implemented By

Implementation:Langgenius_Dify_HitTesting
