Principle:Langgenius Dify Knowledge Base Verification
| Knowledge Sources | |
|---|---|
| Domains | RAG, Quality Assurance, Retrieval Testing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Knowledge base verification is the practice of validating retrieval quality through test queries, ensuring that a knowledge base returns relevant, accurately scored results before it is used in production applications.
Description
Building a knowledge base involves many configuration decisions: data source selection, chunking parameters, indexing method, and retrieval settings. Each of these decisions affects the quality of results that the knowledge base returns. Knowledge base verification (also called hit testing or retrieval debugging) is the final quality gate in the creation workflow -- it allows an operator to submit test queries and inspect the retrieved chunks before connecting the knowledge base to a live application.
Verification serves several purposes:
- Relevance validation -- confirming that expected chunks appear in the result set for representative queries.
- Scoring analysis -- examining relevance scores to understand how confidently the retriever matches queries to segments.
- Configuration feedback -- providing empirical evidence for tuning decisions (adjusting top-k, threshold, chunk size, or search method).
- Regression detection -- after re-processing or configuration changes, verifying that retrieval quality has not degraded.
Without verification, operators deploy knowledge bases blindly, discovering quality issues only when end users report incorrect or irrelevant responses from the LLM application.
Usage
Perform knowledge base verification when:
- A new knowledge base has finished indexing -- before connecting it to any application, run a suite of representative queries to validate retrieval quality.
- Processing configuration has changed -- after adjusting chunk size, overlap, or separators, re-run verification to measure the impact.
- Retrieval configuration has changed -- after modifying search method, reranking settings, or score thresholds, verify that results have actually improved and that no previously passing query now fails.
- New documents have been added -- confirm that the new content is retrievable and does not displace previously relevant results.
- Users report quality issues -- use hit testing to diagnose whether the problem is in retrieval (wrong chunks returned) or generation (LLM misinterpreting correct chunks).
Theoretical Basis
Hit Testing Model
Hit testing submits a query against a specific dataset and returns the matched segments with their scores:
Input:
- dataset_id: identifies the knowledge base to test
- query_text: the test query string
- retrieval_model: the retrieval configuration to use
Process:
1. Encode query_text using the dataset's embedding model (semantic and hybrid search only)
2. Search the dataset's index using the specified retrieval configuration
3. Apply reranking if configured
4. Apply score threshold filtering
5. Return top-k results
Output:
- records[]: array of matched segments, each containing:
  - segment text (the actual chunk content)
  - relevance score (0.0 to 1.0)
  - word count
  - hit count (how many times this segment has been matched)
  - segment metadata (document name, position, etc.)
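The five process steps above can be sketched in a few lines of Python. This is an illustrative in-memory model, not Dify's implementation: `embed`, `cosine`, `hit_test`, and the `Record` fields are hypothetical names, the bag-of-words vector stands in for a real embedding model, and the reranking step is reduced to a plain sort by score.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Record:
    text: str        # the chunk content
    score: float     # relevance score in [0.0, 1.0]
    word_count: int


def embed(text):
    # Toy stand-in for an embedding model (assumption): bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hit_test(segments, query_text, top_k=3, score_threshold=0.0):
    query_vec = embed(query_text)                              # 1. encode the query
    records = [
        Record(s, cosine(query_vec, embed(s)), len(s.split()))  # 2. search the index
        for s in segments
    ]
    records.sort(key=lambda r: r.score, reverse=True)          # 3. rerank (here: sort only)
    kept = [r for r in records if r.score >= score_threshold]  # 4. threshold filter
    return kept[:top_k]                                        # 5. top-k


segments = [
    "reset your password from the login page",
    "billing invoices are sent monthly",
    "password rules require twelve characters",
]
results = hit_test(segments, "reset password", top_k=2, score_threshold=0.1)
for r in results:
    print(f"{r.score:.2f}  {r.text}")
```

The segment about billing shares no terms with the query, so the threshold removes it before top-k is applied.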
Interpreting Results
Score Distribution
A healthy knowledge base produces a clear separation between relevant and irrelevant segments:
[Figure: two charts of relevance score (y-axis, 0.2 to 1.0) per segment (x-axis). Good distribution: segments appear at 0.6 and above and at 0.2, with none at 0.4. Poor distribution: all segments fall between 0.2 and 0.6, with none at 0.8 or above.]
In the good distribution, highly relevant segments score well above the threshold and irrelevant ones score well below. In the poor distribution, scores are clustered in the mid-range, making it hard to distinguish relevant from irrelevant results.
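One rough way to quantify this separation is to measure the gap between the lowest score that passes the threshold and the highest score that does not. The sketch below assumes that definition; `separation_gap` is a hypothetical helper, and the score lists are made-up examples.

```python
def separation_gap(scores, threshold):
    """Gap between the lowest score at or above the threshold and the
    highest score below it. Returns None if either side is empty."""
    above = [s for s in scores if s >= threshold]
    below = [s for s in scores if s < threshold]
    if not above or not below:
        return None
    return min(above) - max(below)


# Made-up scores for illustration only.
good_scores = [0.95, 0.85, 0.70, 0.20, 0.15]
poor_scores = [0.58, 0.55, 0.52, 0.48, 0.45]

good_gap = separation_gap(good_scores, threshold=0.5)
poor_gap = separation_gap(poor_scores, threshold=0.5)
print(f"good gap: {good_gap:.2f}, poor gap: {poor_gap:.2f}")
```

A wide gap means small threshold changes will not flip which segments are returned. A narrow gap means the threshold is cutting through a cluster of near-identical scores.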
Common Issues and Remedies
| Symptom | Likely Cause | Remedy |
|---|---|---|
| No results returned | Score threshold too high, or content not indexed | Lower threshold; verify indexing completed |
| Irrelevant results ranked highly | Chunks too large, mixing topics | Reduce chunk size; add separators |
| Relevant content missing | Chunks too small, splitting key passages | Increase chunk size; add overlap |
| Duplicate near-identical results | High overlap between segments | Reduce overlap; enable deduplication |
| Good semantic but poor keyword matches | Using only semantic search | Switch to hybrid search |
| Good keyword but poor conceptual matches | Using only full-text search | Switch to semantic or hybrid search |
Verification Workflow
A systematic verification process follows these steps:
1. Define test queries
   - Representative queries that users are expected to ask
   - Edge cases (ambiguous terms, multi-topic queries)
   - Known-answer queries (where the expected segment is known)
2. Run hit testing for each query
3. Evaluate results
   - Are expected segments in the top-k?
   - Are relevance scores appropriately distributed?
   - Are any irrelevant segments ranked too highly?
4. Iterate on configuration
   - Adjust chunking, indexing, or retrieval parameters
   - Re-run verification
   - Compare before/after results
5. Approve for production
   - When all test queries return satisfactory results
   - Document the final configuration and test results
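The known-answer and before/after steps of this workflow lend themselves to a small harness. The following is a minimal sketch under stated assumptions: `retrieve` is any callable that maps a query to a ranked list of segment IDs (for example, a wrapper around a hit testing call), and `run_verification`, `compare_runs`, and the segment IDs are hypothetical.

```python
def run_verification(retrieve, cases, top_k=3):
    """For each known-answer query, record whether the expected
    segment ID appears in the top-k retrieved IDs."""
    return {
        query: expected in retrieve(query)[:top_k]
        for query, expected in cases.items()
    }


def compare_runs(before, after):
    """Return (regressions, fixes): queries that passed before but fail
    now, and queries that failed before but pass now."""
    regressions = sorted(q for q in before if before[q] and not after.get(q, False))
    fixes = sorted(q for q in after if after[q] and not before.get(q, False))
    return regressions, fixes


# Known-answer test set: query -> expected segment ID (made-up values).
cases = {
    "how do I reset my password": "seg-12",
    "when are invoices sent": "seg-40",
}

# Canned results standing in for two retrieval configurations.
old_results = {
    "how do I reset my password": ["seg-12", "seg-03"],
    "when are invoices sent": ["seg-07", "seg-08"],
}
new_results = {
    "how do I reset my password": ["seg-03", "seg-05"],
    "when are invoices sent": ["seg-40"],
}

before = run_verification(lambda q: old_results[q], cases)
after = run_verification(lambda q: new_results[q], cases)
regressions, fixes = compare_runs(before, after)
print("regressions:", regressions)
print("fixes:", fixes)
```

Keeping the test set and the per-run results alongside the configuration makes the final "document the test results" step a matter of saving these dictionaries.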
Quantitative Metrics
For rigorous verification, operators can compute standard information retrieval metrics from hit testing results:
| Metric | Definition | Interpretation |
|---|---|---|
| Precision@k | Fraction of top-k results that are relevant | Higher is better; measures result quality |
| Recall@k | Fraction of all relevant segments that appear in top-k | Higher is better; measures coverage |
| MRR (Mean Reciprocal Rank) | Average, over all test queries, of 1/rank of the first relevant result (0 if none is returned) | Higher is better; measures how quickly a relevant result appears |
| NDCG@k (Normalized Discounted Cumulative Gain) | Discounted cumulative gain of the top-k results divided by the gain of an ideal ordering, so relevant results at lower ranks contribute less | Higher is better; measures ranking quality |
While Dify's hit testing UI does not compute these metrics automatically, operators can derive them by comparing hit testing results against a set of known-relevant segments.
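As an illustration, the four metrics can be computed from a ranked list of retrieved segment IDs and a set of known-relevant IDs. The sketch below assumes binary relevance (a segment is either relevant or not) and uses hypothetical function names and segment IDs.

```python
import math


def precision_at_k(ranked, relevant, k):
    # Fraction of the top-k results that are relevant.
    return sum(1 for s in ranked[:k] if s in relevant) / k if k else 0.0


def recall_at_k(ranked, relevant, k):
    # Fraction of all relevant segments that appear in the top-k.
    if not relevant:
        return 0.0
    return sum(1 for s in ranked[:k] if s in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    # 1/rank of the first relevant result, or 0.0 if none is returned.
    for rank, seg in enumerate(ranked, start=1):
        if seg in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    # runs: list of (ranked, relevant) pairs, one per test query.
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs) if runs else 0.0


def ndcg_at_k(ranked, relevant, k):
    # Binary-gain NDCG: DCG of the actual ranking over DCG of an ideal one.
    dcg = sum(1.0 / math.log2(i + 2) for i, s in enumerate(ranked[:k]) if s in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


ranked = ["s3", "s1", "s7", "s2"]   # hit testing output, best first
relevant = {"s1", "s2", "s9"}       # known-relevant segments

print(precision_at_k(ranked, relevant, 3))
print(recall_at_k(ranked, relevant, 3))
print(mean_reciprocal_rank([(ranked, relevant), (["s1"], {"s1"})]))
print(ndcg_at_k(ranked, relevant, 3))
```

Here only `s1` is both relevant and in the top 3, so Precision@3 and Recall@3 are both one third, and the first relevant hit at rank 2 gives a reciprocal rank of 0.5 for that query.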