Implementation:FlagOpen FlagEmbedding LLM Embedder Eval LRLM
| Knowledge Sources | |
|---|---|
| Domains | Large_Language_Models, Long_Context, Retrieval_Augmentation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Evaluation script for Long-Range Language Modeling (LRLM) with self-retrieval augmentation. It measures perplexity on long documents.
Description
This implementation evaluates language models on long-context understanding by computing perplexity on documents up to 160K tokens. The SelfRetrievalLM wrapper augments the base LM with retrieval capabilities:
- Chunks long documents into smaller segments
- Uses a retriever (dense or BM25) to find relevant chunks based on current context
- Integrates retrieved information into the generation context
- Computes perplexity on held-out target tokens
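The steps above can be sketched roughly as follows. This is a minimal illustration, not the SelfRetrievalLM code: the function names are hypothetical, a simple token-overlap score stands in for the dense or BM25 retriever, and placing retrieved chunks in document order before the current context is only one assumed integration choice.

```python
from collections import Counter


def chunk_tokens(tokens, chunk_size=128):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def overlap_score(query, chunk):
    """Stand-in relevance score: size of the multiset intersection of tokens."""
    return sum((Counter(query) & Counter(chunk)).values())


def retrieve(query, chunks, key_num=8):
    """Return the indices of the key_num highest-scoring chunks, best first."""
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: overlap_score(query, chunks[i]),
        reverse=True,
    )
    return ranked[:key_num]


def build_context(query, chunks, key_num=8):
    """Prepend the retrieved chunks, restored to document order, to the query."""
    picked = sorted(retrieve(query, chunks, key_num))
    context = []
    for i in picked:
        context.extend(chunks[i])
    return context + list(query)
```

Here `query` plays the role of the current context and `chunks` the earlier parts of the same document, which is what makes the retrieval "self" retrieval. A real retriever would score chunks with embeddings or BM25 rather than raw overlap.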
The evaluation supports several retrieval methods (dense encoders, BM25, or no retrieval), different chunk integration strategies (controlled by order_method and integrate_method), and optional task-specific instructions. During data processing, each document is truncated from the left to fit the context window, so the target tokens at the end are preserved for perplexity computation.
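A minimal sketch of left-truncation with a preserved target span might look like the following. The helper name `left_truncate` is hypothetical, and masking context positions with -100 (the default ignore index of PyTorch's cross-entropy loss) is an assumption about how the loss would be restricted to target tokens, not a description of the script itself.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def left_truncate(token_ids, context_max_length=4096, target_length=1024):
    """Keep the last context_max_length tokens and mark the tail as targets.

    Returns (input_ids, labels). Labels equal the input ids on the final
    target_length positions and IGNORE_INDEX everywhere else, so only the
    target tokens would contribute to the loss.
    """
    if context_max_length <= 0:
        raise ValueError("context_max_length must be positive")
    if target_length > context_max_length:
        raise ValueError("target_length cannot exceed context_max_length")

    # Drop tokens from the left so the end of the document survives.
    input_ids = list(token_ids[-context_max_length:])

    # Short documents may have fewer tokens than target_length.
    n_target = min(target_length, len(input_ids))
    n_context = len(input_ids) - n_target

    labels = [IGNORE_INDEX] * n_context + input_ids[n_context:]
    return input_ids, labels
```

Truncating from the left rather than the right matters here because the tokens being scored sit at the end of the sequence.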
Usage
Use this to evaluate how well retrieval-augmented LMs handle long-context language modeling tasks, measuring whether self-retrieval improves perplexity on lengthy documents.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/llm_embedder/evaluation/eval_lrlm.py
- Lines: 1-190
Signature
def process_lrlm(tokenizer, context_max_length=4096, target_length=1024, anchor_length=160000)
def main() # Entry point with LRLMArgs configuration
Import
from research.llm_embedder.evaluation.eval_lrlm import main
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| eval_data | str | Yes | Path to JSON file with long text documents |
| model_name_or_path | str | Yes | HuggingFace model name or path |
| retrieval_method | str | Yes | Retrieval method: "dense", "naive-bm25", or "no" |
| context_max_length | int | No | Maximum context length in tokens (default: 32768; the process_lrlm signature itself defaults to 4096) |
| anchor_length | int | No | Maximum document length to process (default: 160000) |
| chunk_size | int | No | Chunk size for retrieval (default: 128) |
| key_num | int | No | Number of chunks to retrieve (default: 8) |
Outputs
| Name | Type | Description |
|---|---|---|
| perplexity | float | Perplexity on target tokens |
| log_file | JSON | Results saved to log_path with configuration |
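For reference, perplexity is the exponential of the mean negative log-likelihood over the scored tokens. The sketch below shows that calculation in plain Python. The function name and the assumption that per-token natural-log probabilities are already available are illustrative and not taken from the script.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of the target tokens.

    perplexity = exp(-(1 / N) * sum(log p_i)) over the N target tokens.
    """
    if not token_log_probs:
        raise ValueError("need at least one target token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

A model that assigns probability 0.5 to every target token has perplexity 2, and lower values mean the model finds the target text less surprising.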
Usage Examples
# Command line usage
python research/llm_embedder/evaluation/eval_lrlm.py \
    --eval_data llm-embedder:lrlm/books3/test.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --retrieval_method dense \
    --query_encoder BAAI/bge-large-en-v1.5 \
    --context_max_length 4096 \
    --target_length 1024 \
    --anchor_length 160000 \
    --chunk_size 128 \
    --key_num 8 \
    --log_path data/results/lrlm
# Results (value shown is illustrative): {"perplexity": 12.34, ...}