Implementation:FlagOpen FlagEmbedding LLM Embedder Eval LRLM

Knowledge Sources	FlagOpen_FlagEmbedding
Domains	Large_Language_Models, Long_Context, Retrieval_Augmentation
Last Updated	2026-02-09 00:00 GMT

Overview

Evaluation script for Long-Range Language Modeling (LRLM) with self-retrieval augmentation. It measures perplexity on long documents.

Description

This implementation evaluates language models on long-context understanding by computing perplexity on documents up to 160K tokens. The SelfRetrievalLM wrapper augments the base LM with retrieval capabilities:

Chunks long documents into smaller segments
Uses a retriever (dense or BM25) to find relevant chunks based on current context
Integrates retrieved information into the generation context
Computes perplexity on held-out target tokens
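
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the SelfRetrievalLM code: the helper names (chunk_tokens, overlap_score, retrieve_chunks) are hypothetical, and a simple token-overlap count stands in for the dense or BM25 retriever.

```python
from collections import Counter


def chunk_tokens(tokens, chunk_size=128):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def overlap_score(query, chunk):
    """Count tokens shared by query and chunk (stand-in for a real retriever score)."""
    q, c = Counter(query), Counter(chunk)
    return sum(min(q[t], c[t]) for t in q)


def retrieve_chunks(history, query, chunk_size=128, key_num=8):
    """Return up to key_num chunks of history ranked by overlap with query.

    The selected chunks are returned in their original document order,
    which is only one possible ordering strategy.
    """
    chunks = chunk_tokens(history, chunk_size)
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: overlap_score(query, chunks[i]),
        reverse=True,
    )
    top = sorted(ranked[:key_num])
    return [chunks[i] for i in top]


# Toy example with whitespace "tokens" (hypothetical data).
history = "the cat sat on the mat while the dog slept near the door".split()
query = "where did the dog sleep".split()
retrieved = retrieve_chunks(history, query, chunk_size=4, key_num=2)

# Integrate: prepend the retrieved chunks to the current context.
context = [tok for chunk in retrieved for tok in chunk] + query
```

In the real script the retriever score, the chunk ordering, and the way chunks are merged into the context are configurable; the sketch fixes one simple choice for each.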

The evaluation supports several retrieval methods (dense encoders, BM25, or no retrieval), configurable chunk integration strategies (order_method, integrate_method), and optional task-specific instructions. During data processing, each document is left-truncated to fit the context window, so the target tokens used for perplexity computation are preserved.
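
A minimal sketch of left-truncation with preserved targets, assuming the document is already a list of token IDs. The function left_truncate and the -100 ignore label are illustrative assumptions (-100 is the default ignore index of PyTorch's cross-entropy loss); this is not the body of process_lrlm.

```python
IGNORE_INDEX = -100  # assumption: label value that the loss skips


def left_truncate(token_ids, context_max_length=4096, target_length=1024):
    """Keep the last context_max_length tokens; score only the final target_length.

    Returns (input_ids, labels), where labels mask every non-target
    position with IGNORE_INDEX.
    """
    if not 0 < target_length <= context_max_length:
        raise ValueError("require 0 < target_length <= context_max_length")
    kept = token_ids[-context_max_length:]      # drop tokens from the left
    n_target = min(target_length, len(kept))    # short documents: all tokens are targets
    n_context = len(kept) - n_target
    labels = [IGNORE_INDEX] * n_context + kept[n_context:]
    return kept, labels


# Toy example: 10 tokens, window of 6, last 2 tokens are targets.
input_ids, labels = left_truncate(list(range(10)), context_max_length=6, target_length=2)
```

Truncating from the left keeps the target span at the end of the window intact, which is why the context, not the target, absorbs the length limit.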

Usage

Use this to evaluate how well retrieval-augmented LMs handle long-context language modeling tasks, measuring whether self-retrieval improves perplexity on lengthy documents.

Code Reference

Source Location

Repository: FlagOpen_FlagEmbedding
File: research/llm_embedder/evaluation/eval_lrlm.py
Lines: 1-190

Signature

def process_lrlm(tokenizer, context_max_length=4096, target_length=1024,
                 anchor_length=160000)

def main()  # Entry point with LRLMArgs configuration

Import

from research.llm_embedder.evaluation.eval_lrlm import main

I/O Contract

Inputs

Name	Type	Required	Description
eval_data	str	Yes	Path to JSON file with long text documents
model_name_or_path	str	Yes	HuggingFace model name or path
retrieval_method	str	Yes	Retrieval method: "dense", "naive-bm25", or "no"
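context_max_length	int	No	Maximum context length (argument default: 32768; the process_lrlm signature defaults to 4096)
target_length	int	No	Number of target tokens scored for perplexity (process_lrlm default: 1024)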
anchor_length	int	No	Maximum document length to process (default: 160000)
chunk_size	int	No	Chunk size for retrieval (default: 128)
key_num	int	No	Number of chunks to retrieve (default: 8)

Outputs

Name	Type	Description
perplexity	float	Perplexity on target tokens
log_file	JSON	Results saved to log_path with configuration
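
For reference, perplexity is the exponential of the mean negative log-likelihood over the scored target tokens. The sketch below assumes per-token NLL values in nats are already available; perplexity_from_nll is a hypothetical helper, and how the script aggregates across documents is not shown here.

```python
import math


def perplexity_from_nll(token_nlls):
    """Return exp(mean NLL) for a non-empty sequence of per-token NLLs (in nats)."""
    if not token_nlls:
        raise ValueError("token_nlls must not be empty")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 1/2 to every target token has perplexity 2.
ppl = perplexity_from_nll([math.log(2)] * 4)
```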

Usage Examples

# Command line usage
python research/llm_embedder/evaluation/eval_lrlm.py \
    --eval_data llm-embedder:lrlm/books3/test.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --retrieval_method dense \
    --query_encoder BAAI/bge-large-en-v1.5 \
    --context_max_length 4096 \
    --target_length 1024 \
    --anchor_length 160000 \
    --chunk_size 128 \
    --key_num 8 \
    --log_path data/results/lrlm

# Output shape (the value is illustrative, not a measured result): {"perplexity": 12.34, ...}
