Implementation:FlagOpen FlagEmbedding LLM Embedder Eval LRLM

From Leeroopedia


Knowledge Sources
Domains: Large_Language_Models, Long_Context, Retrieval_Augmentation
Last Updated: 2026-02-09 00:00 GMT

Overview

Evaluation script for Long-Range Language Modeling (LRLM) with self-retrieval augmentation. It measures perplexity on long documents.

Description

This implementation evaluates how well language models use long context by computing perplexity on documents of up to 160K tokens (the default anchor_length). The SelfRetrievalLM wrapper augments the base LM with retrieval over the document itself:

  • Chunks long documents into smaller segments
  • Uses a retriever (dense or BM25) to find relevant chunks based on current context
  • Integrates retrieved information into the generation context
  • Computes perplexity on held-out target tokens
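The chunk, retrieve, and integrate steps listed above can be sketched in a few lines. This is a minimal illustration, not the SelfRetrievalLM implementation: the function names are hypothetical, a simple token-overlap score stands in for the dense or BM25 retriever, and the choices to restore document order and prepend retrieved chunks are assumptions (the real script exposes order_method and integrate_method for these).

```python
from collections import Counter


def chunk_tokens(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def overlap_score(query, chunk):
    """Toy relevance score: size of the multiset intersection of tokens.

    Assumption: stands in for a dense or BM25 retriever score.
    """
    return sum((Counter(query) & Counter(chunk)).values())


def retrieve(history, query, chunk_size=128, key_num=8):
    """Return the key_num highest-scoring chunks of history, in document order."""
    chunks = chunk_tokens(history, chunk_size)
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: overlap_score(query, chunks[i]),
        reverse=True,
    )
    top = sorted(ranked[:key_num])  # assumption: restore original document order
    return [chunks[i] for i in top]


def build_context(history, recent, chunk_size=128, key_num=8):
    """Prepend retrieved chunks to the recent context (assumed integration strategy)."""
    retrieved = retrieve(history, recent, chunk_size, key_num)
    return [tok for chunk in retrieved for tok in chunk] + recent
```

Here `history` is the part of the document that no longer fits in the context window and `recent` is the current context used as the query. The defaults mirror the chunk_size and key_num defaults in the I/O Contract.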

The evaluation supports several retrieval methods (dense encoders, BM25, or no retrieval), different chunk integration strategies (order_method, integrate_method), and optional task-specific instructions. During data processing, each document is left-truncated to fit the context window, and the target tokens at the end of the sequence are preserved for perplexity computation.
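Left-truncation with a preserved target can be illustrated as follows. This is a sketch under stated assumptions, not the body of process_lrlm: the helper name is hypothetical, and masking non-target positions with -100 follows the common PyTorch and Hugging Face ignore-index convention rather than anything confirmed by the source.

```python
IGNORE_INDEX = -100  # assumption: conventional "ignore" label for cross-entropy loss


def left_truncate(token_ids, context_max_length, target_length):
    """Keep the last context_max_length tokens and label only the final target_length.

    Returns (input_ids, labels), where labels equal IGNORE_INDEX everywhere
    except on the trailing target tokens.
    """
    if context_max_length < 1 or target_length < 1:
        raise ValueError("lengths must be positive")
    if target_length > context_max_length:
        raise ValueError("target_length must fit inside context_max_length")

    # Drop tokens from the left so the end of the document survives.
    input_ids = token_ids[-context_max_length:]
    n_target = min(target_length, len(input_ids))
    n_context = len(input_ids) - n_target
    labels = [IGNORE_INDEX] * n_context + input_ids[n_context:]
    return input_ids, labels
```

Truncating from the left keeps the tokens closest to the target, so the loss is always computed on the same final span regardless of how much earlier text is dropped.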

Usage

Use this to evaluate how well retrieval-augmented LMs handle long-context language modeling tasks, measuring whether self-retrieval improves perplexity on lengthy documents.

Code Reference

Source Location

Signature

def process_lrlm(tokenizer, context_max_length=4096, target_length=1024,
                 anchor_length=160000)

def main()  # Entry point with LRLMArgs configuration

Import

from research.llm_embedder.evaluation.eval_lrlm import main

I/O Contract

Inputs

Name                Type  Required  Description
eval_data           str   Yes       Path to JSON file with long text documents
model_name_or_path  str   Yes       HuggingFace model name or path
retrieval_method    str   Yes       Retrieval method: "dense", "naive-bm25", or "no"
context_max_length  int   No        Maximum context length (default: 32768)
anchor_length       int   No        Maximum document length to process (default: 160000)
chunk_size          int   No        Chunk size for retrieval (default: 128)
key_num             int   No        Number of chunks to retrieve (default: 8)

Outputs

Name        Type   Description
perplexity  float  Perplexity on target tokens
log_file    JSON   Results saved to log_path with configuration
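Perplexity is conventionally the exponential of the mean negative log-likelihood (NLL) over the scored tokens. The sketch below assumes per-token NLLs in nats are already available for each document's target span. The function name is hypothetical, and the token-weighted aggregation across documents is an assumption, so the script's own aggregation may differ.

```python
import math


def perplexity_from_nlls(per_document_nlls):
    """Compute perplexity from per-token NLLs (in nats), pooled over all documents."""
    total_nll = 0.0
    total_tokens = 0
    for nlls in per_document_nlls:
        total_nll += sum(nlls)
        total_tokens += len(nlls)
    if total_tokens == 0:
        raise ValueError("no target tokens to score")
    # exp(mean NLL): lower values mean the model predicts the targets better.
    return math.exp(total_nll / total_tokens)
```

For example, if every target token has probability 1/4 under the model, each NLL is ln 4 and the perplexity is 4.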

Usage Examples

# Command line usage
python research/llm_embedder/evaluation/eval_lrlm.py \
    --eval_data llm-embedder:lrlm/books3/test.json \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --retrieval_method dense \
    --query_encoder BAAI/bge-large-en-v1.5 \
    --context_max_length 4096 \
    --target_length 1024 \
    --anchor_length 160000 \
    --chunk_size 128 \
    --key_num 8 \
    --log_path data/results/lrlm

# Results are saved under log_path, e.g. {"perplexity": 12.34, ...}
