

Principle:FlagOpen FlagEmbedding LLM Dense Retrieval Training

From Leeroopedia


Knowledge Sources
Domains: Machine Learning, Large Language Models, Information Retrieval, In-Context Learning
Last Updated: 2026-02-09 00:00 GMT

Overview

This principle covers training LLM-based dense retrievers that use the in-context learning and instruction-following capabilities of the underlying model to produce task-adaptive embeddings from instruction-prefixed inputs.

Description

This principle extends dense retrieval by using the instruction-following and in-context learning capabilities of large language models. Instead of training separate encoders for queries and documents, the approach uses a single LLM that produces task-specific embeddings conditioned on an instruction. For example, the model receives a prompt such as "Represent this query for retrieving relevant passages:" followed by the query text, or "Represent this document for retrieval:" followed by the document text. The LLM processes the prompted input, and the embedding is derived from its final hidden states, typically via pooling. Training combines a contrastive learning objective with instruction tuning, so the model learns to adapt its representations to the task description. A single model can then handle multiple retrieval tasks (web search, QA, semantic similarity) by changing only the instruction prompt.

Usage

Use this principle when:

  • Building instruction-aware retrieval systems
  • Leveraging LLM capabilities for embedding generation
  • Creating multi-task retrievers controlled via natural language prompts
  • Developing retrieval systems that benefit from in-context reasoning

Theoretical Basis

The LLM dense retrieval framework consists of:

  1. Instruction-based Encoding:
    • Query embedding: q = Pool(LLM(I_query + Q))
    • Document embedding: d = Pool(LLM(I_doc + D))
    • Where I_query, I_doc are instruction templates, Q and D are the raw query and document texts, and + denotes text concatenation
  2. Pooling Strategies:
    • Last token pooling: h = hidden_state[-1] (for decoder-only models; in a padded batch, use the last non-padding token)
    • Mean pooling: h = mean(hidden_states), averaged over non-padding tokens
    • Attention-weighted pooling: h = Σ_i α_i * hidden_state_i, where α_i are per-token weights
  3. Contrastive Training:
    • Similarity: s(q, d) = cosine(q, d)
    • Loss: L = -log(exp(s(q, d+)/τ) / Σ_i exp(s(q, d_i)/τ))
    • Where d+ is the positive document, the sum runs over the positive and all negative documents d_i, and τ is the temperature
    • Both q and d are computed from instruction-augmented inputs
  4. Multi-task Training:
    • Sample tasks: {task_1, task_2, ..., task_n}
    • Task-specific instructions: I_task
    • Unified loss: L = Σ_tasks L_task
  5. Instruction Templates:
    • Query: "Instruct: {task_description}\nQuery: {text}"
    • Document: "Document: {text}"
    • Task descriptions specify the retrieval objective (e.g., "retrieve scientific papers", "find relevant code")
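
The pooling and contrastive-training steps in the list above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not FlagEmbedding's implementation: the function names are hypothetical, sequences are assumed to be right-padded, negatives are assumed to come from the other documents in the batch, and the temperature of 0.05 is an arbitrary example value.

```python
import numpy as np


def last_token_pool(hidden, mask):
    """Take the hidden state of the last non-padding token of each sequence.

    hidden: (batch, seq_len, dim) final-layer hidden states.
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding.
    Assumes right padding, so the last real token is at index sum(mask) - 1.
    """
    last_idx = mask.sum(axis=1) - 1
    return hidden[np.arange(hidden.shape[0]), last_idx]


def mean_pool(hidden, mask):
    """Average the hidden states over real (non-padding) tokens only."""
    weights = mask[:, :, None].astype(hidden.dtype)
    return (hidden * weights).sum(axis=1) / weights.sum(axis=1)


def l2_normalize(x):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def info_nce_loss(q, d, temperature=0.05):
    """Contrastive loss with in-batch negatives (an assumption of this sketch).

    Row i of d is the positive for row i of q; all other rows act as negatives.
    """
    sims = l2_normalize(q) @ l2_normalize(d).T / temperature  # s(q, d_i) / tau
    sims = sims - sims.max(axis=1, keepdims=True)  # stabilise the softmax
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    # The diagonal holds log p(d+ | q) for each query in the batch.
    return float(-np.mean(np.diag(log_probs)))


# Toy usage: random arrays stand in for the LLM's final hidden states.
rng = np.random.default_rng(0)
query_hidden = rng.normal(size=(2, 4, 8))
doc_hidden = rng.normal(size=(2, 4, 8))
mask = np.array([[1, 1, 1, 0], [1, 1, 1, 1]])

q = last_token_pool(query_hidden, mask)  # shape (2, 8)
d = last_token_pool(doc_hidden, mask)    # shape (2, 8)
loss = info_nce_loss(q, d)
```

For the multi-task case, the same loss would be computed per task batch and the per-task values summed, matching the unified loss in item 4.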

The key advantage is task adaptability: the same model handles different retrieval scenarios by changing the instruction, without retraining.
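
Switching tasks by changing only the instruction can be illustrated with a short Python sketch built on the templates listed above. The `TASKS` registry and the helper names are hypothetical and introduced only for this example; the task descriptions are the two examples given in the template list.

```python
# Templates copied from the "Instruction Templates" item above.
QUERY_TEMPLATE = "Instruct: {task_description}\nQuery: {text}"
DOCUMENT_TEMPLATE = "Document: {text}"

# Hypothetical task registry: short task keys mapped to task descriptions.
TASKS = {
    "papers": "retrieve scientific papers",
    "code": "find relevant code",
}


def format_query(task, text):
    """Build the instruction-augmented query string for one task."""
    return QUERY_TEMPLATE.format(task_description=TASKS[task], text=text)


def format_document(text):
    """Build the document-side input; it carries no task description."""
    return DOCUMENT_TEMPLATE.format(text=text)


# The same query text is routed to two tasks by swapping the instruction only.
paper_query = format_query("papers", "contrastive learning for retrieval")
code_query = format_query("code", "contrastive learning for retrieval")
```

Both strings would be passed to the same encoder; only the instruction prefix differs, which is what lets one set of weights serve several retrieval scenarios.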
