Principle: Hugging Face PEFT Cosine Similarity Contrastive Training
Metadata
- Sources: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, LoRA: Low-Rank Adaptation of Large Language Models
- Domains: NLP, Semantic_Search
Overview
Cosine Similarity Contrastive Training is a training paradigm for learning dense text embeddings suitable for semantic search. It employs a bi-encoder architecture where queries and documents are independently encoded into fixed-size vectors, and their semantic similarity is measured using cosine similarity. A contrastive loss function trains the model so that semantically related pairs have high cosine similarity while unrelated pairs have low similarity. When combined with PEFT adapters (specifically LoRA with FEATURE_EXTRACTION task type), this approach enables efficient adaptation of pretrained language models for domain-specific semantic search.
Theoretical Foundation
Bi-Encoder Architecture
In a bi-encoder setup, both the query and the document pass through the same encoder model (with shared weights) to produce dense vector representations:
e_q = Encoder(query) # query embedding
e_d = Encoder(document) # document embedding
The encoder typically consists of a pretrained transformer followed by a mean pooling layer that aggregates token-level representations into a single vector, and an optional L2 normalization step that projects the embedding onto the unit hypersphere:
mean_pool(H, mask) = sum(H * mask_expanded, dim=1) / clamp(sum(mask_expanded, dim=1), min=1e-9)
embedding = L2_normalize(mean_pool(model_output, attention_mask))
Mean pooling weighted by the attention mask ensures that padding tokens do not contribute to the final representation. L2 normalization ensures that cosine similarity reduces to a simple dot product, improving numerical stability and search efficiency.
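The pooling and normalization steps above can be sketched directly in PyTorch. This is a minimal illustration (the tensor values are toy inputs, not real model outputs); `mean_pool` and `embed` are hypothetical helper names.

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden_states, attention_mask):
    # hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * mask).sum(dim=1)
    # clamp prevents division by zero for fully padded sequences
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

def embed(hidden_states, attention_mask):
    # L2-normalize so cosine similarity reduces to a dot product
    return F.normalize(mean_pool(hidden_states, attention_mask), p=2, dim=1)

# toy batch: one sequence of 3 tokens (hidden size 2), last token is padding
H = torch.tensor([[[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
e = embed(H, mask)
# the padded token is ignored: mean of [1,0] and [3,0] is [2,0], normalized to [1,0]
```

Note how the large values in the padding position have no effect on the result, which is exactly what the mask-weighted sum guarantees.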
Cosine Similarity
Cosine similarity measures the angle between two vectors in embedding space, independent of their magnitudes:
cos_sim(e_q, e_d) = (e_q . e_d) / (||e_q|| * ||e_d||)
When vectors are L2-normalized (as in the bi-encoder), this simplifies to the dot product:
cos_sim(e_q, e_d) = sum(e_q * e_d)
Values range from -1 (opposite) to 1 (identical direction), with 0 indicating orthogonality (no semantic relationship).
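A small sketch of these boundary cases, using L2-normalized vectors so the dot product equals cosine similarity (the vectors here are arbitrary toy values):

```python
import torch
import torch.nn.functional as F

# one query against three documents: same direction, opposite, orthogonal
q = F.normalize(torch.tensor([[3.0, 4.0]]), dim=1)        # (1, 2)
docs = F.normalize(torch.tensor([[3.0, 4.0],
                                 [-3.0, -4.0],
                                 [-4.0, 3.0]]), dim=1)    # (3, 2)

# for L2-normalized vectors, cosine similarity is just the dot product
scores = (q @ docs.T).squeeze(0)
# scores ≈ [1.0, -1.0, 0.0]: identical, opposite, orthogonal
```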
Contrastive Loss
The training loss encourages high cosine similarity for positive pairs (relevant query-document) and low similarity for negative pairs (irrelevant query-document). A common formulation is the cosine margin loss:
L = mean( labels * (1 - cos_sim)^2 + (1 - labels) * max(cos_sim, 0)^2 )
where:
- labels = 1 for positive pairs: the loss penalizes low cosine similarity
- labels = 0 for negative pairs: the loss penalizes positive cosine similarity (clamped at 0 to avoid penalizing already-negative similarities)
This loss has a margin-like behavior: negative pairs are only penalized when their cosine similarity is above zero, encouraging clear separation in the embedding space.
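The cosine margin loss above translates term by term into a few lines of PyTorch; `cosine_margin_loss` is a hypothetical name, and the similarity and label values are illustrative.

```python
import torch

def cosine_margin_loss(cos_sim, labels):
    # labels: 1.0 for positive pairs, 0.0 for negative pairs
    pos = labels * (1 - cos_sim) ** 2            # pull positives toward cos_sim = 1
    neg = (1 - labels) * torch.clamp(cos_sim, min=0) ** 2  # push positives-looking negatives below 0
    return (pos + neg).mean()

cos = torch.tensor([0.9, 0.2, -0.5])
labels = torch.tensor([1.0, 0.0, 0.0])
loss = cosine_margin_loss(cos, labels)
# positive at 0.9: (1-0.9)^2 = 0.01; negative at 0.2: 0.04;
# negative at -0.5: clamped to 0, no penalty. mean = 0.05 / 3
```

The clamp is what gives the margin-like behavior: the third pair already has negative similarity, so it contributes nothing to the loss.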
LoRA for Feature Extraction
PEFT's TaskType.FEATURE_EXTRACTION is designed for models that produce embeddings rather than token-level predictions. Unlike CAUSAL_LM or SEQ_CLS task types, FEATURE_EXTRACTION does not add or modify any head layers -- it only injects adapter weights into the encoder. For bi-encoder training, the typical target modules are the attention projection layers:
target_modules = ["key", "query", "value"]
These correspond to the key, query, and value projection matrices in the transformer's self-attention mechanism. By adapting only these projections, the model learns to produce domain-specific embeddings while keeping the rest of the architecture frozen.
Key Concepts
- Bi-Encoder vs. Cross-Encoder: Bi-encoders encode queries and documents independently, enabling precomputation of document embeddings for efficient retrieval. Cross-encoders process query-document pairs jointly, so document representations cannot be precomputed, making them practical only for reranking a small candidate set.
- Mean Pooling: Aggregates variable-length token representations into a fixed-size vector. Alternative pooling strategies include [CLS] token pooling and max pooling, but mean pooling generally produces stronger sentence embeddings.
- L2 Normalization: Projects embeddings onto the unit sphere, ensuring that cosine similarity equals the dot product. This simplifies nearest-neighbor search and makes similarity scores more interpretable.
- Embedding Space Geometry: Contrastive training shapes the embedding space so that semantically similar items cluster together. The quality of this geometry directly determines retrieval performance.
- FEATURE_EXTRACTION Task Type: Signals to PEFT that the model is used for embedding generation, not classification or generation. This ensures adapter layers are injected without modifying the model's output structure.
Practical Implications
- Bi-encoder models with LoRA adapters enable efficient domain adaptation for semantic search without retraining the full model
- L2-normalized embeddings allow the use of optimized approximate nearest-neighbor libraries (FAISS, ScaNN) for production retrieval
- The cosine margin loss naturally handles varying degrees of relevance when labels encode relevance scores rather than binary decisions
- At inference time, query encoding is the only online computation -- document embeddings are precomputed and indexed offline
- Small LoRA ranks (e.g., r=8) are typically sufficient for embedding adaptation, as the pretrained model already encodes rich semantic structure
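The offline/online split described above can be sketched as a dot-product search over precomputed, L2-normalized document embeddings (the embedding matrix here is a toy stand-in for encoder output, and `search` is a hypothetical helper; a production system would use FAISS or ScaNN instead of dense matmul):

```python
import torch
import torch.nn.functional as F

# offline: document embeddings are encoded once, normalized, and indexed
doc_embeddings = F.normalize(torch.tensor([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]), dim=1)

def search(query_embedding, doc_embeddings, top_k=2):
    # cosine similarity over normalized vectors is a single matrix-vector product
    scores = doc_embeddings @ query_embedding
    return torch.topk(scores, k=top_k)

# online: only the query is encoded at request time
q = F.normalize(torch.tensor([0.9, 0.1, 0.0]), dim=0)
top = search(q, doc_embeddings)
# top.indices ranks document 0 first (closest direction), then document 2
```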