Principle: Hugging Face PEFT Cosine Similarity Contrastive Training
Metadata
- Sources: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, LoRA: Low-Rank Adaptation of Large Language Models
- Domains: NLP, Semantic_Search
Overview
Cosine Similarity Contrastive Training is a training paradigm for learning dense text embeddings suitable for semantic search. It employs a bi-encoder architecture where queries and documents are independently encoded into fixed-size vectors, and their semantic similarity is measured using cosine similarity. A contrastive loss function trains the model so that semantically related pairs have high cosine similarity while unrelated pairs have low similarity. When combined with PEFT adapters (specifically LoRA with FEATURE_EXTRACTION task type), this approach enables efficient adaptation of pretrained language models for domain-specific semantic search.
Theoretical Foundation
Bi-Encoder Architecture
In a bi-encoder setup, both the query and the document pass through the same encoder model (with shared weights) to produce dense vector representations:
e_q = Encoder(query) # query embedding
e_d = Encoder(document) # document embedding
The encoder typically consists of a pretrained transformer followed by a mean pooling layer that aggregates token-level representations into a single vector, and an optional L2 normalization step that projects the embedding onto the unit hypersphere:
mean_pool(H, mask) = sum(H * mask_expanded, dim=1) / clamp(sum(mask_expanded, dim=1), min=1e-9)
embedding = L2_normalize(mean_pool(model_output, attention_mask))
Mean pooling weighted by the attention mask ensures that padding tokens do not contribute to the final representation. L2 normalization ensures that cosine similarity reduces to a simple dot product, improving numerical stability and search efficiency.
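The pooling and normalization steps above can be sketched directly in PyTorch. This is a minimal illustration (the tensor values are toy inputs, not real model outputs); `mean_pool` and `embed` are hypothetical helper names.

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden_states, attention_mask):
    # hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * mask).sum(dim=1)
    # clamp prevents division by zero for fully padded sequences
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

def embed(hidden_states, attention_mask):
    # L2-normalize so cosine similarity reduces to a dot product
    return F.normalize(mean_pool(hidden_states, attention_mask), p=2, dim=1)

# toy batch: one sequence of 3 tokens (hidden size 2), last token is padding
H = torch.tensor([[[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
e = embed(H, mask)
# the padded token is ignored: mean of [1,0] and [3,0] is [2,0], normalized to [1,0]
```

Note how the large values in the padding position have no effect on the result, which is exactly what the mask-weighted sum guarantees.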
Cosine Similarity
Cosine similarity measures the angle between two vectors in embedding space, independent of their magnitudes:
cos_sim(e_q, e_d) = (e_q . e_d) / (||e_q|| * ||e_d||)
When vectors are L2-normalized (as in the bi-encoder), this simplifies to the dot product:
cos_sim(e_q, e_d) = sum(e_q * e_d)
Values range from -1 (opposite) to 1 (identical direction), with 0 indicating orthogonality (no semantic relationship).
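A small sketch of these boundary cases, using L2-normalized vectors so the dot product equals cosine similarity (the vectors here are arbitrary toy values):

```python
import torch
import torch.nn.functional as F

# one query against three documents: same direction, opposite, orthogonal
q = F.normalize(torch.tensor([[3.0, 4.0]]), dim=1)        # (1, 2)
docs = F.normalize(torch.tensor([[3.0, 4.0],
                                 [-3.0, -4.0],
                                 [-4.0, 3.0]]), dim=1)    # (3, 2)

# for L2-normalized vectors, cosine similarity is just the dot product
scores = (q @ docs.T).squeeze(0)
# scores ≈ [1.0, -1.0, 0.0]: identical, opposite, orthogonal
```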
Contrastive Loss
The training loss encourages high cosine similarity for positive pairs (relevant query-document) and low similarity for negative pairs (irrelevant query-document). A common formulation is the cosine margin loss:
L = mean( labels * (1 - cos_sim)^2 + (1 - labels) * max(cos_sim, 0)^2 )
where:
- labels = 1 for positive pairs: the loss penalizes low cosine similarity
- labels = 0 for negative pairs: the loss penalizes positive cosine similarity (clamped at 0 to avoid penalizing already-negative similarities)
This loss has a margin-like behavior: negative pairs are only penalized when their cosine similarity is above zero, encouraging clear separation in the embedding space.
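The cosine margin loss above translates term by term into a few lines of PyTorch; `cosine_margin_loss` is a hypothetical name, and the similarity and label values are illustrative.

```python
import torch

def cosine_margin_loss(cos_sim, labels):
    # labels: 1.0 for positive pairs, 0.0 for negative pairs
    pos = labels * (1 - cos_sim) ** 2            # pull positives toward cos_sim = 1
    neg = (1 - labels) * torch.clamp(cos_sim, min=0) ** 2  # push positives-looking negatives below 0
    return (pos + neg).mean()

cos = torch.tensor([0.9, 0.2, -0.5])
labels = torch.tensor([1.0, 0.0, 0.0])
loss = cosine_margin_loss(cos, labels)
# positive at 0.9: (1-0.9)^2 = 0.01; negative at 0.2: 0.04;
# negative at -0.5: clamped to 0, no penalty. mean = 0.05 / 3
```

The clamp is what gives the margin-like behavior: the third pair already has negative similarity, so it contributes nothing to the loss.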
LoRA for Feature Extraction
PEFT's TaskType.FEATURE_EXTRACTION is designed for models that produce embeddings rather than token-level predictions. Unlike CAUSAL_LM or SEQ_CLS task types, FEATURE_EXTRACTION does not add or modify any head layers -- it only injects adapter weights into the encoder. For bi-encoder training, the typical target modules are the attention projection layers:
target_modules = ["key", "query", "value"]
These correspond to the key, query, and value projection matrices in the transformer's self-attention mechanism. By adapting only these projections, the model learns to produce domain-specific embeddings while keeping the rest of the architecture frozen.
Key Concepts
- Bi-Encoder vs. Cross-Encoder: Bi-encoders encode queries and documents independently, enabling precomputation of document embeddings for efficient retrieval. Cross-encoders process query-document pairs jointly, so document representations cannot be precomputed, making them practical only for reranking a small candidate set.
- Mean Pooling: Aggregates variable-length token representations into a fixed-size vector. Alternative pooling strategies include [CLS] token pooling and max pooling, but mean pooling generally produces stronger sentence embeddings.
- L2 Normalization: Projects embeddings onto the unit sphere, ensuring that cosine similarity equals the dot product. This simplifies nearest-neighbor search and makes similarity scores more interpretable.
- Embedding Space Geometry: Contrastive training shapes the embedding space so that semantically similar items cluster together. The quality of this geometry directly determines retrieval performance.
- FEATURE_EXTRACTION Task Type: Signals to PEFT that the model is used for embedding generation, not classification or generation. This ensures adapter layers are injected without modifying the model's output structure.
Practical Implications
- Bi-encoder models with LoRA adapters enable efficient domain adaptation for semantic search without retraining the full model
- L2-normalized embeddings allow the use of optimized approximate nearest-neighbor libraries (FAISS, ScaNN) for production retrieval
- The cosine margin loss naturally handles varying degrees of relevance when labels encode relevance scores rather than binary decisions
- At inference time, query encoding is the only online computation -- document embeddings are precomputed and indexed offline
- Small LoRA ranks (e.g., r=8) are typically sufficient for embedding adaptation, as the pretrained model already encodes rich semantic structure
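The offline/online split described above can be sketched as a dot-product search over precomputed, L2-normalized document embeddings (the embedding matrix here is a toy stand-in for encoder output, and `search` is a hypothetical helper; a production system would use FAISS or ScaNN instead of dense matmul):

```python
import torch
import torch.nn.functional as F

# offline: document embeddings are encoded once, normalized, and indexed
doc_embeddings = F.normalize(torch.tensor([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]), dim=1)

def search(query_embedding, doc_embeddings, top_k=2):
    # cosine similarity over normalized vectors is a single matrix-vector product
    scores = doc_embeddings @ query_embedding
    return torch.topk(scores, k=top_k)

# online: only the query is encoded at request time
q = F.normalize(torch.tensor([0.9, 0.1, 0.0]), dim=0)
top = search(q, doc_embeddings)
# top.indices ranks document 0 first (closest direction), then document 2
```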