Workflow:Huggingface Peft LoRA Embedding Semantic Search

Knowledge Sources	Huggingface PEFT, PEFT Documentation, Transformers Docs
Domains	NLP, Fine_Tuning, Embeddings, Semantic_Search
Last Updated	2026-02-07 06:00 GMT

Overview

End-to-end process for fine-tuning a sentence embedding model for semantic search using LoRA with TaskType.FEATURE_EXTRACTION, training with cosine similarity loss and evaluating with ROC-AUC.

Description

This workflow applies LoRA adapters to a pre-trained encoder model to improve sentence embeddings for semantic search and retrieval tasks. The base encoder model (e.g., BERT, RoBERTa, or a sentence transformer) is wrapped in a custom embedding model that performs mean pooling over token embeddings and L2 normalization to produce fixed-size sentence vectors. LoRA adapters are applied to the attention layers using the FEATURE_EXTRACTION task type. The model is trained with a cosine similarity loss that learns to maximize similarity between matched pairs (e.g., a query and a relevant product) and minimize similarity between unmatched pairs. Training uses Accelerate for distributed support, and evaluation uses the ROC-AUC metric.

Usage

Execute this workflow when you need to adapt a pre-trained language model to produce better sentence embeddings for a domain-specific semantic search or retrieval task. This is common in e-commerce product search (matching queries to product titles), document retrieval, FAQ matching, or any scenario where you need to compute similarity between text pairs using vector representations.

Execution Steps

Step 1: Load Base Encoder and Build Embedding Model

Load a pre-trained encoder model and tokenizer from the Hugging Face Hub. Wrap the encoder in a custom sentence embedding model that applies mean pooling over the token-level outputs (weighted by attention mask) and L2 normalization to produce unit-length embedding vectors. This custom wrapper is necessary because the base model outputs token-level representations, not sentence-level embeddings.

Key considerations:

The custom wrapper performs mean pooling: average token embeddings weighted by attention mask
L2 normalization ensures all embeddings lie on the unit hypersphere
This enables cosine similarity to be computed as a simple dot product
The wrapper inherits from nn.Module and exposes the underlying model for PEFT injection
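The pooling and normalization described above can be sketched as follows. This is a minimal illustration, not the workflow's actual wrapper: the names `mean_pool` and `SentenceEmbedder` are hypothetical, and the encoder is assumed to return an object with a `last_hidden_state` attribute, as Transformers encoder models do.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over real tokens, ignoring padded positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden dim
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # guard against division by zero
    return summed / counts


class SentenceEmbedder(nn.Module):
    """Hypothetical wrapper: encoder -> mean pooling -> L2 normalization."""

    def __init__(self, encoder: nn.Module):
        super().__init__()
        # Kept as a plain attribute so adapters can be applied to the encoder itself
        self.encoder = encoder

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = mean_pool(outputs.last_hidden_state, attention_mask)
        # Unit-length rows: a dot product between two rows is their cosine similarity
        return F.normalize(pooled, p=2, dim=1)
```

Because the output rows have unit length, downstream code can score a pair with a plain dot product instead of a separate cosine computation.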

Step 2: Apply LoRA for Feature Extraction

Configure LoRA with TaskType.FEATURE_EXTRACTION and apply it to the encoder model using get_peft_model. Target the attention projection layers (e.g., key, query, and value for BERT). Register custom Accelerate save and load hooks for proper PEFT checkpoint handling during distributed training.

Key considerations:

TaskType.FEATURE_EXTRACTION tells PEFT this is an embedding/encoding task
Target modules should match the encoder architecture (e.g., key, query, value for BERT)
Custom save/load hooks ensure only adapter weights are checkpointed
The embedding wrapper's pooling and normalization layers are not modified by LoRA

Step 3: Prepare Paired Training Data

Load the semantic search dataset containing text pairs (e.g., queries and product titles) with relevance labels. Tokenize each text independently to produce separate input tensors for the anchor (query) and positive/negative (product) texts. Create DataLoaders with appropriate batching and shuffling.

Key considerations:

Each training example contains two texts (e.g., query and product title) and a relevance label
Texts are tokenized independently since they will be encoded separately
Prefix query and product texts with descriptive headers if using a bi-encoder approach
Balance positive and negative examples in each batch for stable training
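The paired batching can be sketched as below. Everything here is hypothetical scaffolding: `ToyTokenizer` is an offline stand-in that only mimics the call signature of a Hugging Face tokenizer, the example rows are made up, and the batch key names are assumptions. In a real run the tokenizer loaded alongside the encoder would be passed to `make_collate_fn` instead.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyTokenizer:
    """Offline stand-in: whitespace tokens, always pads and truncates."""

    def __init__(self):
        self.vocab = {"[PAD]": 0}

    def __call__(self, texts, padding=True, truncation=True, max_length=16, return_tensors="pt"):
        rows = []
        for text in texts:
            ids = [self.vocab.setdefault(tok, len(self.vocab)) for tok in text.lower().split()]
            rows.append(ids[:max_length])
        width = max(len(r) for r in rows)
        input_ids = torch.tensor([r + [0] * (width - len(r)) for r in rows])
        attention_mask = torch.tensor([[1] * len(r) + [0] * (width - len(r)) for r in rows])
        return {"input_ids": input_ids, "attention_mask": attention_mask}


class PairDataset(Dataset):
    """Each row is (query, product_title, relevance_label)."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        return self.rows[idx]


def make_collate_fn(tokenizer, max_length=16):
    def collate(batch):
        queries, products, labels = zip(*batch)
        # The two sides are tokenized independently because they are encoded separately
        q = tokenizer(list(queries), padding=True, truncation=True, max_length=max_length, return_tensors="pt")
        p = tokenizer(list(products), padding=True, truncation=True, max_length=max_length, return_tensors="pt")
        return {
            "query_input_ids": q["input_ids"],
            "query_attention_mask": q["attention_mask"],
            "product_input_ids": p["input_ids"],
            "product_attention_mask": p["attention_mask"],
            "labels": torch.tensor(labels, dtype=torch.float),
        }

    return collate


# Made-up example rows for illustration only
rows = [
    ("wireless mouse", "ergonomic wireless optical mouse", 1),
    ("wireless mouse", "stainless steel water bottle", 0),
    ("running shoes", "lightweight trail running shoes", 1),
    ("running shoes", "desk lamp with usb port", 0),
]
loader = DataLoader(PairDataset(rows), batch_size=2, shuffle=True, collate_fn=make_collate_fn(ToyTokenizer()))
batch = next(iter(loader))
```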

Step 4: Train with Cosine Similarity Loss

Run the training loop using Accelerate. For each batch, encode the anchor and positive/negative texts through the embedding model independently to get their embeddings. Compute cosine similarity between embedding pairs and apply a loss function that pushes matching pairs together and non-matching pairs apart. Use gradient accumulation to reach a larger effective batch size.

Key considerations:

Compute embeddings separately for each text in the pair (bi-encoder architecture)
Cosine similarity between unit-normalized vectors equals their dot product
The loss function penalizes low similarity for positive pairs and high similarity for negative pairs
Gradient accumulation simulates a larger effective batch size when memory limits the per-device batch, which can help stabilize training
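One possible formulation of such a loss, together with gradient accumulation, is sketched below. The function names are hypothetical, the squared-penalty form is one choice among several, and the loop uses plain PyTorch with a linear layer standing in for the LoRA-wrapped embedding model rather than Accelerate.

```python
import torch
import torch.nn.functional as F


def cosine_scores(query_embs: torch.Tensor, product_embs: torch.Tensor) -> torch.Tensor:
    # For unit-length rows, the row-wise dot product equals the cosine similarity
    return (query_embs * product_embs).sum(dim=1)


def pair_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Positive pairs (label 1) are penalized by (1 - score);
    # negative pairs (label 0) are penalized by max(score, 0).
    positive_term = labels * (1.0 - scores)
    negative_term = torch.clamp((1.0 - labels) * scores, min=0.0)
    return torch.mean(torch.square(positive_term + negative_term))


torch.manual_seed(0)
encoder = torch.nn.Linear(8, 4)  # stand-in for the embedding model (assumption)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-2)
accumulation_steps = 2

# Random stand-in features and alternating labels, for illustration only
batches = [
    (torch.randn(4, 8), torch.randn(4, 8), torch.tensor([1.0, 0.0, 1.0, 0.0]))
    for _ in range(4)
]

for step, (query_x, product_x, labels) in enumerate(batches, start=1):
    # Bi-encoder: each side of the pair is encoded independently
    query_embs = F.normalize(encoder(query_x), dim=1)
    product_embs = F.normalize(encoder(product_x), dim=1)
    # Scale so the accumulated gradient matches one larger batch
    loss = pair_loss(cosine_scores(query_embs, product_embs), labels) / accumulation_steps
    loss.backward()
    if step % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

With this formulation the loss is zero when positive pairs have similarity 1 and negative pairs have similarity at or below 0.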

Step 5: Evaluate with ROC-AUC

After each training epoch, evaluate the model on the validation set by computing embeddings for all text pairs, calculating cosine similarities, and measuring ROC-AUC (Area Under the Receiver Operating Characteristic Curve). ROC-AUC measures how well the model discriminates between relevant and irrelevant pairs across all similarity thresholds.

Key considerations:

ROC-AUC is threshold-independent and measures ranking quality
Gather predictions across distributed processes before computing the metric
Higher ROC-AUC indicates better discrimination between relevant and irrelevant pairs
Save the model checkpoint when the best validation ROC-AUC is achieved
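To make the metric concrete, the sketch below computes ROC-AUC directly from its pairwise definition: the fraction of (relevant, irrelevant) pairs in which the relevant pair receives the higher similarity, with ties counted as one half. The function name `roc_auc` is hypothetical, and this quadratic-time version is meant for illustration; a library routine such as scikit-learn's `roc_auc_score` is the more usual choice in practice.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outranks a negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels and cosine similarities for illustration
perfect = roc_auc([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3])
partial = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
```

In the first call every relevant pair outranks every irrelevant one, so the value is 1.0. In the second, three of the four (relevant, irrelevant) comparisons are ordered correctly, giving 0.75.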

Step 6: Save Adapter and Deploy

Save the trained LoRA adapter weights. At inference time, load the base encoder, apply the adapter, wrap in the embedding model, and use it to encode queries and documents into embeddings for similarity-based retrieval. The adapter can be pushed to the Hugging Face Hub for sharing.

Key considerations:

The adapter is a small checkpoint that adapts the encoder for the specific search domain
At inference, encode queries and documents into embeddings, then compute cosine similarity
Build a vector index (e.g., FAISS) over document embeddings for efficient retrieval
Multiple domain-specific adapters can be swapped for different search verticals
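The retrieval step at inference can be sketched as a brute-force search over normalized embeddings. The function name `top_k_documents` and the two-dimensional vectors are hypothetical stand-ins for embeddings produced by the adapted model. For large collections, a vector index such as FAISS would replace the exhaustive dot product.

```python
import torch
import torch.nn.functional as F


def top_k_documents(query_emb: torch.Tensor, doc_embs: torch.Tensor, k: int = 3):
    """Return (scores, indices) of the k documents most similar to the query."""
    # With unit-length vectors, the matrix-vector product gives cosine similarities
    scores = doc_embs @ query_emb
    k = min(k, doc_embs.shape[0])
    return torch.topk(scores, k=k)


# Made-up embeddings standing in for encoded documents and a query
doc_embs = F.normalize(torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]), dim=1)
query_emb = F.normalize(torch.tensor([0.9, 0.1]), dim=0)

scores, indices = top_k_documents(query_emb, doc_embs, k=2)
```

Here the query points mostly along the first axis, so document 0 ranks first and the diagonal document 2 ranks second.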
