Workflow:Princeton nlp SimPO Model Inference
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Inference |
| Last Updated | 2026-02-08 04:00 GMT |
Overview
Simple inference pipeline for generating text responses from a pre-trained SimPO model using the HuggingFace transformers text-generation pipeline.
Description
This workflow demonstrates how to load and use a SimPO-trained model for text generation. It uses the HuggingFace transformers pipeline API to load the model with bfloat16 precision on GPU and generate responses to user prompts formatted as chat messages. The pipeline automatically applies the model's chat template and handles tokenization, generation, and decoding. This is the recommended approach for basic inference with any of the released SimPO model checkpoints.
Key features:
- Single-script inference using HuggingFace pipeline API
- Automatic chat template application
- bfloat16 precision for memory-efficient inference
- Compatible with all released SimPO model variants
Usage
Execute this workflow when you need to run inference with a trained SimPO model, either from the released checkpoints on HuggingFace Hub (e.g., princeton-nlp/gemma-2-9b-it-SimPO) or from a locally trained model. This is suitable for quick testing, demo generation, and integration into downstream applications. Requires a single GPU with enough VRAM for the model in bfloat16 (approximately 18GB for 9B parameter models).
Execution Steps
Step 1: Model Loading
Load the SimPO-trained model using the HuggingFace text-generation pipeline. The model is specified by its HuggingFace Hub identifier or local path. Loading uses bfloat16 precision to reduce memory usage and is placed on a CUDA device for GPU-accelerated inference.
Key considerations:
- Specify the correct model identifier (Hub ID or local path)
- Use torch.bfloat16 for memory-efficient loading
- Ensure sufficient GPU VRAM for the model size
- The pipeline automatically loads the associated tokenizer and chat template
Step 2: Prompt Formatting
Format the user input as an OpenAI-style chat message list with role and content fields. The pipeline's chat template handling will automatically convert this into the model-specific token format (e.g., Llama-3 or Gemma chat format) including appropriate special tokens and system messages.
Key considerations:
- Use OpenAI message format: list of dicts with "role" and "content" keys
- For Llama-3 models, ensure only one BOS token is present after template application
- The chat template is loaded automatically from the model's tokenizer configuration
Step 3: Text Generation
Run the text-generation pipeline with the formatted prompt to produce the model's response. Generation parameters control the output quality and length. The pipeline returns the full conversation including the generated assistant response.
Key considerations:
- Set do_sample=False for deterministic (greedy) generation, or True with temperature for sampling
- Control output length with max_new_tokens
- The pipeline returns the complete message history including the generated response
- For evaluation, ensure generation parameters match the benchmark requirements