Workflow:Princeton nlp SimPO Model Inference

Knowledge Sources	SimPO HuggingFace Transformers Pipelines
Domains	LLMs, Inference
Last Updated	2026-02-08 04:00 GMT

Overview

Simple inference pipeline for generating text responses from a pre-trained SimPO model using the HuggingFace transformers text-generation pipeline.

Description

This workflow demonstrates how to load and use a SimPO-trained model for text generation. It uses the HuggingFace transformers pipeline API to load the model with bfloat16 precision on GPU and generate responses to user prompts formatted as chat messages. The pipeline automatically applies the model's chat template and handles tokenization, generation, and decoding. This is the recommended approach for basic inference with any of the released SimPO model checkpoints.

Key features:

Single-script inference using HuggingFace pipeline API
Automatic chat template application
bfloat16 precision for memory-efficient inference
Compatible with all released SimPO model variants

Usage

Execute this workflow when you need to run inference with a trained SimPO model, either from the released checkpoints on HuggingFace Hub (e.g., princeton-nlp/gemma-2-9b-it-SimPO) or from a locally trained model. This is suitable for quick testing, demo generation, and integration into downstream applications. Requires a single GPU with enough VRAM for the model in bfloat16 (approximately 18GB for 9B parameter models).

Execution Steps

Step 1: Model Loading

Load the SimPO-trained model using the HuggingFace text-generation pipeline. The model is specified by its HuggingFace Hub identifier or local path. Loading uses bfloat16 precision to reduce memory usage and is placed on a CUDA device for GPU-accelerated inference.

Key considerations:

Specify the correct model identifier (Hub ID or local path)
Use torch.bfloat16 for memory-efficient loading
Ensure sufficient GPU VRAM for the model size
The pipeline automatically loads the associated tokenizer and chat template

Step 2: Prompt Formatting

Format the user input as an OpenAI-style chat message list with role and content fields. The pipeline's chat template handling will automatically convert this into the model-specific token format (e.g., Llama-3 or Gemma chat format) including appropriate special tokens and system messages.

Key considerations:

Use OpenAI message format: list of dicts with "role" and "content" keys
For Llama-3 models, ensure only one BOS token is present after template application
The chat template is loaded automatically from the model's tokenizer configuration

Step 3: Text Generation

Run the text-generation pipeline with the formatted prompt to produce the model's response. Generation parameters control the output quality and length. The pipeline returns the full conversation including the generated assistant response.

Key considerations:

Set do_sample=False for deterministic (greedy) generation, or True with temperature for sampling
Control output length with max_new_tokens
The pipeline returns the complete message history including the generated response
For evaluation, ensure generation parameters match the benchmark requirements

Execution Diagram

GitHub URL

Workflow Repository