Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Princeton nlp SimPO Model Inference

From Leeroopedia


Knowledge Sources
Domains LLMs, Inference
Last Updated 2026-02-08 04:00 GMT

Overview

Simple inference pipeline for generating text responses from a pre-trained SimPO model using the HuggingFace transformers text-generation pipeline.

Description

This workflow demonstrates how to load and use a SimPO-trained model for text generation. It uses the HuggingFace transformers pipeline API to load the model with bfloat16 precision on GPU and generate responses to user prompts formatted as chat messages. The pipeline automatically applies the model's chat template and handles tokenization, generation, and decoding. This is the recommended approach for basic inference with any of the released SimPO model checkpoints.

Key features:

  • Single-script inference using HuggingFace pipeline API
  • Automatic chat template application
  • bfloat16 precision for memory-efficient inference
  • Compatible with all released SimPO model variants

Usage

Execute this workflow when you need to run inference with a trained SimPO model, either from the released checkpoints on HuggingFace Hub (e.g., princeton-nlp/gemma-2-9b-it-SimPO) or from a locally trained model. This is suitable for quick testing, demo generation, and integration into downstream applications. Requires a single GPU with enough VRAM for the model in bfloat16 (approximately 18GB for 9B parameter models).

Execution Steps

Step 1: Model Loading

Load the SimPO-trained model using the HuggingFace text-generation pipeline. The model is specified by its HuggingFace Hub identifier or local path. Loading uses bfloat16 precision to reduce memory usage and is placed on a CUDA device for GPU-accelerated inference.

Key considerations:

  • Specify the correct model identifier (Hub ID or local path)
  • Use torch.bfloat16 for memory-efficient loading
  • Ensure sufficient GPU VRAM for the model size
  • The pipeline automatically loads the associated tokenizer and chat template

Step 2: Prompt Formatting

Format the user input as an OpenAI-style chat message list with role and content fields. The pipeline's chat template handling will automatically convert this into the model-specific token format (e.g., Llama-3 or Gemma chat format) including appropriate special tokens and system messages.

Key considerations:

  • Use OpenAI message format: list of dicts with "role" and "content" keys
  • For Llama-3 models, ensure only one BOS token is present after template application
  • The chat template is loaded automatically from the model's tokenizer configuration

Step 3: Text Generation

Run the text-generation pipeline with the formatted prompt to produce the model's response. Generation parameters control the output quality and length. The pipeline returns the full conversation including the generated assistant response.

Key considerations:

  • Set do_sample=False for deterministic (greedy) generation, or True with temperature for sampling
  • Control output length with max_new_tokens
  • The pipeline returns the complete message history including the generated response
  • For evaluation, ensure generation parameters match the benchmark requirements

Execution Diagram

GitHub URL

Workflow Repository