Implementation:Lm sys FastChat Huggingface API Inference
| Knowledge Sources | |
|---|---|
| Domains | Inference, LLM, HuggingFace |
| Last Updated | 2026-02-07 06:00 GMT |
Overview
A simple HuggingFace-based inference script that demonstrates how to load a FastChat model, build a conversation prompt, and run text generation.
Description
The huggingface_api module provides a self-contained inference pipeline using HuggingFace's generation APIs with FastChat's model loading and conversation templating utilities. The core main function is decorated with @torch.inference_mode() to disable gradient computation and reduce memory usage during generation.
The pipeline follows three stages. First, it loads the model and tokenizer using FastChat's load_model function, which supports multi-GPU distribution, 8-bit quantization, and CPU offloading. Second, it builds the conversation prompt: it obtains the conversation template for the loaded model via get_conversation_template, appends the user message followed by a None placeholder for the assistant response, and calls conv.get_prompt() to produce the formatted input string. Third, it tokenizes the prompt, runs model.generate() with a configurable temperature, repetition penalty, and maximum number of new tokens, and decodes the output. Sampling is used only when the temperature is above 1e-5; at or below that threshold, generation is greedy.
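The prompt-building stage can be illustrated without loading a model. The sketch below uses a hypothetical `MiniConversation` class, which is not FastChat's actual conversation implementation, and assumes a Vicuna-like layout with `USER`/`ASSISTANT` roles and a single-space separator. Real templates differ per model; the only point here is how the None placeholder leaves the assistant turn open.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class MiniConversation:
    """Hypothetical stand-in for a conversation template (not FastChat's class)."""

    system: str = "A chat between a user and an assistant."
    roles: Tuple[str, str] = ("USER", "ASSISTANT")
    sep: str = " "
    messages: List[Tuple[str, Optional[str]]] = field(default_factory=list)

    def append_message(self, role: str, message: Optional[str]) -> None:
        # A message of None marks the slot the model is expected to fill.
        self.messages.append((role, message))

    def get_prompt(self) -> str:
        parts = [self.system]
        for role, message in self.messages:
            if message is None:
                # Leave the turn open so generation continues from here.
                parts.append(f"{role}:")
            else:
                parts.append(f"{role}: {message}")
        return self.sep.join(parts)


conv = MiniConversation()
conv.append_message(conv.roles[0], "Hello! Who are you?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
print(prompt)
```

Because the prompt ends with the bare assistant role, the text the model generates next is read as the assistant's reply.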
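The module treats encoder-decoder models (such as T5) and decoder-only models (such as LLaMA/Vicuna) differently during output extraction. A decoder-only model returns the prompt tokens followed by the newly generated tokens, so the module strips the prompt by slicing output_ids[0][len(inputs["input_ids"][0]):]. An encoder-decoder model returns only the decoder sequence, so the module takes the full output directly. When run from the command line, the module also raises the repetition penalty to 1.2 for T5-based models if it is still at the default of 1.0. This adjustment is made in the script's entry point before main is called, so code that calls main directly should set repetition_penalty itself.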
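The two rules described above can be sketched with plain Python lists standing in for token tensors. The helper names `extract_generated_ids` and `resolve_repetition_penalty` are hypothetical and do not exist in FastChat, the token ids are arbitrary, and detecting a T5 model by the substring "t5" in its path is an assumption made for illustration.

```python
from typing import List, Sequence


def extract_generated_ids(
    output_ids: Sequence[int],
    input_ids: Sequence[int],
    is_encoder_decoder: bool,
) -> List[int]:
    """Return only the token ids that belong to the model's reply."""
    if is_encoder_decoder:
        # Encoder-decoder output holds just the decoder sequence.
        return list(output_ids)
    # Decoder-only output echoes the prompt first, so drop that prefix.
    return list(output_ids[len(input_ids):])


def resolve_repetition_penalty(model_path: str, repetition_penalty: float) -> float:
    """Raise an untouched default of 1.0 to 1.2 when the path looks like T5."""
    # Assumption: a T5 model is recognized by "t5" appearing in its path.
    if "t5" in model_path and repetition_penalty == 1.0:
        return 1.2
    return repetition_penalty


prompt_ids = [1, 15043, 29991]
reply_ids = [306, 626, 2]

decoder_only = extract_generated_ids(prompt_ids + reply_ids, prompt_ids, False)
encoder_decoder = extract_generated_ids(reply_ids, prompt_ids, True)
print(decoder_only, encoder_decoder)

print(resolve_repetition_penalty("lmsys/fastchat-t5-3b-v1.0", 1.0))
print(resolve_repetition_penalty("lmsys/vicuna-7b-v1.5", 1.0))
```

An explicitly chosen penalty other than 1.0 is left unchanged by `resolve_repetition_penalty`, even for a T5 path.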
Usage
Use this module as a quick way to test FastChat-compatible models locally without starting the full serving infrastructure. It is ideal for validating that a model loads correctly, verifying conversation template formatting, and performing one-off generation tests. Run it via python3 -m fastchat.serve.huggingface_api --model lmsys/vicuna-7b-v1.5.
Code Reference
Source Location
- Repository: Lm_sys_FastChat
- File: fastchat/serve/huggingface_api.py
- Lines: 1-73
Signature
@torch.inference_mode()
def main(args: argparse.Namespace) -> None:
    """Loads model, builds conversation prompt, and runs generation."""
    ...
Import
from fastchat.serve.huggingface_api import main
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| args.model_path | str | Yes | Path or HuggingFace model ID for the model to load (e.g., lmsys/vicuna-7b-v1.5) |
| args.message | str | No | User message to send to the model, defaults to "Hello! Who are you?" |
| args.temperature | float | No | Sampling temperature for generation, defaults to 0.7. Values <= 1e-5 trigger greedy decoding |
| args.repetition_penalty | float | No | Repetition penalty factor, defaults to 1.0 (1.2 for T5 models) |
| args.max_new_tokens | int | No | Maximum number of new tokens to generate, defaults to 1024 |
| args.device | str | No | Device for model placement (e.g., "cuda", "cpu") |
| args.num_gpus | int | No | Number of GPUs for model distribution |
| args.max_gpu_memory | str | No | Maximum GPU memory allocation per device |
| args.load_8bit | bool | No | Whether to load model in 8-bit quantization |
| args.cpu_offloading | bool | No | Whether to offload layers to CPU |
| args.revision | str | No | Model revision to load (e.g., "main") |
| args.debug | bool | No | Enable debug logging |
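As a rough illustration of how the generation-related inputs might be turned into keyword arguments for model.generate(), the sketch below builds a namespace from the defaults listed in the table and derives the sampling flag from the temperature. The helpers `default_generation_args` and `build_generate_kwargs` are hypothetical names introduced here, not part of FastChat; only the keyword names `do_sample`, `temperature`, `repetition_penalty`, and `max_new_tokens` are taken from HuggingFace's generation API.

```python
import argparse
from typing import Any, Dict


def default_generation_args() -> argparse.Namespace:
    """Namespace holding the generation defaults listed in the inputs table."""
    return argparse.Namespace(
        message="Hello! Who are you?",
        temperature=0.7,
        repetition_penalty=1.0,
        max_new_tokens=1024,
    )


def build_generate_kwargs(args: argparse.Namespace) -> Dict[str, Any]:
    """Translate the namespace into keyword arguments for model.generate()."""
    return {
        # Temperatures at or below 1e-5 fall back to greedy decoding.
        "do_sample": args.temperature > 1e-5,
        "temperature": args.temperature,
        "repetition_penalty": args.repetition_penalty,
        "max_new_tokens": args.max_new_tokens,
    }


sampled = build_generate_kwargs(default_generation_args())

greedy_args = default_generation_args()
greedy_args.temperature = 0.0
greedy = build_generate_kwargs(greedy_args)

print(sampled["do_sample"], greedy["do_sample"])
```

Deriving `do_sample` from the temperature keeps a single user-facing knob: lowering the temperature to zero switches to deterministic output without a separate flag.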
Outputs
| Name | Type | Description |
|---|---|---|
| stdout | str | Prints the user message prefixed with the user role and the generated response prefixed with the assistant role |
Usage Examples
# Command-line usage with Vicuna
# python3 -m fastchat.serve.huggingface_api --model lmsys/vicuna-7b-v1.5
# Command-line usage with T5
# python3 -m fastchat.serve.huggingface_api --model lmsys/fastchat-t5-3b-v1.0
# Custom message with temperature
# python3 -m fastchat.serve.huggingface_api \
# --model lmsys/vicuna-7b-v1.5 \
# --message "Explain quantum computing in simple terms" \
# --temperature 0.3 \
# --max-new-tokens 512
# Programmatic usage
import argparse
from fastchat.serve.huggingface_api import main
args = argparse.Namespace(
    model_path="lmsys/vicuna-7b-v1.5",
    device="cuda",
    num_gpus=1,
    max_gpu_memory=None,
    load_8bit=False,
    cpu_offloading=False,
    revision="main",
    debug=False,
    message="What is the capital of France?",
    temperature=0.7,
    repetition_penalty=1.0,
    max_new_tokens=512,
)
main(args)
Related Pages
- Implements: Principle:Lm_sys_FastChat_HuggingFace_Pipeline_Inference
- Lm_sys_FastChat_Apply_Delta_Weights - Reconstructs model weights needed before inference
- Lm_sys_FastChat_Remote_Logger - Logging infrastructure used by the serving components
- Lm_sys_FastChat_Condense_Rotary_Embedding - Context extension that can be applied before loading the model