Implementation:Lm sys FastChat Huggingface API Inference

Knowledge Sources	Lm_sys_FastChat
Domains	Inference, LLM, HuggingFace
Last Updated	2026-02-07 06:00 GMT

Overview

A simple HuggingFace-based inference script that demonstrates loading a FastChat model, building a conversation prompt, and running text generation.

Description

The huggingface_api module provides a self-contained inference pipeline using HuggingFace's generation APIs with FastChat's model loading and conversation templating utilities. The core main function is decorated with @torch.inference_mode() to disable gradient computation and reduce memory usage during generation.

The pipeline follows three stages. First, it loads the model and tokenizer using FastChat's load_model function, which supports multi-GPU distribution, 8-bit quantization, and CPU offloading. Second, it builds the conversation prompt by obtaining the appropriate conversation template for the loaded model via get_conversation_template, appending the user message and a None placeholder for the assistant response, and calling conv.get_prompt() to produce the formatted input string. Third, it tokenizes the prompt, runs model.generate() with configurable temperature, repetition penalty, and max token count, and decodes the output.
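The prompt-building stage can be illustrated without loading a model. The sketch below uses a hypothetical stand-in class, SketchConversation, which is not FastChat's actual conversation class; its separator and system text are assumptions chosen for illustration. It shows only the pattern described above: append the user message, append None for the assistant turn, then render the prompt so that it ends with an open assistant role.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SketchConversation:
    """Hypothetical stand-in for a conversation template (not FastChat's class)."""

    system: str
    roles: Tuple[str, str] = ("USER", "ASSISTANT")
    sep: str = " "  # assumed separator, for illustration only
    messages: List[Tuple[str, Optional[str]]] = field(default_factory=list)

    def append_message(self, role: str, message: Optional[str]) -> None:
        self.messages.append((role, message))

    def get_prompt(self) -> str:
        parts = [self.system]
        for role, message in self.messages:
            # A None message leaves the turn open so the model completes it.
            parts.append(f"{role}: {message}" if message is not None else f"{role}:")
        return self.sep.join(parts)


conv = SketchConversation(system="A chat between a user and an assistant.")
conv.append_message(conv.roles[0], "Hello! Who are you?")
conv.append_message(conv.roles[1], None)  # placeholder for the model's reply
prompt = conv.get_prompt()
print(prompt)
```

Real templates differ per model family in separators, system text, and role names, which is why the script asks FastChat for the template that matches the loaded model rather than hard-coding one.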

The module handles encoder-decoder models (like T5) and decoder-only models (like LLaMA/Vicuna) differently during output extraction. For decoder-only models, whose generated sequence begins with the prompt tokens, it strips the input tokens from the output by slicing output_ids[0][len(inputs["input_ids"][0]):]. For encoder-decoder models, it takes the full output directly. When run as a script, the module also raises the repetition penalty to 1.2 for T5-based models if it is still at the default of 1.0.
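The difference in output extraction can be shown with plain Python lists standing in for token-ID tensors. This is a minimal sketch under that assumption; extract_generated_ids and the token values are hypothetical names and data introduced for illustration, not part of the module.

```python
from typing import List


def extract_generated_ids(
    output_ids: List[int], input_ids: List[int], is_encoder_decoder: bool
) -> List[int]:
    """Return only the newly generated token IDs."""
    if is_encoder_decoder:
        # Encoder-decoder output contains only decoder tokens: keep all of it.
        return output_ids
    # Decoder-only output starts with the prompt: drop the first len(input_ids) tokens.
    return output_ids[len(input_ids):]


prompt_ids = [1, 15, 27, 9]
decoder_only_output = prompt_ids + [42, 43, 2]  # prompt echoed, then new tokens
seq2seq_output = [0, 42, 43, 2]                 # decoder tokens only

new_decoder_only = extract_generated_ids(decoder_only_output, prompt_ids, False)
new_seq2seq = extract_generated_ids(seq2seq_output, prompt_ids, True)
print(new_decoder_only)
print(new_seq2seq)
```

Skipping the slice for a decoder-only model would cause the decoded response to repeat the whole prompt before the answer.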

Usage

Use this module as a quick way to test FastChat-compatible models locally without starting the full serving infrastructure. It is ideal for validating that a model loads correctly, verifying conversation template formatting, and performing one-off generation tests. Run it via python3 -m fastchat.serve.huggingface_api --model lmsys/vicuna-7b-v1.5.

Code Reference

Source Location

Repository: Lm_sys_FastChat
File: fastchat/serve/huggingface_api.py
Lines: 1-73

Signature

@torch.inference_mode()
def main(args: argparse.Namespace) -> None:
    """Loads model, builds conversation prompt, and runs generation."""
    ...

Import

from fastchat.serve.huggingface_api import main

I/O Contract

Inputs

Name	Type	Required	Description
args.model_path	str	Yes	Path or HuggingFace model ID for the model to load (e.g., lmsys/vicuna-7b-v1.5)
args.message	str	No	User message to send to the model, defaults to "Hello! Who are you?"
args.temperature	float	No	Sampling temperature for generation, defaults to 0.7. Values <= 1e-5 trigger greedy decoding
args.repetition_penalty	float	No	Repetition penalty factor, defaults to 1.0 (the command-line entry point raises it to 1.2 for T5 models)
args.max_new_tokens	int	No	Maximum number of new tokens to generate, defaults to 1024
args.device	str	No	Device for model placement (e.g., "cuda", "cpu")
args.num_gpus	int	No	Number of GPUs for model distribution
args.max_gpu_memory	str	No	Maximum GPU memory allocation per device
args.load_8bit	bool	No	Whether to load model in 8-bit quantization
args.cpu_offloading	bool	No	Whether to offload layers to CPU
args.debug	bool	No	Enable debug logging
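The way the sampling inputs above translate into generation settings can be sketched as a small pure function. resolve_generation_kwargs is a hypothetical helper, and detecting T5 models with a substring check on the model path is an assumption of this sketch; only the 1e-5 greedy threshold, the defaults, and the 1.2 penalty for T5 come from the table.

```python
from typing import Dict, Union

GREEDY_THRESHOLD = 1e-5  # temperatures at or below this select greedy decoding


def resolve_generation_kwargs(
    model_path: str,
    temperature: float = 0.7,
    repetition_penalty: float = 1.0,
    max_new_tokens: int = 1024,
) -> Dict[str, Union[bool, float, int]]:
    """Derive generation keyword arguments from the documented inputs."""
    # Assumption: a T5-based model is recognised by "t5" appearing in its path.
    if "t5" in model_path and repetition_penalty == 1.0:
        repetition_penalty = 1.2
    return {
        "do_sample": temperature > GREEDY_THRESHOLD,
        "temperature": temperature,
        "repetition_penalty": repetition_penalty,
        "max_new_tokens": max_new_tokens,
    }


vicuna_kwargs = resolve_generation_kwargs("lmsys/vicuna-7b-v1.5")
t5_kwargs = resolve_generation_kwargs("lmsys/fastchat-t5-3b-v1.0")
greedy_kwargs = resolve_generation_kwargs("lmsys/vicuna-7b-v1.5", temperature=0.0)
print(vicuna_kwargs, t5_kwargs, greedy_kwargs, sep="\n")
```

An explicitly chosen penalty other than 1.0 is left untouched in this sketch, so only the unchanged default is replaced for T5 models.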

Outputs

Name	Type	Description
stdout	str	Prints the user message prefixed with the user role and the generated response prefixed with the assistant role

Usage Examples

# Command-line usage with Vicuna
# python3 -m fastchat.serve.huggingface_api --model lmsys/vicuna-7b-v1.5

# Command-line usage with T5
# python3 -m fastchat.serve.huggingface_api --model lmsys/fastchat-t5-3b-v1.0

# Custom message with temperature
# python3 -m fastchat.serve.huggingface_api \
#     --model lmsys/vicuna-7b-v1.5 \
#     --message "Explain quantum computing in simple terms" \
#     --temperature 0.3 \
#     --max-new-tokens 512

# Programmatic usage
import argparse
from fastchat.serve.huggingface_api import main

args = argparse.Namespace(
    model_path="lmsys/vicuna-7b-v1.5",
    device="cuda",
    num_gpus=1,
    max_gpu_memory=None,
    load_8bit=False,
    cpu_offloading=False,
    revision="main",
    debug=False,
    message="What is the capital of France?",
    temperature=0.7,
    repetition_penalty=1.0,
    max_new_tokens=512,
)
main(args)
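Because main reads every field from args, a programmatic caller has to supply the full namespace. One way to avoid repeating it is a small factory, sketched here. make_args and DEFAULT_ARGS are hypothetical names; the default values are copied from the inputs table and the example above, not verified against the command-line parser.

```python
import argparse

# Defaults taken from the inputs table and the programmatic example (assumed).
DEFAULT_ARGS = {
    "device": "cuda",
    "num_gpus": 1,
    "max_gpu_memory": None,
    "load_8bit": False,
    "cpu_offloading": False,
    "revision": "main",
    "debug": False,
    "message": "Hello! Who are you?",
    "temperature": 0.7,
    "repetition_penalty": 1.0,
    "max_new_tokens": 1024,
}


def make_args(model_path: str, **overrides) -> argparse.Namespace:
    """Build a complete Namespace, rejecting misspelled field names early."""
    unknown = set(overrides) - set(DEFAULT_ARGS)
    if unknown:
        raise TypeError(f"Unknown argument(s): {sorted(unknown)}")
    return argparse.Namespace(model_path=model_path, **{**DEFAULT_ARGS, **overrides})


args = make_args("lmsys/vicuna-7b-v1.5", temperature=0.3, max_new_tokens=512)
print(args)
```

The resulting args object can then be passed to main(args) in place of the hand-written namespace.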

Related Pages

Implements: Principle:Lm_sys_FastChat_HuggingFace_Pipeline_Inference
Lm_sys_FastChat_Apply_Delta_Weights - Reconstructs model weights needed before inference
Lm_sys_FastChat_Remote_Logger - Logging infrastructure used by the serving components
Lm_sys_FastChat_Condense_Rotary_Embedding - Context extension that can be applied before loading the model
