
Implementation:Lm sys FastChat Huggingface API Inference

From Leeroopedia
Revision as of 15:34, 16 February 2026 by Admin (auto-imported from implementations/Lm_sys_FastChat_Huggingface_API_Inference.md)


Knowledge Sources
Domains: Inference, LLM, HuggingFace
Last Updated: 2026-02-07 06:00 GMT

Overview

A simple HuggingFace-based inference script that loads a FastChat model, builds a conversation prompt, and runs text generation.

Description

The huggingface_api module provides a self-contained inference pipeline using HuggingFace's generation APIs with FastChat's model loading and conversation templating utilities. The core main function is decorated with @torch.inference_mode() to disable gradient computation and reduce memory usage during generation.

The pipeline follows three stages. First, it loads the model and tokenizer using FastChat's load_model function, which supports multi-GPU distribution, 8-bit quantization, and CPU offloading. Second, it builds the conversation prompt by obtaining the appropriate conversation template for the loaded model via get_conversation_template, appending the user message and a None placeholder for the assistant response, and calling conv.get_prompt() to produce the formatted input string. Third, it tokenizes the prompt, runs model.generate() with configurable temperature, repetition penalty, and max token count, and decodes the output.
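The prompt-building stage can be illustrated without loading a model. The sketch below uses a hypothetical `MiniConversation` class, which is not FastChat's actual conversation implementation; the system text, role names, and separator are assumptions chosen for illustration. It shows why a None placeholder is appended for the assistant: the prompt then ends with the bare assistant role tag, so the model continues from that point.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class MiniConversation:
    """Hypothetical stand-in for a conversation template (not FastChat's class)."""

    system: str = "A chat between a user and an assistant."  # assumed text
    roles: Tuple[str, str] = ("USER", "ASSISTANT")  # assumed role names
    sep: str = " "  # assumed separator
    messages: List[Tuple[str, Optional[str]]] = field(default_factory=list)

    def append_message(self, role: str, message: Optional[str]) -> None:
        self.messages.append((role, message))

    def get_prompt(self) -> str:
        parts = [self.system]
        for role, message in self.messages:
            if message is None:
                # Open slot: emit only the role tag so generation starts here.
                parts.append(f"{role}:")
            else:
                parts.append(f"{role}: {message}")
        return self.sep.join(parts)


conv = MiniConversation()
conv.append_message(conv.roles[0], "Hello! Who are you?")
conv.append_message(conv.roles[1], None)  # placeholder for the reply
prompt = conv.get_prompt()
print(prompt)
```

Real templates differ per model family in their system text and separators, which is why the module looks the template up from the model path rather than hard-coding one.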

The module extracts the generated tokens differently for encoder-decoder models (such as T5) and decoder-only models (such as LLaMA/Vicuna). A decoder-only model returns the prompt tokens followed by the newly generated tokens, so the module strips the prompt by slicing output_ids[0][len(inputs["input_ids"][0]):]. An encoder-decoder model returns only decoder-side tokens, so the module uses the full output unchanged. The module also raises the repetition penalty to 1.2 for T5-based models when it is left at the default of 1.0.
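The output-extraction rule can be sketched with plain Python lists standing in for token tensors. The helper name `extract_generated_ids` and the token IDs are made up for illustration; only the slicing logic follows the description above.

```python
from typing import List


def extract_generated_ids(
    output_ids: List[int], input_ids: List[int], is_encoder_decoder: bool
) -> List[int]:
    """Return only the newly generated token IDs for one sequence."""
    if is_encoder_decoder:
        # The decoder output holds only generated tokens; keep all of it.
        return output_ids
    # Decoder-only output echoes the prompt first; drop that prefix.
    return output_ids[len(input_ids):]


prompt_ids = [1, 15043, 29991]  # made-up prompt token IDs
decoder_only_out = prompt_ids + [306, 626, 2]  # prompt echoed, then new tokens
seq2seq_out = [0, 27, 183, 1]  # made-up decoder-side tokens only

new_decoder_only = extract_generated_ids(decoder_only_out, prompt_ids, False)
new_seq2seq = extract_generated_ids(seq2seq_out, prompt_ids, True)
print(new_decoder_only)
print(new_seq2seq)
```

Skipping the slice for a decoder-only model would make the decoded response repeat the whole prompt before the answer.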

Usage

Use this module as a quick way to test FastChat-compatible models locally without starting the full serving infrastructure. It is ideal for validating that a model loads correctly, verifying conversation template formatting, and performing one-off generation tests. Run it via python3 -m fastchat.serve.huggingface_api --model lmsys/vicuna-7b-v1.5.

Code Reference

Source Location

Signature

@torch.inference_mode()
def main(args: argparse.Namespace) -> None:
    """Loads model, builds conversation prompt, and runs generation."""
    ...

Import

from fastchat.serve.huggingface_api import main

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| args.model_path | str | Yes | Path or HuggingFace model ID for the model to load (e.g., lmsys/vicuna-7b-v1.5) |
| args.message | str | No | User message to send to the model, defaults to "Hello! Who are you?" |
| args.temperature | float | No | Sampling temperature for generation, defaults to 0.7. Values <= 1e-5 trigger greedy decoding |
| args.repetition_penalty | float | No | Repetition penalty factor, defaults to 1.0 (1.2 for T5 models) |
| args.max_new_tokens | int | No | Maximum number of new tokens to generate, defaults to 1024 |
| args.device | str | No | Device for model placement (e.g., "cuda", "cpu") |
| args.num_gpus | int | No | Number of GPUs for model distribution |
| args.max_gpu_memory | str | No | Maximum GPU memory allocation per device |
| args.load_8bit | bool | No | Whether to load model in 8-bit quantization |
| args.cpu_offloading | bool | No | Whether to offload layers to CPU |
| args.debug | bool | No | Enable debug logging |
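The way the sampling-related inputs interact can be sketched as a small function. `build_generation_kwargs` is a hypothetical helper, not part of FastChat, and detecting a T5 model by the substring "t5" in the model path is an assumption. The keyword names are those accepted by HuggingFace's `generate()`.

```python
import argparse
from typing import Any, Dict


def build_generation_kwargs(args: argparse.Namespace) -> Dict[str, Any]:
    """Map parsed arguments to generate()-style keyword arguments."""
    penalty = args.repetition_penalty
    # Assumption: a T5 model is detected by "t5" appearing in the model path.
    if "t5" in args.model_path and penalty == 1.0:
        penalty = 1.2
    return {
        # Sample only when the temperature is above the greedy threshold.
        "do_sample": args.temperature > 1e-5,
        "temperature": args.temperature,
        "repetition_penalty": penalty,
        "max_new_tokens": args.max_new_tokens,
    }


vicuna = argparse.Namespace(
    model_path="lmsys/vicuna-7b-v1.5",
    temperature=0.0,  # at or below 1e-5, so greedy decoding
    repetition_penalty=1.0,
    max_new_tokens=1024,
)
t5 = argparse.Namespace(
    model_path="lmsys/fastchat-t5-3b-v1.0",
    temperature=0.7,
    repetition_penalty=1.0,  # left at the default, so it is raised to 1.2
    max_new_tokens=1024,
)
vicuna_kwargs = build_generation_kwargs(vicuna)
t5_kwargs = build_generation_kwargs(t5)
print(vicuna_kwargs)
print(t5_kwargs)
```

An explicitly chosen penalty other than 1.0 is left untouched, so the T5 adjustment only replaces the default.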

Outputs

| Name | Type | Description |
|---|---|---|
| stdout | str | Prints the user message prefixed with the user role and the generated response prefixed with the assistant role |

Usage Examples

# Command-line usage with Vicuna
# python3 -m fastchat.serve.huggingface_api --model lmsys/vicuna-7b-v1.5

# Command-line usage with T5
# python3 -m fastchat.serve.huggingface_api --model lmsys/fastchat-t5-3b-v1.0

# Custom message with temperature
# python3 -m fastchat.serve.huggingface_api \
#     --model lmsys/vicuna-7b-v1.5 \
#     --message "Explain quantum computing in simple terms" \
#     --temperature 0.3 \
#     --max-new-tokens 512

# Programmatic usage
import argparse
from fastchat.serve.huggingface_api import main

args = argparse.Namespace(
    model_path="lmsys/vicuna-7b-v1.5",
    device="cuda",
    num_gpus=1,
    max_gpu_memory=None,
    load_8bit=False,
    cpu_offloading=False,
    revision="main",
    debug=False,
    message="What is the capital of France?",
    temperature=0.7,
    repetition_penalty=1.0,
    max_new_tokens=512,
)
main(args)
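Because main reads its settings as attributes of args, a mistyped field name in a hand-built Namespace is easy to miss. One possible guard is a small builder that starts from a known set of fields and rejects unknown overrides. `make_args` and `EXAMPLE_ARGS` are hypothetical names; the values mirror the programmatic example above and are not confirmed CLI defaults.

```python
import argparse
from typing import Any, Dict

# Values copied from the programmatic example; not confirmed CLI defaults.
EXAMPLE_ARGS: Dict[str, Any] = {
    "model_path": "lmsys/vicuna-7b-v1.5",
    "device": "cuda",
    "num_gpus": 1,
    "max_gpu_memory": None,
    "load_8bit": False,
    "cpu_offloading": False,
    "revision": "main",
    "debug": False,
    "message": "What is the capital of France?",
    "temperature": 0.7,
    "repetition_penalty": 1.0,
    "max_new_tokens": 512,
}


def make_args(**overrides: Any) -> argparse.Namespace:
    """Build a Namespace from the example values plus checked overrides."""
    unknown = set(overrides) - set(EXAMPLE_ARGS)
    if unknown:
        # Catch typos such as max_new_token before any model is loaded.
        raise ValueError(f"Unknown argument(s): {sorted(unknown)}")
    return argparse.Namespace(**{**EXAMPLE_ARGS, **overrides})


args = make_args(message="Summarize FastChat in one sentence.", temperature=0.3)
print(args.message, args.temperature)
```

The resulting namespace can then be passed to main(args) in the same way as the hand-written one.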

Related Pages
