Implementation: LLMBook-zh (llmbook-zh.github.io) / vLLM LLM.generate

From Leeroopedia


Knowledge Sources
Domains NLP, Inference, Systems
Last Updated 2026-02-08 00:00 GMT

Overview

Concrete tool for high-throughput LLM inference using vLLM's batch generation engine, as used in the LLMBook repository.

Description

vllm.LLM initializes the inference engine, and LLM.generate performs batch generation with configurable SamplingParams. The repository demonstrates using vLLM with LLaMA-2 Chat format prompts, greedy decoding (temperature=0), and a 2048-token maximum output.
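The "LLaMA-2 Chat format" the description refers to is a prompt-string convention, not something LLM.generate applies for you: user turns are wrapped in [INST] ... [/INST], with an optional <<SYS>> block for a system message. A minimal helper sketching that format (hypothetical, not from the repository):

```python
def llama2_chat_prompt(user_msg: str, system_msg: str = "") -> str:
    """Wrap a single user message in the LLaMA-2 Chat prompt format.

    The [INST] ... [/INST] markers and the <<SYS>> block are part of the
    LLaMA-2 chat template; plain LLM.generate on raw strings does not add
    them, so prompts must be pre-formatted like this.
    """
    if system_msg:
        return f"[INST] <<SYS>>\n{system_msg}\n<</SYS>>\n\n{user_msg} [/INST]"
    return f"[INST] {user_msg} [/INST]"
```

Prompts built this way (e.g. `llama2_chat_prompt("How are you?")`) match the form used in the Usage Examples section below.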

This is a Wrapper Doc documenting how the LLMBook repository uses the vLLM library.

Usage

Use this for batch inference with LLMs when you need high throughput and efficient memory management.

Code Reference

Source Location

  • Repository: LLMBook-zh
  • File: code/9.1 vLLM实践.py
  • Lines: 1-24

Signature

# Initialize vLLM engine
model = vllm.LLM(model: str)

# Configure sampling
sampling_params = vllm.SamplingParams(
    temperature: float = 0,
    max_tokens: int = 2048,
    presence_penalty: float = 0,
    frequency_penalty: float = 0,
)

# Generate
outputs = model.generate(
    prompts: list[str],
    sampling_params: SamplingParams
) -> list[RequestOutput]
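Setting temperature=0, as in the signature above, selects greedy decoding: at each step the highest-probability token is taken deterministically, rather than sampling from the temperature-scaled distribution. A pure-Python illustration of temperature scaling (not vLLM code, just the underlying idea):

```python
import math

def scale_logits(logits: list[float], temperature: float) -> list[float]:
    """Return next-token probabilities after temperature scaling.

    Logits are divided by the temperature before the softmax; as the
    temperature approaches 0 the distribution collapses onto the argmax,
    which is exactly greedy decoding.
    """
    if temperature == 0:
        # Greedy: all probability mass on the highest-logit token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With temperature=0 every run over the same prompts yields identical outputs, which is why the repository's evaluation-style usage favors it.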

Import

import vllm

I/O Contract

Inputs

  • model (str, required): HuggingFace model ID (e.g., "meta-llama/Llama-2-7b-chat-hf")
  • prompts (list[str], required): List of prompt strings
  • sampling_params (SamplingParams, required): Decoding configuration

Outputs

  • outputs (list[RequestOutput]): Generated outputs; per-prompt text is accessed via output.outputs[0].text
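The output contract above can be made concrete with illustrative stand-ins that mirror the two attribute names it relies on (each RequestOutput carries a list of completions, one per requested sample, so with the default of one sample the text sits at outputs[0].text). These dataclasses are simplified sketches, not vLLM's real classes:

```python
from dataclasses import dataclass, field

@dataclass
class CompletionOutput:
    # Simplified stand-in: real vLLM completions carry more fields.
    text: str

@dataclass
class RequestOutput:
    # One entry in the list returned by LLM.generate, per input prompt.
    prompt: str
    outputs: list[CompletionOutput] = field(default_factory=list)

def first_texts(results: list[RequestOutput]) -> list[str]:
    # Extract the single generated text per prompt, matching the
    # output.outputs[0].text access pattern documented above.
    return [r.outputs[0].text for r in results]
```

This is the traversal the Usage Examples section performs inline with a zip over prompts and outputs.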

Usage Examples

import vllm

# Initialize
model = vllm.LLM(model="meta-llama/Llama-2-7b-chat-hf")

# Configure greedy decoding
sampling_params = vllm.SamplingParams(
    temperature=0,
    max_tokens=2048,
)

# Generate
prompts = [
    "[INST] How are you? [/INST]",
    "[INST] 1 + 1 = ? [/INST]",
]
outputs = model.generate(prompts, sampling_params=sampling_params)

for prompt, output in zip(prompts, outputs):
    print(f"Input: {prompt}")
    print(f"Output: {output.outputs[0].text}")

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
