Implementation: vLLM LLM.generate (LLMBook-zh, llmbook-zh.github.io)
| Field | Value |
|---|---|
| Domains | NLP, Inference, Systems |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool for high-throughput LLM inference using vLLM's batch generation engine, as used in the LLMBook repository.
Description
vllm.LLM initializes the inference engine, and LLM.generate performs batch generation with configurable SamplingParams. The repository demonstrates using vLLM with LLaMA-2 Chat format prompts, greedy decoding (temperature=0), and a 2048-token maximum output.
This is a Wrapper Doc documenting how the LLMBook repository uses the vLLM library.
Usage
Use this for batch inference with LLMs when you need high throughput and efficient memory management.
Code Reference
Source Location
- Repository: LLMBook-zh
- File: code/9.1 vLLM实践.py
- Lines: 1-24
Signature
# Initialize vLLM engine
model = vllm.LLM(model: str)
# Configure sampling (the values shown are those used in the repository,
# not vLLM library defaults)
sampling_params = vllm.SamplingParams(
    temperature: float = 0,        # greedy decoding
    max_tokens: int = 2048,
    presence_penalty: float = 0,
    frequency_penalty: float = 0,
)
# Generate
outputs = model.generate(
    prompts: list[str],
    sampling_params: SamplingParams,
) -> list[RequestOutput]
Import
import vllm
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | HuggingFace model ID (e.g., "meta-llama/Llama-2-7b-chat-hf") |
| prompts | list[str] | Yes | List of prompt strings |
| sampling_params | SamplingParams | Yes | Decoding configuration |
Outputs
| Name | Type | Description |
|---|---|---|
| outputs | list[RequestOutput] | Generated outputs; text via output.outputs[0].text |
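The extraction pattern implied by the Outputs row can be sketched without a GPU by using stand-in objects. The `namedtuple` types below are hypothetical mocks that mirror only the fields of vLLM's `RequestOutput`/`CompletionOutput` used here; the real objects carry additional fields (e.g., token IDs):

```python
from collections import namedtuple

# Stand-in types mirroring the shape of vLLM's output objects
# (illustrative only; real RequestOutput/CompletionOutput have more fields).
CompletionOutput = namedtuple("CompletionOutput", ["text"])
RequestOutput = namedtuple("RequestOutput", ["prompt", "outputs"])

def extract_texts(request_outputs):
    # With a single sample per prompt (the default), the generated text
    # for each request lives at outputs[0].text.
    return [ro.outputs[0].text for ro in request_outputs]

batch = [
    RequestOutput("[INST] How are you? [/INST]",
                  [CompletionOutput(" I'm doing well, thank you!")]),
]
texts = extract_texts(batch)
```

The same list comprehension works unchanged on real `model.generate(...)` results.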
Usage Examples
import vllm
# Initialize
model = vllm.LLM(model="meta-llama/Llama-2-7b-chat-hf")
# Configure greedy decoding
sampling_params = vllm.SamplingParams(
temperature=0,
max_tokens=2048,
)
# Generate
prompts = [
"[INST] How are you? [/INST]",
"[INST] 1 + 1 = ? [/INST]",
]
outputs = model.generate(prompts, sampling_params=sampling_params)

for prompt, output in zip(prompts, outputs):
    print(f"Input: {prompt}")
    print(f"Output: {output.outputs[0].text}")
Related Pages
Requires Environment
- Environment:LLMBook_zh_LLMBook_zh_github_io_PyTorch_CUDA_GPU_Environment
- Environment:LLMBook_zh_LLMBook_zh_github_io_VLLM_Inference_Environment