Implementation:PacktPublishing LLM Engineers Handbook VLLM LLM Generate
Overview
VLLM LLM Generate implements the Principle:PacktPublishing_LLM_Engineers_Handbook_Batch_Inference_Generation principle by using vLLM's LLM class and SamplingParams to efficiently generate answers from a fine-tuned model across an entire test dataset, then persisting the results to HuggingFace Hub.
| Aspect | Detail |
|---|---|
| Implementation Name | VLLM LLM Generate |
| Workflow | Model_Evaluation |
| Type | Wrapper Doc (vLLM) |
| Source File | llm_engineering/model/evaluation/evaluate.py (Lines 25–52) |
| Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_Batch_Inference_Generation |
API Signature
def generate_answers(model_id: str, dataset_name: str) -> Dataset
Internally uses:
LLM(model=model_id, max_model_len=2048)
llm.generate(prompts, SamplingParams(...))
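The two calls above interact: max_model_len bounds the prompt and the generated tokens together, so a max_tokens of 2048 is only reachable for an empty prompt. The sketch below works through that arithmetic in plain Python. The helper name generation_budget is hypothetical and not part of vLLM, and returning 0 for an over-long prompt is a simplification, since how vLLM treats such prompts is not covered here.

```python
def generation_budget(
    prompt_tokens: int,
    max_model_len: int = 2048,
    max_tokens: int = 2048,
) -> int:
    """Upper bound on generated tokens for one prompt (hypothetical helper).

    Assumes the context window counts prompt and output tokens together.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    remaining = max_model_len - prompt_tokens
    # The request stops at whichever limit is hit first.
    return max(0, min(max_tokens, remaining))


# A 300-token instruction leaves room for at most 1748 generated tokens.
budget_for_300_token_prompt = generation_budget(300)
```

Under these settings, long instructions shorten the answers that can be produced, which is worth keeping in mind when comparing outputs across prompts of very different lengths.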
Key Code
def generate_answers(model_id: str, dataset_name: str) -> Dataset:
    # Load the held-out test split and use each instruction as a prompt.
    dataset = load_dataset(dataset_name, split="test")
    prompts = [row["instruction"] for row in dataset]

    # One engine instance serves the whole batch.
    llm = LLM(model=model_id, max_model_len=2048)
    sampling_params = SamplingParams(
        temperature=0.8,
        top_p=0.95,
        min_p=0.05,
        max_tokens=2048,
    )
    outputs = llm.generate(prompts, sampling_params)

    # Keep the first completion returned for each prompt.
    answers = [output.outputs[0].text for output in outputs]
    dataset = dataset.add_column("answers", answers)

    # MODEL_HUGGINGFACE_WORKSPACE is not defined in this excerpt.
    dataset.push_to_hub(
        f"{MODEL_HUGGINGFACE_WORKSPACE}/{model_id.split('/')[-1]}-results"
    )
    return dataset
Imports
from vllm import LLM, SamplingParams
from datasets import Dataset, load_dataset
Inputs
| Parameter | Type | Description |
|---|---|---|
| model_id | str | HuggingFace Hub model identifier (e.g., "pauliusztin/llm-twin-7b"). Must point to a valid model with a compatible architecture. |
| dataset_name | str | HuggingFace Hub dataset identifier. The dataset must have a test split with an "instruction" column. |
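The function trusts that every row carries a usable "instruction" string. A minimal sketch of a pre-flight check is shown below, operating on plain dictionaries rather than a real Dataset object. The helper name extract_prompts and its error messages are hypothetical and do not appear in the source file.

```python
from typing import Iterable, Mapping


def extract_prompts(
    rows: Iterable[Mapping[str, object]],
    column: str = "instruction",
) -> list[str]:
    """Collect prompt strings, failing early on malformed rows (hypothetical helper)."""
    prompts = []
    for index, row in enumerate(rows):
        if column not in row:
            raise KeyError(f"row {index} has no {column!r} column")
        value = row[column]
        # Reject empty or non-string values before they reach the engine.
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"row {index} has an empty or non-string {column!r}")
        prompts.append(value)
    return prompts
```

Running a check like this before creating the engine avoids loading model weights only to fail on the first bad row.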
Outputs
| Return | Type | Description |
|---|---|---|
| Augmented dataset | Dataset | The original test dataset with an added "answers" column containing the text generated by the model. |
Side effect: The augmented dataset is pushed to HuggingFace Hub at {workspace}/{model_name}-results.
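The destination name is built by keeping only the last path segment of the model ID and appending -results. The sketch below isolates that string logic. The function name results_repo_id and the workspace value "my-workspace" are placeholders for illustration, not values from the source.

```python
def results_repo_id(workspace: str, model_id: str) -> str:
    """Mirror the f-string used in generate_answers (hypothetical helper)."""
    # "owner/name" -> "name"; an ID without a slash is kept unchanged.
    model_name = model_id.split("/")[-1]
    return f"{workspace}/{model_name}-results"


example_repo = results_repo_id("my-workspace", "pauliusztin/llm-twin-7b")
```

Because the owner prefix is discarded, two models with the same name under different owners would map to the same results repository.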
Sampling Parameters
| Parameter | Value | Purpose |
|---|---|---|
| temperature | 0.8 | Controls randomness. A moderate temperature allows some diversity while keeping output coherent. |
| top_p | 0.95 | Nucleus sampling threshold. Sampling is restricted to the most likely tokens whose cumulative probability reaches 95%. |
| min_p | 0.05 | Minimum probability threshold. Filters out tokens with less than 5% of the top token's probability. |
| max_tokens | 2048 | Maximum number of tokens to generate per response. Because max_model_len is also 2048, the prompt and the response must fit within 2048 tokens together. |
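To make the three probabilistic parameters concrete, the sketch below applies each one to a toy distribution in plain Python. It is an illustration of the definitions in the table, not vLLM's sampler: the function names are hypothetical, and the order in which a real engine combines these filters is not modeled, so each filter is applied independently.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_keep(probs, top_p):
    """Indices of the most likely tokens whose cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return sorted(kept)


def min_p_keep(probs, min_p):
    """Indices of tokens with at least min_p times the top token's probability."""
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]


# A sharply peaked toy distribution over five tokens.
peaked = [0.90, 0.04, 0.03, 0.02, 0.01]
nucleus = top_p_keep(peaked, 0.95)   # keeps tokens 0, 1, 2
relative = min_p_keep(peaked, 0.05)  # keeps only token 0
```

On a peaked distribution like this one, min_p is the stricter filter because its cutoff scales with the top token's probability, while top_p still admits a few low-probability tokens to reach its cumulative target.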
Step-by-Step Behavior
- Load dataset: The test split of the specified dataset is loaded from HuggingFace Hub.
- Extract prompts: The "instruction" column is extracted as a list of prompt strings.
- Initialize vLLM engine: An LLM instance is created with the specified model and a max_model_len of 2048. vLLM automatically handles model weight loading, KV cache allocation with PagedAttention, and continuous batching setup.
- Generate answers: All prompts are submitted to llm.generate() in a single call. vLLM internally schedules them using continuous batching for maximum GPU utilization.
- Extract text: The first output sequence's text is extracted from each RequestOutput object.
- Augment dataset: The generated answers are added as a new "answers" column to the dataset.
- Push to Hub: The augmented dataset is uploaded to HuggingFace Hub under a name derived from the model ID (appending -results).
- Return: The augmented dataset is returned for potential downstream use in the same process.
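The data flow in these steps can be exercised without a GPU by swapping the engine for a stub. The sketch below is a dry run under that assumption: FakeEngine, FakeRequestOutput, and FakeCompletion are hypothetical stand-ins that copy only the attribute shape used by the original code (outputs[0].text), and the Hub upload is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class FakeCompletion:
    text: str


@dataclass
class FakeRequestOutput:
    outputs: list  # list of FakeCompletion, mirroring the attribute used above


class FakeEngine:
    """Stand-in for the real engine; echoes each prompt in upper case."""

    def generate(self, prompts, sampling_params=None):
        return [FakeRequestOutput([FakeCompletion(p.upper())]) for p in prompts]


def generate_answers_sketch(rows, engine):
    """Extract prompts, generate in one call, and attach an "answers" field."""
    prompts = [row["instruction"] for row in rows]
    outputs = engine.generate(prompts)
    # Like the original code, this relies on outputs arriving in prompt order.
    answers = [output.outputs[0].text for output in outputs]
    return [{**row, "answers": answer} for row, answer in zip(rows, answers)]


rows = [{"instruction": "say hi"}, {"instruction": "say bye"}]
augmented = generate_answers_sketch(rows, FakeEngine())
```

A stub like this is useful for unit-testing the surrounding plumbing, while the real engine is reserved for actual evaluation runs.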
External Dependencies
| Dependency | Purpose |
|---|---|
| vllm | High-throughput inference engine with PagedAttention and continuous batching |
| datasets | HuggingFace Datasets library for loading and pushing datasets |
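Since both packages are heavyweight, it can help to confirm they are importable before starting an evaluation job. The sketch below uses only the standard library and does not import the packages themselves. The helper name missing_dependencies is hypothetical.

```python
import importlib.util


def missing_dependencies(names=("vllm", "datasets")):
    """Return the top-level package names that cannot be found (hypothetical helper)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means every listed package is importable in this environment.
missing = missing_dependencies()
```

This only checks that the packages are installed; it says nothing about version compatibility or GPU availability.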
External Reference
- vLLM Documentation
- vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention
Performance Notes
vLLM's generate() method with a list of prompts leverages continuous batching automatically. Compared to a naive loop using HuggingFace model.generate():
- Throughput: Typically cited as 5–24x higher due to continuous batching and PagedAttention; the actual gain depends on the model, hardware, and sequence lengths
- Memory efficiency: PagedAttention reduces KV cache waste, allowing larger effective batch sizes
- Simplicity: A single llm.generate(prompts, params) call replaces complex batching logic
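The throughput advantage of continuous batching comes from refilling a slot as soon as a sequence finishes, instead of waiting for the longest sequence in a fixed batch. The toy simulation below counts decode steps under both policies. It is a simplified model with hypothetical function names: it ignores prefill cost, memory limits, and everything else a real scheduler handles, and it assumes every request needs at least one step.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        total += max(lengths[start:start + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    running = []  # remaining steps for each in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Requests with one step left finish now and release their slot.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


# One long request followed by several short ones, two slots available.
lengths = [10, 2, 2, 2, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=2)          # 14
continuous_steps = continuous_batching_steps(lengths, batch_size=2)  # 10
```

In this example the short requests flow through the second slot while the long request occupies the first, so the whole workload finishes in the time the long request takes. When all requests have the same length, the two policies take the same number of steps.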
See Also
- Principle:PacktPublishing_LLM_Engineers_Handbook_Batch_Inference_Generation — the principle this implements
- Implementation:PacktPublishing_LLM_Engineers_Handbook_HfApi_Model_Info — the upstream validation that provides the model_id
- Implementation:PacktPublishing_LLM_Engineers_Handbook_OpenAI_Chat_Completions — the downstream scoring step that evaluates the generated answers
- Environment:PacktPublishing_LLM_Engineers_Handbook_VLLM_Evaluation_Environment
- Heuristic:PacktPublishing_LLM_Engineers_Handbook_Temperature_Selection_By_Task