
Principle:Microsoft BIPIA Response Generation

From Leeroopedia
Field Value
sources BIPIA: Benchmarking Indirect Prompt Injection Attacks
domains NLP, Inference, Benchmarking
last_updated 2026-02-14

Overview

A batched inference pipeline pattern that generates LLM responses from poisoned prompts using DataLoader-based batching, with support for resumable execution and multi-backend generation.

Description

The response generation principle defines a structured inference pipeline for running large language models over adversarially constructed (poisoned) benchmark datasets. The core approach loads a processed dataset into a PyTorch DataLoader, iterates over the data in configurable batches, invokes model.generate() for each batch, and writes results incrementally to JSONL files on disk.

The pipeline begins by constructing the poisoned dataset through an AutoPIABuilder, which combines benign context data with injected attack instructions. A model-specific process_fn then formats each example into the prompt structure expected by the target LLM (e.g., chat messages for GPT, tokenized input_ids for HuggingFace models, or templated strings for vLLM). The formatted dataset is handed to a DataLoader with an appropriate data collator for batching.
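The shape of a model-specific process_fn is not spelled out in the source; a minimal sketch for a GPT-style chat backend, assuming hypothetical field names system_prompt and user_prompt on each poisoned example:

```python
def gpt_process_fn(example):
    """Format one poisoned example into the chat-message list expected by an
    OpenAI-style chat completion API (field names are assumptions)."""
    example["message"] = [
        {"role": "system", "content": example["system_prompt"]},
        {"role": "user", "content": example["user_prompt"]},
    ]
    return example
```

A HuggingFace variant would instead tokenize the prompt into input_ids, and a vLLM variant would render a single templated string; in all three cases the function maps one dataset row to the backend's native prompt format.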

During generation, the pipeline supports three distinct model backends:

  • GPT (API-based): Sends chat completion or completion requests sequentially per message through the OpenAI API. Handles rate limiting, timeouts, and service unavailability with automatic retry logic.
  • HuggingFace (local GPU): Performs batched tensor-based generation using AutoModelForCausalLM with configurable generation parameters (do_sample=False for deterministic greedy decoding, max_new_tokens=512). Supports LoRA adapters, delta weight merging, and 8-bit quantization.
  • vLLM (tensor-parallel): Uses the vLLM engine with SamplingParams (temperature=0, max_tokens=2048) for high-throughput batched text generation across multiple GPUs via tensor parallelism.
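The retry logic for the API backend can be sketched generically; this is a minimal exponential-backoff helper, not the benchmark's actual implementation, and the retryable exception types are placeholders for whatever rate-limit and timeout errors the API client raises:

```python
import random
import time

def with_retries(fn, max_retries=5, base_delay=1.0,
                 retryable=(TimeoutError, ConnectionError)):
    """Call fn, retrying transient failures with exponential backoff + jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retryable:
            if attempt == max_retries - 1:
                raise  # exhausted retries: surface the error to the caller
            # back off 1s, 2s, 4s, ... plus jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

Wrapping each per-message API call this way lets the pipeline ride out rate limiting and brief service unavailability without aborting the whole run.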

The pipeline supports resumable execution by tracking already-processed messages. On resume, the existing JSONL output is read, completed samples are identified by their message content, and the dataset is filtered to exclude them before generation continues.

Usage

Use this principle when running large-scale LLM inference over poisoned benchmark datasets from the BIPIA benchmark suite. The batched approach handles GPU memory constraints by processing examples in fixed-size batches rather than loading the full dataset into memory at once. The resume mechanism enables recovery from API rate limits, GPU out-of-memory crashes, or network interruptions without re-processing already completed samples. The multi-backend design allows the same pipeline logic to work across cloud API models (GPT-3.5, GPT-4), locally hosted HuggingFace models (Alpaca, Vicuna, Llama2, etc.), and vLLM-accelerated models (Dolly, StableLM, MPT, Mistral).

Theoretical Basis

The inference pipeline is grounded in batched inference -- a standard practice for running neural network forward passes efficiently by grouping multiple inputs together.

Data Collation: The DataLoader uses one of two collator strategies depending on the model backend:

  • DefaultDataCollator: Used for API-based models (GPT) and vLLM models where inputs are strings or message lists. Simply groups examples into a batch dictionary without padding.
  • DataCollatorWithPadding: Used for HuggingFace transformer models where inputs are tokenized into integer sequences (input_ids). Pads shorter sequences in a batch to the length of the longest sequence, enabling parallel tensor computation on GPU. Padding is applied on the left side (padding_side="left") so that every prompt's final real token sits at the right edge of the batch and autoregressive generation continues directly from it.
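The left-padding behavior can be illustrated without the transformers library; a minimal sketch of what the collator does to a batch, assuming a pad token id of 0:

```python
def left_pad_batch(batch_input_ids, pad_id=0):
    """Pad each token sequence on the LEFT to the longest length in the batch,
    returning input_ids plus a matching attention mask (0 = padding)."""
    max_len = max(len(ids) for ids in batch_input_ids)
    padded = [[pad_id] * (max_len - len(ids)) + ids for ids in batch_input_ids]
    mask = [[0] * (max_len - len(ids)) + [1] * len(ids)
            for ids in batch_input_ids]
    return {"input_ids": padded, "attention_mask": mask}
```

With left padding, position max_len - 1 holds the last real token of every prompt, so generate() can append new tokens for the whole batch in lockstep.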

Generation Parameters: Deterministic output is achieved through temperature=0 (for API and vLLM backends) or do_sample=False (for HuggingFace backend). This ensures reproducible results across runs. The max_new_tokens parameter controls the maximum output length (512 for HuggingFace, 2048 for vLLM/GPT). Stopping criteria based on conversation template stop strings ensure outputs are properly terminated.
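Stop-string handling can be sketched as post-hoc truncation of the decoded text; the stop markers below are illustrative examples, not the benchmark's actual conversation-template strings:

```python
def truncate_at_stop(text, stop_strings=("</s>", "### Human:")):
    """Cut decoded model output at the first occurrence of any stop string,
    so the response ends where the conversation template says it should."""
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]
```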

Resumable Streaming Output: Results are written incrementally to JSONL files at configurable intervals (log_steps). On resume, the pipeline reads existing output, builds a set of already-processed messages, and filters the dataset accordingly. This provides fault tolerance without checkpointing model state.
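The source does not define the JSONL helpers themselves; a minimal sketch of both sides of the resume mechanism, assuming each output record carries a "message" field (serialized to a canonical JSON string so that chat-message lists, which are unhashable, can live in a set):

```python
import json

def write_jsonl(path, records):
    """Rewrite the output file with all accumulated records, one JSON per line."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def load_existing_messages(path):
    """Collect the messages already answered in a previous run's output."""
    messages = set()
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            # canonical serialization makes list-of-dict chat messages hashable
            messages.add(json.dumps(rec["message"], sort_keys=True))
    return messages
```

On resume, the dataset filter keeps only examples whose serialized message is absent from this set, so completed samples are never re-generated.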

Pseudocode for the inference loop:

# Build poisoned dataset
pia_builder = AutoPIABuilder.from_name(dataset_name)(seed)
pia_samples = pia_builder(context_data_file, attack_data_file)
pia_dataset = Dataset.from_pandas(pia_samples)

# Load LLM and apply model-specific prompt formatting
llm = AutoLLM.from_name(llm_config_file)(config=llm_config_file, accelerator=accelerator)
processed_datasets = pia_dataset.map(llm.process_fn)

# Resume filtering: remove already-processed messages
if resume and output_path.exists():
    exist_messages = load_existing_messages(output_path)
    processed_datasets = processed_datasets.filter(lambda ex: ex["message"] not in exist_messages)

# Choose collator based on model type
if "input_ids" in processed_datasets.column_names:
    collator = DataCollatorWithPadding(llm.tokenizer)
else:
    collator = DefaultDataCollator()

dataloader = DataLoader(processed_datasets, batch_size=batch_size, collate_fn=collator)

# Generation loop
results = []
with torch.no_grad():
    for step, data in enumerate(dataloader):
        responses = llm.generate(data)
        for attack_name, task_name, target, response, message, position in zip(...):
            results.append({"attack_name": ..., "task_name": ..., "response": ..., ...})
        if log_steps and step % log_steps == 0:
            write_jsonl(output_path, results)

write_jsonl(output_path, results)
