Principle: bigscience-workshop Petals Block Loading and Serving
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Model_Serving, Infrastructure |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Block loading and serving is the process of loading transformer block weights from the HuggingFace Hub, applying quantization, wrapping the blocks in execution backends, and serving them to clients via P2P RPC handlers with shared KV-cache management.
Description
Block Loading and Serving is the core server-side mechanism that makes distributed inference possible. It involves:
Weight loading:
- Downloads model shards from HuggingFace (safetensors or PyTorch format)
- Uses a disk cache with LRU eviction to avoid redundant downloads
- Loads only the weights for the assigned block indices
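The caching behavior described above can be sketched with a toy LRU cache keyed by block index. This is an illustrative model only: `fetch_from_hub` is a hypothetical stand-in for the actual HuggingFace shard download, and the real Petals disk cache evicts by size on disk, not entry count.

```python
from collections import OrderedDict

class BlockWeightCache:
    """Toy LRU cache for per-block weight shards (illustrative only)."""

    def __init__(self, max_entries, fetch_from_hub):
        self.max_entries = max_entries
        self.fetch = fetch_from_hub          # hypothetical downloader
        self._cache = OrderedDict()          # block_idx -> weights

    def load_block(self, block_idx):
        if block_idx in self._cache:
            self._cache.move_to_end(block_idx)   # mark as recently used
            return self._cache[block_idx]
        weights = self.fetch(block_idx)          # download only this block's shard
        self._cache[block_idx] = weights
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)      # evict least recently used
        return weights
```

The key point is that a server assigned blocks `[k, k+1, ...]` never downloads the full model, only the shards covering its assigned indices.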
Block preparation:
- Instantiates the correct transformer block class (Bloom, Llama, Falcon, Mixtral)
- Applies quantization (INT8 or NF4 via bitsandbytes)
- Optionally wraps for tensor parallelism across multiple GPUs
- Wraps in TransformerBackend which manages inference/forward/backward execution
- Optionally pre-loads LoRA adapters
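The preparation steps above form a small pipeline: quantize the loaded weights, then wrap the resulting block in a backend that enforces serving limits. The sketch below models that flow with stand-in names; `TransformerBackendSketch` is not the real `petals.server.backend.TransformerBackend` API, and `quantize_fn` abstracts over the bitsandbytes INT8/NF4 paths.

```python
class TransformerBackendSketch:
    """Illustrative backend wrapper: owns one block and caps batch size."""

    def __init__(self, block_fn, max_batch_size):
        self.block_fn = block_fn
        self.max_batch_size = max_batch_size

    def forward(self, batch):
        if len(batch) > self.max_batch_size:
            raise ValueError("batch exceeds backend capacity")
        return [self.block_fn(x) for x in batch]

def prepare_block(weights, quantize_fn, max_batch_size):
    """Preparation pipeline: quantize the weights, then wrap in a backend."""
    block_fn = quantize_fn(weights)
    return TransformerBackendSketch(block_fn, max_batch_size)
```

In the real server, the same wrapper is also the point where tensor parallelism and pre-loaded LoRA adapters are attached, so the request handlers only ever see a uniform backend interface.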
Request handling:
- TransformerConnectionHandler handles P2P RPC endpoints: rpc_inference (streaming), rpc_forward, rpc_backward
- MemoryCache manages shared KV cache memory with async allocation and timeout
- PrioritizedTaskPool schedules concurrent requests from multiple clients
- ModuleAnnouncerThread periodically announces the served blocks' ONLINE state in the DHT
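The scheduling idea behind the task pool can be modeled with a simple priority heap: lower priority values run first, and ties break by arrival order. This is a toy illustration of the concept, not the real `PrioritizedTaskPool` implementation.

```python
import heapq
import itertools

class PrioritizedTaskPoolSketch:
    """Toy priority task pool: lowest priority value runs first, FIFO on ties."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()    # arrival order tiebreaker

    def submit(self, priority, task):
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def run_next(self):
        _, _, task = heapq.heappop(self._heap)
        return task()
```

This lets the server favor latency-sensitive work (e.g. interactive inference steps) over bulk forward/backward passes queued by training clients.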
Usage
This principle is applied automatically within ModuleContainer.create() during server startup. Server operators control it via CLI flags for quantization, tensor parallelism, and adapter pre-loading.
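As a rough illustration, a server launch might look like the following. Flag names are from the Petals `run_server` CLI and may vary between versions; consult `python -m petals.cli.run_server --help` for the authoritative list.

```shell
# Serve blocks of a model with NF4 quantization and a pre-loaded adapter
# (flag names approximate; verify against your installed Petals version)
python -m petals.cli.run_server bigscience/bloom-560m \
    --quant_type nf4 \
    --adapters my-org/my-lora-adapter
```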
Theoretical Basis
Server-side block execution pipeline:
```python
# Abstract block serving architecture (pseudocode)
backends = {}
for block_idx in assigned_blocks:
    weights = load_pretrained_block(model_name, block_idx, cache_dir)
    block = quantize(weights, quant_type)
    backends[block_idx] = TransformerBackend(block, memory_cache, max_batch_size)

handler = TransformerConnectionHandler(backends, memory_cache)
# handler.rpc_inference: streaming, KV-cached autoregressive generation
# handler.rpc_forward:   single forward pass (training)
# handler.rpc_backward:  gradient computation (training)

runtime = Runtime(handler)
runtime.start()  # process requests from task pools
```
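The streaming nature of `rpc_inference` can be sketched as a generator that keeps one session's KV cache across steps. This is a toy model: `block_fn(hidden, past)` is a hypothetical attention step over the cached state, not the real RPC signature.

```python
def rpc_inference_sketch(block_fn, hidden_stream):
    """Toy streaming inference session: the server holds this session's
    KV cache across steps and yields one output per incoming hidden state."""
    kv_cache = []                        # per-session past keys/values
    for hidden in hidden_stream:
        out = block_fn(hidden, kv_cache)  # "attend" over cached history
        kv_cache.append(hidden)           # grow the cache by one step
        yield out                         # stream the result back to the client
```

Because the cache lives on the server for the duration of the session, the client only ever sends the newest token's hidden state, which is what makes autoregressive generation over a network practical.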
Memory cache management:
- Cache is a shared tensor pool with async allocation
- Clients request cache slots for their inference sessions
- Allocation uses a priority queue with timeout
- Cache slots are freed when their sessions end or time out
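The allocation-with-timeout behavior above can be modeled with an `asyncio.Condition` guarding a fixed byte budget: requests wait until enough space frees, or fail after a deadline. This is an illustrative sketch, not the real `MemoryCache` implementation.

```python
import asyncio

class MemoryCacheSketch:
    """Toy shared cache: a byte budget guarded by an asyncio.Condition."""

    def __init__(self, total_bytes):
        self.total = total_bytes
        self.used = 0
        self._cond = asyncio.Condition()

    async def allocate(self, nbytes, timeout):
        async with self._cond:
            try:
                # wait until enough free space exists, or give up after `timeout`
                await asyncio.wait_for(
                    self._cond.wait_for(lambda: self.total - self.used >= nbytes),
                    timeout,
                )
            except asyncio.TimeoutError:
                raise TimeoutError("cache allocation timed out")
            self.used += nbytes

    async def free(self, nbytes):
        async with self._cond:
            self.used -= nbytes
            self._cond.notify_all()   # wake waiters blocked in allocate()
```

Timing out instead of blocking forever matters in a P2P setting: a client that disconnects mid-session must not pin KV-cache memory that other clients are waiting on.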