Principle: bigscience-workshop Petals Block Loading and Serving
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Model_Serving, Infrastructure |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Block loading and serving is the process of loading transformer block weights from the HuggingFace Hub, applying quantization, wrapping the blocks in execution backends, and serving them to clients via P2P RPC handlers with shared KV-cache management.
Description
Block Loading and Serving is the core server-side mechanism that makes distributed inference possible. It involves:
Weight loading:
- Downloads model shards from HuggingFace (safetensors or PyTorch format)
- Uses a disk cache with LRU eviction to avoid redundant downloads
- Loads only the weights for the assigned block indices
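The caching behavior described above can be sketched with a toy LRU cache keyed by block index. This is an illustrative model only: `fetch_from_hub` is a hypothetical stand-in for the actual HuggingFace shard download, and the real Petals disk cache evicts by size on disk, not entry count.

```python
from collections import OrderedDict

class BlockWeightCache:
    """Toy LRU cache for per-block weight shards (illustrative only)."""

    def __init__(self, max_entries, fetch_from_hub):
        self.max_entries = max_entries
        self.fetch = fetch_from_hub          # hypothetical downloader
        self._cache = OrderedDict()          # block_idx -> weights

    def load_block(self, block_idx):
        if block_idx in self._cache:
            self._cache.move_to_end(block_idx)   # mark as recently used
            return self._cache[block_idx]
        weights = self.fetch(block_idx)          # download only this block's shard
        self._cache[block_idx] = weights
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)      # evict least recently used
        return weights
```

The key point is that a server assigned blocks `[k, k+1, ...]` never downloads the full model, only the shards covering its assigned indices.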
Block preparation:
- Instantiates the correct transformer block class (Bloom, Llama, Falcon, Mixtral)
- Applies quantization (INT8 or NF4 via bitsandbytes)
- Optionally wraps for tensor parallelism across multiple GPUs
- Wraps in TransformerBackend which manages inference/forward/backward execution
- Optionally pre-loads LoRA adapters
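The preparation steps above form a small pipeline: quantize the loaded weights, then wrap the resulting block in a backend that enforces serving limits. The sketch below models that flow with stand-in names; `TransformerBackendSketch` is not the real `petals.server.backend.TransformerBackend` API, and `quantize_fn` abstracts over the bitsandbytes INT8/NF4 paths.

```python
class TransformerBackendSketch:
    """Illustrative backend wrapper: owns one block and caps batch size."""

    def __init__(self, block_fn, max_batch_size):
        self.block_fn = block_fn
        self.max_batch_size = max_batch_size

    def forward(self, batch):
        if len(batch) > self.max_batch_size:
            raise ValueError("batch exceeds backend capacity")
        return [self.block_fn(x) for x in batch]

def prepare_block(weights, quantize_fn, max_batch_size):
    """Preparation pipeline: quantize the weights, then wrap in a backend."""
    block_fn = quantize_fn(weights)
    return TransformerBackendSketch(block_fn, max_batch_size)
```

In the real server, the same wrapper is also the point where tensor parallelism and pre-loaded LoRA adapters are attached, so the request handlers only ever see a uniform backend interface.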
Request handling:
- TransformerConnectionHandler handles P2P RPC endpoints: rpc_inference (streaming), rpc_forward, rpc_backward
- MemoryCache manages shared KV cache memory with async allocation and timeout
- PrioritizedTaskPool schedules concurrent requests from multiple clients
- ModuleAnnouncerThread periodically announces the served blocks' ONLINE state in the DHT
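The scheduling idea behind the task pool can be modeled with a simple priority heap: lower priority values run first, and ties break by arrival order. This is a toy illustration of the concept, not the real `PrioritizedTaskPool` implementation.

```python
import heapq
import itertools

class PrioritizedTaskPoolSketch:
    """Toy priority task pool: lowest priority value runs first, FIFO on ties."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()    # arrival order tiebreaker

    def submit(self, priority, task):
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def run_next(self):
        _, _, task = heapq.heappop(self._heap)
        return task()
```

This lets the server favor latency-sensitive work (e.g. interactive inference steps) over bulk forward/backward passes queued by training clients.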
Usage
This principle is applied automatically within ModuleContainer.create() during server startup. Server operators control it via CLI flags for quantization, tensor parallelism, and adapter pre-loading.
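As a rough illustration, a server launch might look like the following. Flag names are from the Petals `run_server` CLI and may vary between versions; consult `python -m petals.cli.run_server --help` for the authoritative list.

```shell
# Serve blocks of a model with NF4 quantization and a pre-loaded adapter
# (flag names approximate; verify against your installed Petals version)
python -m petals.cli.run_server bigscience/bloom-560m \
    --quant_type nf4 \
    --adapters my-org/my-lora-adapter
```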
Theoretical Basis
Server-side block execution pipeline:
```python
# Abstract block serving architecture (pseudocode)
backends = {}
for block_idx in assigned_blocks:
    weights = load_pretrained_block(model_name, block_idx, cache_dir)
    block = quantize(weights, quant_type)
    backends[block_idx] = TransformerBackend(block, memory_cache, max_batch_size)

handler = TransformerConnectionHandler(backends, memory_cache)
# handler.rpc_inference: streaming, KV-cached autoregressive generation
# handler.rpc_forward:   single forward pass (training)
# handler.rpc_backward:  gradient computation (training)

runtime = Runtime(handler)
runtime.start()  # process requests from task pools
```
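The streaming nature of `rpc_inference` can be sketched as a generator that keeps one session's KV cache across steps. This is a toy model: `block_fn(hidden, past)` is a hypothetical attention step over the cached state, not the real RPC signature.

```python
def rpc_inference_sketch(block_fn, hidden_stream):
    """Toy streaming inference session: the server holds this session's
    KV cache across steps and yields one output per incoming hidden state."""
    kv_cache = []                        # per-session past keys/values
    for hidden in hidden_stream:
        out = block_fn(hidden, kv_cache)  # "attend" over cached history
        kv_cache.append(hidden)           # grow the cache by one step
        yield out                         # stream the result back to the client
```

Because the cache lives on the server for the duration of the session, the client only ever sends the newest token's hidden state, which is what makes autoregressive generation over a network practical.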
Memory cache management:
- Cache is a shared tensor pool with async allocation
- Clients request cache slots for their inference sessions
- Allocation uses a priority queue with timeout
- Cache slots are freed when their sessions end or time out
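The allocation-with-timeout behavior above can be modeled with an `asyncio.Condition` guarding a fixed byte budget: requests wait until enough space frees, or fail after a deadline. This is an illustrative sketch, not the real `MemoryCache` implementation.

```python
import asyncio

class MemoryCacheSketch:
    """Toy shared cache: a byte budget guarded by an asyncio.Condition."""

    def __init__(self, total_bytes):
        self.total = total_bytes
        self.used = 0
        self._cond = asyncio.Condition()

    async def allocate(self, nbytes, timeout):
        async with self._cond:
            try:
                # wait until enough free space exists, or give up after `timeout`
                await asyncio.wait_for(
                    self._cond.wait_for(lambda: self.total - self.used >= nbytes),
                    timeout,
                )
            except asyncio.TimeoutError:
                raise TimeoutError("cache allocation timed out")
            self.used += nbytes

    async def free(self, nbytes):
        async with self._cond:
            self.used -= nbytes
            self._cond.notify_all()   # wake waiters blocked in allocate()
```

Timing out instead of blocking forever matters in a P2P setting: a client that disconnects mid-session must not pin KV-cache memory that other clients are waiting on.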