Implementation: bigscience-workshop Petals ModuleContainer.create
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Model_Serving, Infrastructure |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Concrete tool for loading, quantizing, and serving transformer blocks as RPC endpoints, provided by the Petals server module.
Description
ModuleContainer.create() is the factory method that assembles the complete server-side serving infrastructure:
- Load blocks: Calls load_pretrained_block() for each block index
- Convert blocks: Applies quantization (INT8/NF4) and tensor parallelism via convert_block()
- Create backends: Wraps each block in TransformerBackend with task pools
- Set up memory cache: Creates a MemoryCache with the configured byte budget for KV caches
- Create handler: Instantiates a TransformerConnectionHandler with all backends
- Start announcer: Launches a ModuleAnnouncerThread that declares the blocks as ONLINE in the DHT
- Launch runtime: Returns a ModuleContainer thread that runs the hivemind Runtime
The method returns a running ModuleContainer thread that serves requests until shutdown.
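The first three steps above key each backend by a DHT module UID of the form `{dht_prefix}.{block_idx}` (as shown in the internal-flow example below). A minimal sketch of that naming scheme, using a plain dict and a placeholder string instead of the real TransformerBackend:

```python
# Sketch: how per-block backends are keyed for DHT announcement.
# The placeholder backend values are illustrative, not the real classes.
def build_backend_uids(dht_prefix: str, block_indices: list) -> dict:
    """Map each served block to its module UID, e.g. 'bloom.12'."""
    return {f"{dht_prefix}.{idx}": f"<backend for block {idx}>" for idx in block_indices}

uids = build_backend_uids("bloom", [12, 13, 14])
print(sorted(uids))  # ['bloom.12', 'bloom.13', 'bloom.14']
```

Clients later resolve these UIDs through the DHT to find which servers hold which blocks.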
Usage
Called automatically by Server.run() after block selection. Not typically called directly by users.
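In practice the factory is reached through the server entry point; a typical launch looks like the following (the model name and flags are illustrative — consult `python -m petals.cli.run_server --help` for the exact options):

```shell
# Launch a Petals server; Server.run() selects blocks and then calls
# ModuleContainer.create() internally. Model name and flags are examples.
python -m petals.cli.run_server bigscience/bloom-560m --num_blocks 8
```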
Code Reference
Source Location
- Repository: petals
- File: src/petals/server/server.py (L435-555, ModuleContainer.create)
- File: src/petals/server/from_pretrained.py (L35-75, load_pretrained_block)
- File: src/petals/server/handler.py (L55-93, TransformerConnectionHandler.__init__)
- File: src/petals/server/memory_cache.py (L26-42, MemoryCache.__init__)
Signature
```python
class ModuleContainer(threading.Thread):
    @classmethod
    def create(
        cls,
        *,
        dht: DHT,
        dht_prefix: str,
        converted_model_name_or_path: str,
        block_config: PretrainedConfig,
        attn_cache_bytes: int,
        server_info: ServerInfo,
        model_info: ModelInfo,
        block_indices: List[int],
        min_batch_size: int,
        max_batch_size: int,
        max_chunk_size_bytes: int,
        max_alloc_timeout: float,
        torch_dtype: torch.dtype,
        cache_dir: str,
        max_disk_space: int,
        device: Union[str, torch.device],
        compression: CompressionType,
        update_period: float,
        expiration: Optional[float],
        revision: Optional[str],
        token: Optional[Union[str, bool]],
        quant_type: QuantType,
        tensor_parallel_devices: Sequence[torch.device],
        should_validate_reachability: bool,
        **kwargs,
    ) -> "ModuleContainer":
        """
        Factory method that loads blocks, creates backends, and starts serving.

        Args:
            block_indices: List of block indices to load and serve
            attn_cache_bytes: Total KV cache memory budget
            max_alloc_timeout: Timeout for cache allocation (600s default)
            quant_type: Quantization type (NF4, INT8, NONE)
            tensor_parallel_devices: Devices for TP sharding

        Returns:
            Running ModuleContainer thread
        """
```
Import
```python
from petals.server.server import ModuleContainer
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| block_indices | List[int] | Yes | Block indices to load and serve |
| converted_model_name_or_path | str | Yes | HuggingFace model name for weight download |
| attn_cache_bytes | int | Yes | Total KV cache memory budget in bytes |
| quant_type | QuantType | Yes | Quantization format (NF4, INT8, NONE) |
| torch_dtype | torch.dtype | Yes | Weight data type |
| device | Union[str, torch.device] | Yes | GPU device for block execution |
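As a rough guide to sizing `attn_cache_bytes`: under standard multi-head attention, each token cached by one block needs a key vector and a value vector of `hidden_size` elements each, at the width of `torch_dtype`. A hedged back-of-the-envelope sketch (the server derives its own default; the block count, token count, hidden size, and dtype width below are assumptions for illustration):

```python
def kv_cache_bytes(num_blocks: int, max_tokens: int, hidden_size: int, dtype_bytes: int = 2) -> int:
    """Approximate KV-cache budget: 2 tensors (K and V) per block per token."""
    return 2 * num_blocks * max_tokens * hidden_size * dtype_bytes

# e.g. 8 BLOOM-like blocks, 4096 concurrent tokens, hidden_size=14336, fp16
budget = kv_cache_bytes(8, 4096, 14336, dtype_bytes=2)
print(budget / 2**30)  # 1.75 (GiB)
```

Grouped-query or multi-query attention shrinks this proportionally, so treat the figure as an upper bound.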
Outputs
| Name | Type | Description |
|---|---|---|
| container | ModuleContainer | Running thread serving blocks via RPC endpoints |
| container.module_backends | Dict[str, TransformerBackend] | Loaded and quantized block backends |
| DHT state | dict | Blocks announced as ONLINE in hivemind DHT |
Usage Examples
Understanding the Internal Flow
```python
# This is called internally by Server.run(). The flow is:

# 1. Load, convert, and wrap each block
backends = {}
for block_idx in block_indices:
    block = load_pretrained_block(
        model_name, block_idx,
        config=block_config,
        torch_dtype=torch_dtype,
        cache_dir=cache_dir,
    )
    block = convert_block(block, block_idx, block_config, quant_type, ...)
    backend = TransformerBackend(block, ...)
    backends[f"{dht_prefix}.{block_idx}"] = backend

# 2. Create the shared memory cache for KV tensors
memory_cache = MemoryCache(max_size_bytes=attn_cache_bytes, max_alloc_timeout=max_alloc_timeout)

# 3. Create the connection handler
handler = TransformerConnectionHandler(dht, backends, inference_max_length=..., ...)

# 4. Start serving
container = ModuleContainer(dht, backends, handler=handler, ...)
container.start()
```