Implementation: bigscience-workshop Petals ModuleContainer.create
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Model_Serving, Infrastructure |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Concrete tool for loading, quantizing, and serving transformer blocks as RPC endpoints, provided by the Petals server module.
Description
ModuleContainer.create() is the factory method that assembles the complete server-side serving infrastructure:
- Load blocks: Calls load_pretrained_block() for each block index
- Convert blocks: Applies quantization (INT8/NF4) and tensor parallelism via convert_block()
- Create backends: Wraps each block in TransformerBackend with task pools
- Set up memory cache: Creates a MemoryCache with the configured byte budget for KV caches
- Create handler: Instantiates a TransformerConnectionHandler with all backends
- Start announcer: Launches a ModuleAnnouncerThread that declares the blocks as ONLINE in the DHT
- Launch runtime: Returns a ModuleContainer thread that runs the hivemind Runtime
The method returns a running ModuleContainer thread that serves requests until shutdown.
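The first three steps above key each backend by a DHT module UID of the form `{dht_prefix}.{block_idx}` (as shown in the internal-flow example below). A minimal sketch of that naming scheme, using a plain dict and a placeholder string instead of the real TransformerBackend:

```python
# Sketch: how per-block backends are keyed for DHT announcement.
# The placeholder backend values are illustrative, not the real classes.
def build_backend_uids(dht_prefix: str, block_indices: list) -> dict:
    """Map each served block to its module UID, e.g. 'bloom.12'."""
    return {f"{dht_prefix}.{idx}": f"<backend for block {idx}>" for idx in block_indices}

uids = build_backend_uids("bloom", [12, 13, 14])
print(sorted(uids))  # ['bloom.12', 'bloom.13', 'bloom.14']
```

Clients later resolve these UIDs through the DHT to find which servers hold which blocks.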
Usage
Called automatically by Server.run() after block selection. Not typically called directly by users.
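In practice the factory is reached through the server entry point; a typical launch looks like the following (the model name and flags are illustrative — consult `python -m petals.cli.run_server --help` for the exact options):

```shell
# Launch a Petals server; Server.run() selects blocks and then calls
# ModuleContainer.create() internally. Model name and flags are examples.
python -m petals.cli.run_server bigscience/bloom-560m --num_blocks 8
```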
Code Reference
Source Location
- Repository: petals
- File: src/petals/server/server.py (L435-555, ModuleContainer.create)
- File: src/petals/server/from_pretrained.py (L35-75, load_pretrained_block)
- File: src/petals/server/handler.py (L55-93, TransformerConnectionHandler.__init__)
- File: src/petals/server/memory_cache.py (L26-42, MemoryCache.__init__)
Signature
```python
class ModuleContainer(threading.Thread):
    @classmethod
    def create(
        cls,
        *,
        dht: DHT,
        dht_prefix: str,
        converted_model_name_or_path: str,
        block_config: PretrainedConfig,
        attn_cache_bytes: int,
        server_info: ServerInfo,
        model_info: ModelInfo,
        block_indices: List[int],
        min_batch_size: int,
        max_batch_size: int,
        max_chunk_size_bytes: int,
        max_alloc_timeout: float,
        torch_dtype: torch.dtype,
        cache_dir: str,
        max_disk_space: int,
        device: Union[str, torch.device],
        compression: CompressionType,
        update_period: float,
        expiration: Optional[float],
        revision: Optional[str],
        token: Optional[Union[str, bool]],
        quant_type: QuantType,
        tensor_parallel_devices: Sequence[torch.device],
        should_validate_reachability: bool,
        **kwargs,
    ) -> "ModuleContainer":
        """
        Factory method that loads blocks, creates backends, and starts serving.

        Args:
            block_indices: List of block indices to load and serve
            attn_cache_bytes: Total KV cache memory budget
            max_alloc_timeout: Timeout for cache allocation (600s default)
            quant_type: Quantization type (NF4, INT8, NONE)
            tensor_parallel_devices: Devices for TP sharding

        Returns:
            Running ModuleContainer thread
        """
```
Import
```python
from petals.server.server import ModuleContainer
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| block_indices | List[int] | Yes | Block indices to load and serve |
| converted_model_name_or_path | str | Yes | HuggingFace model name for weight download |
| attn_cache_bytes | int | Yes | Total KV cache memory budget in bytes |
| quant_type | QuantType | Yes | Quantization format (NF4, INT8, NONE) |
| torch_dtype | torch.dtype | Yes | Weight data type |
| device | Union[str, torch.device] | Yes | GPU device for block execution |
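As a rough guide to sizing `attn_cache_bytes`: under standard multi-head attention, each token cached by one block needs a key vector and a value vector of `hidden_size` elements each, at the width of `torch_dtype`. A hedged back-of-the-envelope sketch (the server derives its own default; the block count, token count, hidden size, and dtype width below are assumptions for illustration):

```python
def kv_cache_bytes(num_blocks: int, max_tokens: int, hidden_size: int, dtype_bytes: int = 2) -> int:
    """Approximate KV-cache budget: 2 tensors (K and V) per block per token."""
    return 2 * num_blocks * max_tokens * hidden_size * dtype_bytes

# e.g. 8 BLOOM-like blocks, 4096 concurrent tokens, hidden_size=14336, fp16
budget = kv_cache_bytes(8, 4096, 14336, dtype_bytes=2)
print(budget / 2**30)  # 1.75 (GiB)
```

Grouped-query or multi-query attention shrinks this proportionally, so treat the figure as an upper bound.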
Outputs
| Name | Type | Description |
|---|---|---|
| container | ModuleContainer | Running thread serving blocks via RPC endpoints |
| container.module_backends | Dict[str, TransformerBackend] | Loaded and quantized block backends |
| DHT state | dict | Blocks announced as ONLINE in hivemind DHT |
Usage Examples
Understanding the Internal Flow
```python
# This is called internally by Server.run(). The flow is:

# 1. Load, convert, and wrap each block
backends = {}
for block_idx in block_indices:
    block = load_pretrained_block(
        model_name, block_idx,
        config=block_config,
        torch_dtype=torch_dtype,
        cache_dir=cache_dir,
    )
    block = convert_block(block, block_idx, block_config, quant_type, ...)
    backend = TransformerBackend(block, ...)
    backends[f"{dht_prefix}.{block_idx}"] = backend

# 2. Create the shared memory cache for KV tensors
memory_cache = MemoryCache(max_size_bytes=attn_cache_bytes, max_alloc_timeout=max_alloc_timeout)

# 3. Create the connection handler
handler = TransformerConnectionHandler(dht, backends, inference_max_length=..., ...)

# 4. Start serving
container = ModuleContainer(dht, backends, handler=handler, ...)
container.start()
```