
Implementation:Bigscience workshop Petals ModuleContainer Create

From Leeroopedia


Knowledge Sources
Domains Distributed_Computing, Model_Serving, Infrastructure
Last Updated 2026-02-09 14:00 GMT

Overview

Concrete tool for loading, quantizing, and serving transformer blocks as RPC endpoints, provided by the Petals server module.

Description

ModuleContainer.create() is the factory method that assembles the complete server-side serving infrastructure:

  1. Load blocks: Calls load_pretrained_block() for each block index
  2. Convert blocks: Applies quantization (INT8/NF4) and tensor parallelism via convert_block()
  3. Create backends: Wraps each block in TransformerBackend with task pools
  4. Setup memory cache: Creates MemoryCache with configured byte budget for KV caches
  5. Create handler: Instantiates TransformerConnectionHandler with all backends
  6. Start announcer: Launches ModuleAnnouncerThread to declare blocks as ONLINE in DHT
  7. Launch runtime: Returns a ModuleContainer thread that runs the hivemind Runtime

The method returns a running ModuleContainer thread that serves requests until shutdown.
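Step 3's backends are keyed by a dotted name combining the DHT prefix with the block index (the same `{dht_prefix}.{block_idx}` convention used in the flow example later on this page). A minimal illustrative sketch; the helper name is hypothetical, and the real values are TransformerBackend instances:

```python
def make_backend_names(dht_prefix, block_indices):
    """Derive the DHT module names under which served blocks are announced
    (hypothetical helper, not part of Petals)."""
    return [f"{dht_prefix}.{block_idx}" for block_idx in block_indices]

names = make_backend_names("bigscience/bloom-petals", [12, 13, 14])
print(names)  # ['bigscience/bloom-petals.12', 'bigscience/bloom-petals.13', 'bigscience/bloom-petals.14']
```

These names are what the ModuleAnnouncerThread declares as ONLINE in the DHT, and what clients use to route RPC calls to the right block.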

Usage

Called automatically by Server.run() after block selection. Not typically called directly by users.

Code Reference

Source Location

  • Repository: petals
  • File: src/petals/server/server.py (L435-555, ModuleContainer.create)
  • File: src/petals/server/from_pretrained.py (L35-75, load_pretrained_block)
  • File: src/petals/server/handler.py (L55-93, TransformerConnectionHandler.__init__)
  • File: src/petals/server/memory_cache.py (L26-42, MemoryCache.__init__)

Signature

class ModuleContainer(threading.Thread):
    @classmethod
    def create(
        cls,
        *,
        dht: DHT,
        dht_prefix: str,
        converted_model_name_or_path: str,
        block_config: PretrainedConfig,
        attn_cache_bytes: int,
        server_info: ServerInfo,
        model_info: ModelInfo,
        block_indices: List[int],
        min_batch_size: int,
        max_batch_size: int,
        max_chunk_size_bytes: int,
        max_alloc_timeout: float,
        torch_dtype: torch.dtype,
        cache_dir: str,
        max_disk_space: int,
        device: Union[str, torch.device],
        compression: CompressionType,
        update_period: float,
        expiration: Optional[float],
        revision: Optional[str],
        token: Optional[Union[str, bool]],
        quant_type: QuantType,
        tensor_parallel_devices: Sequence[torch.device],
        should_validate_reachability: bool,
        **kwargs,
    ) -> "ModuleContainer":
        """
        Factory method that loads blocks, creates backends, and starts serving.

        Args:
            block_indices: List of block indices to load and serve
            attn_cache_bytes: Total KV cache memory budget
            max_alloc_timeout: Timeout for cache allocation (600s default)
            quant_type: Quantization type (NF4, INT8, NONE)
            tensor_parallel_devices: Devices for TP sharding
        Returns:
            Running ModuleContainer thread
        """

Import

from petals.server.server import ModuleContainer

I/O Contract

Inputs

Name Type Required Description
block_indices List[int] Yes Block indices to load and serve
converted_model_name_or_path str Yes HuggingFace model name for weight download
attn_cache_bytes int Yes Total KV cache memory budget in bytes
quant_type QuantType Yes Quantization format (NF4, INT8, NONE)
torch_dtype torch.dtype Yes Weight data type
device Union[str, torch.device] Yes GPU device for block execution

Outputs

Name Type Description
container ModuleContainer Running thread serving blocks via RPC endpoints
container.module_backends Dict[str, TransformerBackend] Loaded and quantized block backends
DHT state dict Blocks announced as ONLINE in hivemind DHT
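The attn_cache_bytes budget has to cover keys and values for every cached token in every served block. A back-of-envelope sketch for sizing it, assuming one hidden-size vector each for K and V per token per block (the helper is hypothetical, not part of Petals):

```python
def kv_cache_bytes(num_blocks, hidden_size, max_tokens, dtype_bytes=2):
    """Rough KV-cache budget: keys + values (factor of 2), one hidden-size
    vector each, per cached token, per served block.
    Hypothetical helper for estimation only."""
    return 2 * hidden_size * dtype_bytes * max_tokens * num_blocks

# e.g. 4 blocks with hidden_size=14336, 8192 cached tokens, fp16 weights
budget = kv_cache_bytes(4, 14336, 8192)
print(budget / 2**30)  # 1.75 (GiB)
```

A budget computed this way would be passed as attn_cache_bytes; MemoryCache then enforces it when allocating per-session KV caches.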

Usage Examples

Understanding the Internal Flow

# This is called internally by Server.run()
# The flow is:

# 1. Load pretrained blocks
for block_idx in block_indices:
    block = load_pretrained_block(
        model_name, block_idx,
        config=block_config,
        torch_dtype=torch_dtype,
        cache_dir=cache_dir,
    )
    block = convert_block(block, block_idx, block_config, quant_type, ...)
    backend = TransformerBackend(block, ...)
    backends[f"{dht_prefix}.{block_idx}"] = backend

# 2. Create memory cache
memory_cache = MemoryCache(max_size_bytes=attn_cache_bytes, max_alloc_timeout=max_alloc_timeout)

# 3. Create connection handler
handler = TransformerConnectionHandler(dht, backends, inference_max_length=..., ...)

# 4. Start serving
container = ModuleContainer(dht, backends, handler=handler, ...)
container.start()
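The create-then-start flow above follows a common Thread-subclass factory idiom: a classmethod assembles all state, then returns an already-running thread. A Petals-free toy sketch of the same pattern (all names here are hypothetical, not Petals APIs):

```python
import threading
import queue


class MiniContainer(threading.Thread):
    """Toy analogue of ModuleContainer: the factory assembles state and
    returns a thread that is already serving."""

    @classmethod
    def create(cls, *, items):
        inbox = queue.Queue()
        for item in items:
            inbox.put(item)
        container = cls(inbox)
        container.start()  # like ModuleContainer.create(), returned running
        return container

    def __init__(self, inbox):
        super().__init__(daemon=True)
        self.inbox = inbox
        self.processed = []

    def run(self):
        while True:
            item = self.inbox.get()
            if item is None:  # shutdown sentinel
                break
            self.processed.append(item * 2)

    def shutdown(self):
        self.inbox.put(None)
        self.join()


c = MiniContainer.create(items=[1, 2, 3])
c.shutdown()
print(c.processed)  # [2, 4, 6]
```

The payoff of the pattern is that callers (here, the analogue of Server.run()) never see a half-constructed server: by the time create() returns, the worker loop is live.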
