Principle: BigScience Workshop Petals Chatbot Model Loading
| Knowledge Sources | |
|---|---|
| Domains | NLP, Dialogue, Distributed_Computing |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Loading a distributed BLOOM causal language model configured for dialogue generation with prompt tuning support, enabling chatbot training and interactive conversation through the Petals network.
Description
Chatbot Model Loading adapts the distributed model loading principle for conversational AI tasks using the BLOOM architecture. The model is loaded with:
- Causal LM head: For next-token prediction during dialogue generation
- Prompt tuning embeddings: For adapting the model to dialogue style via trainable prefix tokens
- RemoteSequential transformer layers: Distributed across volunteer servers
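The prompt-tuning component above can be sketched independently of Petals: a block of `pre_seq_len` trainable prefix vectors is prepended to the frozen token embeddings before the distributed transformer runs. A minimal toy sketch (all names, shapes, and the stub embedding are illustrative, not the Petals API):

```python
# Toy sketch of prompt tuning: prepend trainable prefix embeddings to
# frozen token embeddings. Shapes are illustrative only.
PRE_SEQ_LEN = 16   # number of trainable prefix tokens
HIDDEN = 4         # toy hidden size (real BLOOM hidden sizes are much larger)

# Trainable prefix: the ONLY parameters updated during chatbot training
prompt_embeddings = [[0.0] * HIDDEN for _ in range(PRE_SEQ_LEN)]

def embed_tokens(token_ids):
    """Frozen token-embedding lookup (stub: toy one-hot vectors)."""
    return [[float(t == d) for d in range(HIDDEN)] for t in token_ids]

def forward_inputs(token_ids):
    """Build the transformer input: [trainable prefix ; token embeddings]."""
    return prompt_embeddings + embed_tokens(token_ids)

inputs = forward_inputs([1, 2, 3])
# The distributed transformer would now see PRE_SEQ_LEN + 3 positions,
# and gradients flow back only into prompt_embeddings.
```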
The key distinction from standard distributed model loading is the dual-mode capability:
- Training mode: The model uses _RemoteSequentialAutogradFunction for computing gradients through the distributed blocks, training only prompt embeddings on dialogue data
- Generation mode: The model uses InferenceSession for efficient multi-turn autoregressive generation with KV cache persistence across conversation turns
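The generation-mode benefit of KV cache persistence can be illustrated with a toy session object (a hypothetical stand-in for an inference session, not the Petals API): because earlier positions stay cached across turns, each turn only processes its newly appended tokens.

```python
# Toy model of KV-cache persistence across conversation turns: each turn,
# only the newly appended tokens are run through the transformer, because
# earlier positions are already held in the session's KV cache.
class ToySession:
    def __init__(self):
        self.cached_len = 0           # positions already in the KV cache
        self.processed_per_turn = []  # bookkeeping for illustration

    def step(self, new_tokens):
        # Only the uncached suffix is processed this turn.
        self.processed_per_turn.append(len(new_tokens))
        self.cached_len += len(new_tokens)
        return self.cached_len

session = ToySession()
session.step([10, 11, 12, 13])  # turn 1: user prompt (4 tokens processed)
session.step([20, 21])          # turn 2: only the 2 new tokens processed
# Without a persistent session, turn 2 would reprocess all 6 tokens.
```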
Usage
Use this principle when building a chatbot or conversational agent using a large BLOOM model distributed across the Petals network. The model supports both training on dialogue datasets (via prompt tuning) and interactive generation with session-based multi-turn conversation.
Theoretical Basis
Causal LM for dialogue:
In dialogue generation, the model is trained on concatenated conversation turns x_1, ..., x_T with the standard causal language modeling objective

L(θ) = -Σ_{t ∈ A} log p_θ(x_t | x_{<t})

where A is the set of assistant-response token positions, i.e. the loss is computed only on assistant response tokens (using a label mask that ignores user and system tokens).
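The label mask can be made concrete: positions belonging to non-assistant turns get the conventional ignore label -100, so only assistant tokens contribute to the loss. A minimal sketch (the turn format and role names are illustrative):

```python
# Build causal-LM labels for a concatenated dialogue: loss is computed
# only on assistant tokens; other positions are masked with IGNORE = -100.
IGNORE = -100

def make_labels(turns):
    """turns: list of (role, token_ids); returns (input_ids, labels)."""
    input_ids, labels = [], []
    for role, toks in turns:
        input_ids.extend(toks)
        if role == "assistant":
            labels.extend(toks)                  # train on these positions
        else:
            labels.extend([IGNORE] * len(toks))  # masked out of the loss
    return input_ids, labels

dialogue = [("user", [5, 6, 7]), ("assistant", [8, 9])]
input_ids, labels = make_labels(dialogue)
# input_ids == [5, 6, 7, 8, 9]; labels == [-100, -100, -100, 8, 9]
```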
Prompt tuning for dialogue style:
# Abstract chatbot training setup (sketch: load_distributed_bloom stands in
# for a loader such as petals.DistributedBloomForCausalLM.from_pretrained)
model = load_distributed_bloom(
    model_name,
    task="causal_lm",
    tuning_mode="ptune",  # prompt tuning: only prefix embeddings are trained
    pre_seq_len=16,       # number of trainable prefix tokens
)
# Pass the tuning config at load time so the prompt embeddings are created
# along with the model, rather than mutating model.config afterwards.
# Training: optimize model.prompt_embeddings on dialogue data
# Generation: use model.inference_session() for multi-turn conversation