Principle: BigScience Workshop Petals Interactive Generation
| Knowledge Sources | |
|---|---|
| Domains | NLP, Dialogue, Text_Generation |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
A multi-turn conversational generation pattern where an inference session persists across dialogue turns, maintaining KV cache state for efficient context-aware response generation.
Description
Interactive Generation extends standard autoregressive generation for multi-turn dialogue by maintaining a persistent InferenceSession across conversation turns. This is critical for chatbot applications where:
- Context accumulates: Each turn adds to the conversation history
- KV cache persists: The session's KV cache on remote servers stores attention states from all previous turns, avoiding re-computation
- Efficient generation: Only new input tokens need to be processed through the transformer blocks
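The cache-persistence idea in the list above can be illustrated with a toy model. This is a self-contained sketch, not Petals code: the `Session` class here merely counts processed tokens to show that each turn only pays for its own new input.

```python
# Toy model of a persistent KV cache across dialogue turns.
# Illustrative sketch only -- not the Petals API. "Session" stands in
# for an InferenceSession whose cache lives on remote servers.

class Session:
    def __init__(self, max_length):
        self.max_length = max_length
        self.cache = []          # stands in for per-layer KV tensors
        self.position = 0        # tokens processed so far

    def process(self, new_tokens):
        """Append only the new tokens; cached ones are never re-processed."""
        assert self.position + len(new_tokens) <= self.max_length
        self.cache.extend(new_tokens)
        self.position += len(new_tokens)
        return len(new_tokens)   # work done this turn

session = Session(max_length=2048)
work_turn1 = session.process(["User:", "Hello!"])               # 2 new tokens
work_turn2 = session.process(["User:", "Tell", "me", "more."])  # 4 new tokens
print(session.position)  # 6: the cache holds both turns, but each
                         # turn only processed its own new tokens
```

The same shape carries over to the real protocol: context accumulates in the cache while per-turn compute stays proportional to the new tokens alone.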
The pattern differs from single-shot generation in that:
- The session is opened once and reused across multiple generate() calls
- The session's position property tracks how many tokens have been processed
- Prompt tuning embeddings (if trained) are included in the session context
Usage
Use this principle when building interactive chatbots or multi-turn dialogue systems with distributed Petals models. The session should be opened with sufficient max_length to accommodate the entire conversation (all turns combined).
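One way to size max_length is a back-of-envelope budget over the expected conversation. The token counts below are illustrative assumptions, not Petals defaults:

```python
# Rough sizing of max_length for an entire multi-turn conversation.
# All numbers here are assumed averages for illustration.
expected_turns = 10
tokens_per_user_msg = 50      # assumed average user-message length
tokens_per_response = 150     # assumed average model-response length

needed = expected_turns * (tokens_per_user_msg + tokens_per_response)
max_length = 2048
print(needed, needed <= max_length)  # 2000 True: 2048 covers ~10 such turns
```

If the budget is exceeded mid-conversation, the session cannot grow past max_length, so it is safer to over-provision than to reopen a session and lose the cached context.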
Theoretical Basis
Multi-turn session protocol:
```python
# Abstract interactive generation (pseudocode sketch of the session
# protocol; names approximate the Petals API rather than reproducing it)
with inference_session(max_length=2048) as session:
    # Turn 1
    input1 = tokenize("User: Hello!\nAssistant:")
    response1 = generate(input1, session)  # KV cache now stores turn 1

    # Turn 2: tokenize ONLY the new text. response1's tokens are already
    # in the session's KV cache, so re-feeding them would process them twice.
    input2 = tokenize("\nUser: Tell me more.\nAssistant:")
    response2 = generate(input2, session)  # KV cache now holds turns 1+2

    # session.position advances with each turn;
    # no tokens from previous turns are ever re-computed.
```
KV cache efficiency: for a conversation of T turns averaging L tokens each, re-processing the full history on every turn costs O((T*L)^2) = O(T^2 * L^2) attention operations on the final turn alone. With a persistent session, the final turn processes only its L new tokens against the cached context, costing O(L * T*L) = O(T * L^2). The same factor-of-T saving holds when the cost is summed over the whole conversation.
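The factor-of-T saving can be checked numerically with the quadratic attention cost model (cost = new tokens x context length); T and L below are illustrative values:

```python
# Compare total attention work with and without a persistent session,
# using cost(new, context) = new * context as the attention cost model.
T, L = 8, 100   # turns and tokens per turn (illustrative values)

# Without a session: every turn re-processes the entire history of t*L
# tokens, paying quadratic attention over it.
no_session = sum((t * L) ** 2 for t in range(1, T + 1))

# With a session: turn t processes only its L new tokens, each attending
# to the t*L tokens now held in the cache.
with_session = sum(L * (t * L) for t in range(1, T + 1))

print(no_session / with_session)  # ratio grows linearly with T
```

With these values the ratio is about 5.7, and it scales linearly as T grows, matching the factor-of-T analysis above.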