Implementation:Turboderp org Exllamav2 ExLlamaV2WebSocketServer
| Knowledge Sources | |
|---|---|
| Domains | Server, WebSocket, Inference |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
ExLlamaV2WebSocketServer is a WebSocket server that accepts JSON requests over persistent connections and dispatches them to action handlers for streaming text generation.
Description
This class sets up a WebSocket server using the websockets library. It binds to a specified IP address and port, then listens for incoming JSON messages. Each received message is parsed and dispatched asynchronously via websocket_actions.dispatch() to the appropriate handler (echo, estimate_token, lefttrim_token, infer, stop).
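The dispatch step can be pictured as a lookup from the action field to a coroutine. The following is a minimal sketch, not the code in websocket_actions: the handler bodies, the HANDLERS table, the single-argument dispatch() signature, and the error response are all assumptions made for illustration.

```python
import asyncio
import json

# Hypothetical stand-in handlers. The real handlers do actual work
# (tokenizing, generating); these only return a marker.
async def handle_echo(request: dict) -> dict:
    return {"action": "echo", "received": True}

async def handle_stop(request: dict) -> dict:
    return {"action": "stop", "stopped": True}

# Assumed lookup table from the "action" field to a coroutine.
HANDLERS = {"echo": handle_echo, "stop": handle_stop}

async def dispatch(raw_message: str) -> dict:
    """Parse one JSON message and route it by its "action" field."""
    request = json.loads(raw_message)
    action = request.get("action")
    handler = HANDLERS.get(action)
    if handler is None:
        # Error shape is an assumption, not the server's documented behavior.
        return {"error": f"unknown action: {action!r}"}
    response = await handler(request)
    # Copy optional identifiers so the client can match responses to requests.
    for key in ("request_id", "response_id"):
        if key in request:
            response[key] = request[key]
    return response

if __name__ == "__main__":
    print(asyncio.run(dispatch('{"action": "echo", "request_id": "r1"}')))
```

A table-based lookup keeps the routing logic separate from the handlers, so adding an action only means adding one entry.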
The server maintains several key resources:
- model - The loaded ExLlamaV2 model instance
- tokenizer - The ExLlamaV2Tokenizer for encoding/decoding text
- cache - The ExLlamaV2Cache for KV cache management
- generator - An ExLlamaV2StreamingGenerator for incremental text generation
- model_lock - An asyncio.Lock ensuring only one inference request uses the model at a time
- stop_signal - A threading.Event for interrupting active generation
- active_requests - A list tracking in-flight asyncio tasks, pruned on each new message
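How model_lock and stop_signal might interact can be sketched with a toy generation loop. This is an illustration under stated assumptions, not the server's implementation: ToyServer, toy_infer, the point at which the stop signal is cleared, and the per-step check are all hypothetical.

```python
import asyncio
import threading

class ToyServer:
    """Hypothetical stand-in holding only the two synchronization objects."""
    def __init__(self):
        self.model_lock = asyncio.Lock()
        self.stop_signal = threading.Event()
        self.log = []

async def toy_infer(server: ToyServer, name: str, steps: int) -> int:
    """Pretend to generate `steps` tokens while holding the model lock."""
    async with server.model_lock:
        # Assumption: a new request starts with the stop flag cleared.
        server.stop_signal.clear()
        produced = 0
        for _ in range(steps):
            if server.stop_signal.is_set():
                break  # another request asked generation to stop
            produced += 1
            server.log.append(name)
            # Yield to the event loop, as sending a streamed chunk would.
            await asyncio.sleep(0)
        return produced

async def demo_serialized() -> list:
    server = ToyServer()
    await asyncio.gather(
        asyncio.create_task(toy_infer(server, "a", 3)),
        asyncio.create_task(toy_infer(server, "b", 3)),
    )
    return server.log

async def demo_stop() -> int:
    server = ToyServer()
    task = asyncio.create_task(toy_infer(server, "long", 1000))
    await asyncio.sleep(0)      # let the task start generating
    server.stop_signal.set()    # what a "stop" handler might do
    return await task

if __name__ == "__main__":
    print(asyncio.run(demo_serialized()))
    print(asyncio.run(demo_stop()))
```

In this sketch the two requests never interleave because the second one waits on the lock, and the stop flag ends the long request well before its step limit.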
The main() coroutine handles each WebSocket connection, creating a new asyncio task for each incoming request. This allows multiple requests to be queued while the model lock serializes actual inference.
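The task-per-message pattern described above can be sketched without a real socket. The class, the fake_socket generator, and the handler body below are hypothetical; only the overall shape (prune finished tasks, parse the message, spawn a task) follows the description.

```python
import asyncio
import json

class ToyConnectionHandler:
    """Hypothetical stand-in for the per-connection loop."""
    def __init__(self):
        self.active_requests = []
        self.handled = []

    async def handle(self, request: dict) -> None:
        # Placeholder for the real dispatch call.
        await asyncio.sleep(0)
        self.handled.append(request["action"])

    async def main(self, messages) -> None:
        async for message in messages:
            # Drop tasks that have already finished before adding a new one.
            self.active_requests = [
                task for task in self.active_requests if not task.done()
            ]
            request = json.loads(message)
            task = asyncio.create_task(self.handle(request))
            self.active_requests.append(task)

async def fake_socket(payloads):
    """Async generator standing in for iterating over a WebSocket."""
    for payload in payloads:
        yield json.dumps(payload)
        await asyncio.sleep(0)

async def demo() -> list:
    handler = ToyConnectionHandler()
    await handler.main(fake_socket([{"action": "echo"}, {"action": "infer"}]))
    # Wait for any tasks still in flight when the connection closes.
    await asyncio.gather(*handler.active_requests)
    return handler.handled

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the loop only spawns tasks and never awaits them, a slow request does not stop the connection from reading the next message, such as a stop request.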
Usage
Use ExLlamaV2WebSocketServer to expose an ExLlamaV2 model over WebSocket for real-time streaming inference. It is suitable for chat applications, interactive demos, or any client that benefits from low-latency bidirectional communication. Call serve() to start the event loop and begin accepting connections.
Code Reference
Source Location
- Repository: Turboderp_org_Exllamav2
- File: exllamav2/server/websocket.py
- Lines: 1-65
Signature
class ExLlamaV2WebSocketServer:
    ip: str
    port: int
    model: ExLlamaV2
    tokenizer: ExLlamaV2Tokenizer
    cache: ExLlamaV2Cache
    generator: ExLlamaV2StreamingGenerator
    stop_signal: threading.Event
    model_lock: asyncio.Lock
    active_requests: list

    def __init__(
        self,
        ip: str,
        port: int,
        model: ExLlamaV2,
        tokenizer: ExLlamaV2Tokenizer,
        cache: ExLlamaV2Cache,
    ): ...

    def serve(self): ...

    async def main(self, websocket, path): ...
Import
from exllamav2.server.websocket import ExLlamaV2WebSocketServer
I/O Contract
__init__()
| Parameter | Type | Description |
|---|---|---|
| ip | str | IP address to bind the WebSocket server to (e.g., "0.0.0.0") |
| port | int | Port number to listen on (e.g., 7862) |
| model | ExLlamaV2 | Loaded ExLlamaV2 model instance |
| tokenizer | ExLlamaV2Tokenizer | Tokenizer for the model |
| cache | ExLlamaV2Cache | KV cache for inference |
main()
| Parameter | Type | Description |
|---|---|---|
| websocket | WebSocketServerProtocol | The WebSocket connection object |
| path | str | The URL path of the connection (unused) |
Message Format (JSON Input)
| Field | Type | Description |
|---|---|---|
| action | str | Action to dispatch: "echo", "estimate_token", "lefttrim_token", "infer", "stop" |
| request_id | str | (Optional) Request identifier echoed in response |
| response_id | str | (Optional) Response identifier echoed in response |
| ... | various | Additional fields depending on the action (see Turboderp_org_Exllamav2_WebSocket_Actions) |
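A client can assemble messages in this format with a small helper. The sketch below is hypothetical: build_request and KNOWN_ACTIONS are illustrative names, and the client-side validation is an added convenience rather than something the server requires. The extra fields in the usage line (text, max_new_tokens) are taken from the infer example on this page.

```python
import json

# Action names listed in the message format table.
KNOWN_ACTIONS = {"echo", "estimate_token", "lefttrim_token", "infer", "stop"}

def build_request(action: str, request_id: str = None, **fields) -> str:
    """Serialize one request message as a JSON string.

    `fields` carries the action-specific parameters; this helper does not
    check them, since they differ per action.
    """
    if action not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    message = {"action": action, **fields}
    if request_id is not None:
        # Optional identifier the server echoes back in its response.
        message["request_id"] = request_id
    return json.dumps(message)

if __name__ == "__main__":
    print(build_request("infer", request_id="r1", text="Hello, world!", max_new_tokens=100))
```

Setting request_id on every message lets a client match responses to requests when several are in flight on the same connection.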
Usage Examples
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.server.websocket import ExLlamaV2WebSocketServer

# Placeholder: replace with the directory containing your model files
model_dir = "/path/to/model"

# Load model, tokenizer, and cache
config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)

# Create and start the WebSocket server
server = ExLlamaV2WebSocketServer(
    ip="0.0.0.0",
    port=7862,
    model=model,
    tokenizer=tokenizer,
    cache=cache,
)
server.serve()  # Blocks, runs asyncio event loop forever

# Client-side (JavaScript example):
# const ws = new WebSocket("ws://localhost:7862");
# ws.send(JSON.stringify({
#     action: "infer",
#     text: "Hello, world!",
#     max_new_tokens: 100,
#     stream: true,
#     temperature: 0.7,
# }));
Related Pages
- Turboderp_org_Exllamav2_WebSocket_Actions - Action handler functions dispatched by this server
- Turboderp_org_Exllamav2_ExLlamaV2DynamicGeneratorAsync - Alternative async generation approach for more advanced use cases