Implementation:Turboderp org Exllamav2 ExLlamaV2WebSocketServer

From Leeroopedia
Knowledge Sources
Domains: Server, WebSocket, Inference
Last Updated: 2026-02-15 00:00 GMT

Overview

ExLlamaV2WebSocketServer is a WebSocket server that accepts JSON requests over persistent connections and dispatches them to action handlers for streaming text generation.

Description

This class sets up a WebSocket server using the websockets library. It binds to a specified IP address and port, then listens for incoming JSON messages. Each received message is parsed and dispatched asynchronously via websocket_actions.dispatch() to the appropriate handler (echo, estimate_token, lefttrim_token, infer, stop).
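The dispatch step can be pictured as a lookup from the action field to a coroutine. The following is a minimal sketch under stated assumptions, not the contents of websocket_actions: the HANDLERS table, the handler bodies, the "stopped" and "error" fields, and the FakeSocket class are all hypothetical and exist only to make the example runnable without a network.

```python
import asyncio
import json


class FakeSocket:
    """Hypothetical stand-in for a WebSocket connection; records sent frames."""

    def __init__(self):
        self.sent = []

    async def send(self, data):
        self.sent.append(data)


async def echo(request, ws, response):
    # Hypothetical handler: send the prepared response back unchanged.
    await ws.send(json.dumps(response))


async def stop(request, ws, response):
    # Hypothetical handler: "stopped" is an assumed field, for illustration only.
    response["stopped"] = True
    await ws.send(json.dumps(response))


# Assumed lookup table; the real module may route actions differently.
HANDLERS = {"echo": echo, "stop": stop}


async def dispatch(request, ws):
    # Copy the optional identifiers so a client can match replies to requests.
    response = {"action": request.get("action")}
    for key in ("request_id", "response_id"):
        if key in request:
            response[key] = request[key]

    handler = HANDLERS.get(request.get("action"))
    if handler is None:
        response["error"] = "unknown action"  # assumed error shape
        await ws.send(json.dumps(response))
        return
    await handler(request, ws, response)


def run_demo(message):
    """Parse one JSON message, dispatch it, and return the decoded replies."""
    ws = FakeSocket()
    asyncio.run(dispatch(json.loads(message), ws))
    return [json.loads(frame) for frame in ws.sent]
```

Calling run_demo('{"action": "echo", "request_id": "r1"}') returns a single reply carrying the same action and request_id.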

The server maintains several key resources:

  • model - The loaded ExLlamaV2 model instance
  • tokenizer - The ExLlamaV2Tokenizer for encoding/decoding text
  • cache - The ExLlamaV2Cache for KV cache management
  • generator - An ExLlamaV2StreamingGenerator for incremental text generation
  • model_lock - An asyncio.Lock ensuring only one inference request uses the model at a time
  • stop_signal - A threading.Event for interrupting active generation
  • active_requests - A list tracking in-flight asyncio tasks, pruned on each new message
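To illustrate how a threading.Event can interrupt generation, here is a minimal sketch of cooperative stopping. It assumes the generation loop checks the event between tokens and clears it before starting; both are assumptions of this sketch, and the generate and request_stop_after functions are hypothetical rather than part of the server.

```python
import asyncio
import threading


async def generate(tokens, stop_signal, sink):
    # Assumption of this sketch: clear any stale stop request before starting.
    stop_signal.clear()
    for tok in tokens:
        if stop_signal.is_set():
            break  # a stop request arrived; end generation early
        sink.append(tok)
        # Yield to the event loop so other tasks (such as a stop handler) can run.
        await asyncio.sleep(0)


async def request_stop_after(sink, n, stop_signal):
    # Hypothetical stand-in for a "stop" action arriving after n tokens.
    while len(sink) < n:
        await asyncio.sleep(0)
    stop_signal.set()


async def demo():
    stop_signal = threading.Event()
    sink = []
    await asyncio.gather(
        generate(["a", "b", "c", "d", "e"], stop_signal, sink),
        request_stop_after(sink, 2, stop_signal),
    )
    return sink
```

Because the loop only checks the event between tokens, a stop takes effect at the next token boundary rather than immediately.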

The main() coroutine handles each WebSocket connection, creating a new asyncio task for each incoming request. This allows multiple requests to be queued while the model lock serializes actual inference.
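The task-per-message pattern described above can be sketched as follows. This is a miniature under stated assumptions, not the server's code: MiniServer, its handle() method, the "id" field, and the log list are hypothetical, and a short sleep stands in for inference.

```python
import asyncio
import json


class MiniServer:
    """Hypothetical miniature: one task per message, one lock around the model."""

    def __init__(self):
        self.model_lock = asyncio.Lock()
        self.active_requests = []
        self.log = []

    async def handle(self, request):
        # Only one request at a time may hold the lock and "use the model".
        async with self.model_lock:
            self.log.append(("start", request["id"]))
            await asyncio.sleep(0.01)  # stands in for inference
            self.log.append(("end", request["id"]))

    async def main(self, messages):
        for message in messages:
            request = json.loads(message)
            # Prune finished tasks, then track the new one.
            self.active_requests = [t for t in self.active_requests if not t.done()]
            self.active_requests.append(asyncio.create_task(self.handle(request)))
        # The demo waits for everything; a real server would keep reading messages.
        await asyncio.gather(*self.active_requests)


async def demo():
    server = MiniServer()  # created inside the running loop
    await server.main(['{"id": 1}', '{"id": 2}', '{"id": 3}'])
    return server.log
```

All three tasks are created before any of them runs, yet the log shows each request starting only after the previous one has ended, which is the serialization the lock provides.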

Usage

Use ExLlamaV2WebSocketServer to expose an ExLlamaV2 model over WebSocket for real-time streaming inference. It suits chat applications, interactive demos, and any client that benefits from low-latency bidirectional communication. Call serve() to start the asyncio event loop and begin accepting connections; the call blocks and runs the loop indefinitely.

Code Reference

Source Location

Signature

class ExLlamaV2WebSocketServer:

    ip: str
    port: int
    model: ExLlamaV2
    tokenizer: ExLlamaV2Tokenizer
    cache: ExLlamaV2Cache
    generator: ExLlamaV2StreamingGenerator
    stop_signal: threading.Event
    model_lock: asyncio.Lock
    active_requests: list

    def __init__(
        self,
        ip: str,
        port: int,
        model: ExLlamaV2,
        tokenizer: ExLlamaV2Tokenizer,
        cache: ExLlamaV2Cache,
    ): ...

    def serve(self): ...

    async def main(self, websocket, path): ...

Import

from exllamav2.server.websocket import ExLlamaV2WebSocketServer

I/O Contract

__init__()

Parameter | Type | Description
ip | str | IP address to bind the WebSocket server to (e.g., "0.0.0.0")
port | int | Port number to listen on (e.g., 7862)
model | ExLlamaV2 | Loaded ExLlamaV2 model instance
tokenizer | ExLlamaV2Tokenizer | Tokenizer for the model
cache | ExLlamaV2Cache | KV cache for inference

main()

Parameter | Type | Description
websocket | WebSocketServerProtocol | The WebSocket connection object
path | str | The URL path of the connection (unused)

Message Format (JSON Input)

Field | Type | Description
action | str | Action to dispatch: "echo", "estimate_token", "lefttrim_token", "infer", "stop"
request_id | str | Optional. Request identifier echoed in the response
response_id | str | Optional. Response identifier echoed in the response
... | various | Additional fields depending on the action (see Turboderp_org_Exllamav2_WebSocket_Actions)
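As an illustration of the message format, the sketch below builds a request as a JSON string using only the standard library. The build_message helper and the client-side check against KNOWN_ACTIONS are hypothetical conveniences, not part of exllamav2; the action names come from the table above, and any extra fields are passed through unchanged.

```python
import json

# Action names listed in the message format table.
KNOWN_ACTIONS = {"echo", "estimate_token", "lefttrim_token", "infer", "stop"}


def build_message(action, request_id=None, response_id=None, **fields):
    """Hypothetical helper: serialize one request for the WebSocket server."""
    if action not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    message = {"action": action, **fields}
    # The identifiers are optional, so include them only when given.
    if request_id is not None:
        message["request_id"] = request_id
    if response_id is not None:
        message["response_id"] = response_id
    return json.dumps(message)


# Example: an "infer" request with a client-chosen request_id.
payload = build_message("infer", request_id="42", text="Hello, world!", max_new_tokens=100)
```

The resulting string can be sent as a single text frame over an open connection.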

Usage Examples

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.server.websocket import ExLlamaV2WebSocketServer

# Load model, tokenizer, and cache
model_dir = "/path/to/model"  # placeholder: directory containing the model files
config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)

# Create and start the WebSocket server
server = ExLlamaV2WebSocketServer(
    ip="0.0.0.0",
    port=7862,
    model=model,
    tokenizer=tokenizer,
    cache=cache,
)
server.serve()  # Blocks, runs asyncio event loop forever

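# Client-side (JavaScript example):
# const ws = new WebSocket("ws://localhost:7862");
# // Wait for the connection to open; calling send() earlier throws.
# ws.onopen = () => ws.send(JSON.stringify({
#     action: "infer",
#     text: "Hello, world!",
#     max_new_tokens: 100,
#     stream: true,
#     temperature: 0.7,
# }));
# ws.onmessage = (event) => console.log(event.data);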
