
Implementation:InternLM Lmdeploy Gateway

From Leeroopedia


Knowledge Sources
Domains Inference Engine, Request Routing
Last Updated 2026-02-07 15:00 GMT

Overview

Implements the request routing gateway that distributes incoming inference requests across multiple engine queues, manages session-to-queue bindings for stateful inference, and processes asynchronous signals.

Description

The Gateway is the central request dispatch layer in TurboMind. It manages a pool of RequestQueue instances and routes requests to appropriate queues based on session state.

SequenceBinding (helper class): A thread-safe map that tracks which session IDs are bound to which queue ranks. It supports find(), bind(), and unbind() operations, all protected by a mutex. This enables stateful (multi-turn) inference by ensuring subsequent requests for the same session go to the same engine queue.
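
The mutex-protected map described above could look roughly like the following. This is a minimal sketch rather than the lmdeploy implementation: the class name SequenceBindingSketch, the -1 return value for an unbound session, and the rule that unbind() only removes a binding that points at the given rank are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

// Illustrative sketch only; names and policies here are assumptions.
class SequenceBindingSketch {
public:
    // Returns the queue rank bound to seq_id, or -1 if there is no binding (assumed convention).
    int find(uint64_t seq_id)
    {
        std::lock_guard<std::mutex> lock{mutex_};
        auto it = map_.find(seq_id);
        return it == map_.end() ? -1 : it->second;
    }

    // Binds every session in seq_ids to the given queue rank.
    void bind(const std::vector<uint64_t>& seq_ids, int rank)
    {
        std::lock_guard<std::mutex> lock{mutex_};
        for (uint64_t id : seq_ids) {
            map_[id] = rank;
        }
    }

    // Removes a binding only if it currently points at the given rank (assumed policy).
    void unbind(const std::vector<uint64_t>& seq_ids, int rank)
    {
        std::lock_guard<std::mutex> lock{mutex_};
        for (uint64_t id : seq_ids) {
            auto it = map_.find(id);
            if (it != map_.end() && it->second == rank) {
                map_.erase(it);
            }
        }
    }

private:
    std::mutex                        mutex_;
    std::unordered_map<uint64_t, int> map_;  // session id -> queue rank
};
```

Taking the lock inside every operation keeps each call atomic on its own, which is enough for the lookup-then-route pattern described here because a session's binding only changes when it starts or is killed.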

Gateway class: Constructed with a queue pool size and a context factory. It creates size RequestQueue instances (one per engine instance) and spawns a dedicated signal thread for asynchronous notifications.

Key operations:

  • push(): Routes a request to the correct queue. For new sessions (start_flag set), it picks a queue by round-robin using an atomic counter. For existing sessions, it looks up the session's binding. If no bound queue is found, the request is not enqueued and an error state is reported for it instead.
  • pop(): Retrieves pending inference and kill requests from a specific queue. Supports data-parallel coordination via AllReduce across ranks, proceeding only when enough ranks are ready (controlled by dp_thr_). It also assigns monotonically increasing unique IDs to requests, binds new stateful sessions to the queue, and unbinds the sessions of completed kill requests.
  • cancel(): Atomically marks a request as canceled. If the request is still queued (flag == 0), it sends a cancel notification.
  • kill(): Sends a kill request to the queue bound to the target session.
  • notify(): Pushes signal callbacks into a buffer that the signal thread processes asynchronously.
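
Two of the operations above, the routing decision in push() and the atomic check in cancel(), can be sketched in isolation. This is a simplified illustration under stated assumptions, not the actual Gateway code: RouterSketch, RequestSketch, and cancel_sketch are hypothetical names, the binding table is a plain map without locking, and -1 as the "canceled" flag value is an assumed encoding (the text only states that 0 means "still queued").

```cpp
#include <atomic>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for the routing decision made by push().
struct RouterSketch {
    int                               size;      // number of queues
    std::atomic<uint64_t>             next{0};   // round-robin counter
    std::unordered_map<uint64_t, int> bindings;  // session id -> queue rank (no locking in this sketch)

    explicit RouterSketch(int n): size(n) {}

    // Returns the target queue rank, or -1 when an existing session has no binding.
    int route(uint64_t session_id, bool start_flag)
    {
        if (start_flag) {
            // New session: take the next slot in round-robin order.
            return static_cast<int>(next.fetch_add(1, std::memory_order_relaxed) % size);
        }
        auto it = bindings.find(session_id);
        return it == bindings.end() ? -1 : it->second;
    }
};

// Hypothetical request state: 0 = still queued (from the text); -1 = canceled (assumed).
struct RequestSketch {
    std::atomic<int> cancel_flag{0};
};

// Marks the request as canceled in one atomic step. Returns true when the request
// was still queued, i.e. when the caller should send the cancel notification itself.
bool cancel_sketch(RequestSketch& r)
{
    return r.cancel_flag.exchange(-1, std::memory_order_acq_rel) == 0;
}
```

Using exchange() means the read of the old flag and the write of the new one cannot be separated by another thread, so the "still queued" check and the cancel mark cannot race with an engine that is picking up the same request.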

The signal thread runs in a loop, consuming signals from the buffer and executing them within a context created by the factory.
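
A signal loop of this kind could be structured as in the sketch below. It is written under several assumptions and is not taken from the lmdeploy source: Signal is assumed to be an alias for std::function<void()>, SignalWorkerSketch is a hypothetical name, the context is assumed to be created once when the thread starts, and pending signals are assumed to be drained before the thread exits.

```cpp
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using Signal = std::function<void()>;  // assumed alias for a signal callback

// Illustrative sketch only; not the Gateway's actual signal thread.
class SignalWorkerSketch {
public:
    explicit SignalWorkerSketch(std::function<std::shared_ptr<void>()> ctx_factory):
        ctx_factory_{std::move(ctx_factory)}, thread_{[this] { run(); }}
    {
    }

    // Requests a stop, lets the worker drain pending signals, then joins it.
    ~SignalWorkerSketch()
    {
        {
            std::lock_guard<std::mutex> lock{mutex_};
            stop_ = true;
        }
        cv_.notify_one();
        thread_.join();
    }

    // Appends callbacks to the buffer and wakes the worker.
    void notify(std::vector<Signal> signals)
    {
        {
            std::lock_guard<std::mutex> lock{mutex_};
            for (auto& s : signals) {
                buffer_.push_back(std::move(s));
            }
        }
        cv_.notify_one();
    }

private:
    void run()
    {
        auto ctx = ctx_factory_();  // context stays alive while the loop runs (assumed)
        for (;;) {
            std::vector<Signal> batch;
            {
                std::unique_lock<std::mutex> lock{mutex_};
                cv_.wait(lock, [this] { return stop_ || !buffer_.empty(); });
                if (buffer_.empty()) {
                    return;  // stop requested and nothing left to run
                }
                batch.swap(buffer_);
            }
            // Run callbacks outside the lock so notify() is never blocked by them.
            for (auto& s : batch) {
                s();
            }
        }
    }

    std::mutex                             mutex_;
    std::condition_variable                cv_;
    std::vector<Signal>                    buffer_;
    bool                                   stop_ = false;
    std::function<std::shared_ptr<void>()> ctx_factory_;
    std::thread                            thread_;  // declared last so it starts after the other members
};
```

Swapping the buffer out under the lock and invoking the callbacks afterwards keeps the critical section short, which matters when engine threads call notify() on their hot path.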

Usage

The Gateway is created by the TurboMind top-level class and shared with all Engine instances. External callers submit requests via push(), cancel them via cancel(), or terminate sessions via kill(). Each Engine instance calls pop() with its own queue index to retrieve requests for processing.

Code Reference

Source Location

Signature

class SequenceBinding {
public:
    int find(uint64_t seq_id);
    void bind(const std::vector<uint64_t>& seq_ids, int rank);
    void unbind(const std::vector<uint64_t>& seq_ids, int rank);
};

class Gateway {
public:
    Gateway(int size, std::function<std::shared_ptr<void>()> ctx_factory);

    void shutdown();

    void push(std::shared_ptr<Request> r);

    void pop(std::vector<std::shared_ptr<Request>>& infer_reqs,
             std::vector<std::shared_ptr<Request>>& kill_reqs,
             unsigned                               max_infer,
             bool                                   blocking,
             bool&                                  abort,
             comm::HostComm&                        dp_group,
             int                                    qid);

    void cancel(std::shared_ptr<Request> r);

    void kill(std::shared_ptr<Request> r);

    void notify(std::vector<Signal> signals, bool pred = true);

    void set_threshold(int value);
};

Import

#include "src/turbomind/engine/gateway.h"

I/O Contract

Inputs

Name | Type | Required | Description
size | int | Yes | Number of queues (engine instances) to create
ctx_factory | std::function<std::shared_ptr<void>()> | Yes | Factory for creating context objects used by the signal thread
r (push/cancel/kill) | std::shared_ptr<Request> | Yes | The request to route, cancel, or kill
max_infer (pop) | unsigned | Yes | Maximum number of inference requests to dequeue at once
blocking (pop) | bool | Yes | Whether pop should block waiting for requests
dp_group (pop) | comm::HostComm& | Yes | Data-parallel communication group for multi-rank coordination
qid (pop) | int | Yes | Queue index to pop from

Outputs

Name | Type | Description
infer_reqs (pop) | std::vector<std::shared_ptr<Request>>& | Inference requests dequeued from the specified queue
kill_reqs (pop) | std::vector<std::shared_ptr<Request>>& | Kill requests dequeued from the specified queue
abort (pop) | bool& | Set to true if the queue is closed

Usage Examples

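#include "src/turbomind/engine/gateway.h"

// Create a gateway with 2 queues.
// SomeContext is a placeholder for the context type the signal thread needs.
Gateway gateway(2, []() { return std::make_shared<SomeContext>(); });

// Submit a request that starts a new session (routed round-robin)
auto request = std::make_shared<Request>();
request->session.start_flag = true;
gateway.push(request);

// Pop requests in the engine loop.
// dp_group is assumed to be an existing comm::HostComm for the data-parallel group.
std::vector<std::shared_ptr<Request>> infer_reqs, kill_reqs;
bool abort = false;
gateway.pop(infer_reqs, kill_reqs, /*max_infer=*/32, /*blocking=*/true, abort, dp_group, /*qid=*/0);
if (abort) {
    // The queue has been closed; the engine loop should exit here
}

// Cancel a pending request
gateway.cancel(request);

// Shutdown the gateway
gateway.shutdown();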
