Implementation:InternLM Lmdeploy Gateway

Knowledge Sources	InternLM_Lmdeploy
Domains	Inference Engine, Request Routing
Last Updated	2026-02-07 15:00 GMT

Overview

Implements the request routing gateway that distributes incoming inference requests across multiple engine queues, manages session-to-queue bindings for stateful inference, and processes asynchronous signals.

Description

The Gateway is the central request dispatch layer in TurboMind. It manages a pool of RequestQueue instances and routes requests to appropriate queues based on session state.

SequenceBinding (helper class): A thread-safe map that tracks which session IDs are bound to which queue ranks. It supports find(), bind(), and unbind() operations, all protected by a mutex. This enables stateful (multi-turn) inference by ensuring subsequent requests for the same session go to the same engine queue.
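
The find/bind/unbind contract described above can be illustrated with a minimal sketch. It is not the lmdeploy source: the class name BindingSketch, the -1 "not bound" sentinel, and the rule that unbind() only removes entries still pointing at the given rank are all assumptions made for this example.

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for SequenceBinding (illustration only).
class BindingSketch {
public:
    // Returns the bound queue rank, or -1 (assumed sentinel) if unbound.
    int find(uint64_t seq_id) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = map_.find(seq_id);
        return it == map_.end() ? -1 : it->second;
    }

    // Binds every listed session ID to the given queue rank.
    void bind(const std::vector<uint64_t>& seq_ids, int rank) {
        std::lock_guard<std::mutex> lock(mutex_);
        for (uint64_t id : seq_ids) {
            map_[id] = rank;
        }
    }

    // Removes a binding only if it still points at `rank` (assumed rule).
    void unbind(const std::vector<uint64_t>& seq_ids, int rank) {
        std::lock_guard<std::mutex> lock(mutex_);
        for (uint64_t id : seq_ids) {
            auto it = map_.find(id);
            if (it != map_.end() && it->second == rank) {
                map_.erase(it);
            }
        }
    }

private:
    std::mutex                        mutex_;
    std::unordered_map<uint64_t, int> map_;
};
```

Taking the same mutex in all three operations keeps a lookup from observing a half-updated map while another thread binds or unbinds a batch of sessions.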

Gateway class: Constructed with a queue pool size and a context factory. It creates size RequestQueue instances (one per engine instance) and spawns a dedicated signal thread that delivers asynchronous notifications.

Key operations:

push(): Routes a request to the correct queue. For new sessions (start_flag set), it picks a queue by round-robin using an atomic counter. For existing sessions, it looks up the session's binding. If no queue is found, the request is not enqueued and an error state is reported through a notification.
pop(): Retrieves pending inference and kill requests from a specific queue. Supports data-parallel coordination via AllReduce across ranks, proceeding only when enough ranks are ready (controlled by dp_thr_). It assigns monotonically increasing unique IDs to the dequeued requests, binds new stateful sessions to the queue, and unbinds the sessions targeted by completed kill requests.
cancel(): Atomically marks a request as canceled. If the request is still queued (flag == 0), it sends a cancel notification.
kill(): Sends a kill request to the queue bound to the target session.
notify(): Pushes signal callbacks into a buffer that the signal thread processes asynchronously.
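
The routing rule in push() and the flag check in cancel() can be sketched as follows. This is an illustration under stated assumptions, not the Gateway implementation: RequestSketch, RouterSketch, cancel_if_queued, the -1 return for "no binding", and every flag value other than 0 (queued) are hypothetical names and encodings chosen for the example.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical request stand-in with only the fields this sketch needs.
struct RequestSketch {
    uint64_t         session_id = 0;
    bool             start_flag = false;
    std::atomic<int> flag{0};  // 0 = queued (from the page); -1 = canceled (assumed)
};

class RouterSketch {
public:
    explicit RouterSketch(unsigned size): size_(size) {}

    // New sessions rotate over the queues; continuing sessions reuse their
    // binding. Returns -1 when a continuing session has no binding.
    int route(const RequestSketch& r) {
        if (r.start_flag) {
            return static_cast<int>(next_.fetch_add(1, std::memory_order_relaxed) % size_);
        }
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = bound_.find(r.session_id);
        return it == bound_.end() ? -1 : it->second;
    }

    // Stands in for the binding step that the page attributes to pop().
    void bind(uint64_t session_id, int rank) {
        std::lock_guard<std::mutex> lock(mutex_);
        bound_[session_id] = rank;
    }

private:
    unsigned                          size_;
    std::atomic<unsigned>             next_{0};
    std::mutex                        mutex_;
    std::unordered_map<uint64_t, int> bound_;
};

// Marks the request canceled and reports whether it was still queued, in
// which case the caller would be the one to send the cancel notification.
bool cancel_if_queued(RequestSketch& r) {
    return r.flag.exchange(-1, std::memory_order_acq_rel) == 0;
}
```

A single atomic exchange both records the cancellation and reveals the previous state, so only one party can observe the request as still queued and the notification is not sent twice.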

The signal thread runs in a loop, consuming signals from the buffer and executing them within a context created by the factory.
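
A minimal sketch of such a loop is shown below, assuming a mutex-protected buffer and a condition variable. SignalLoopSketch is a hypothetical class, Signal is assumed to be a std::function<void()>, and creating one context per drained batch is an assumption of this example, not a statement about the actual source.

```cpp
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using Signal = std::function<void()>;  // assumed shape of a signal callback

// Hypothetical signal loop (illustration only).
class SignalLoopSketch {
public:
    explicit SignalLoopSketch(std::function<std::shared_ptr<void>()> ctx_factory):
        ctx_factory_(std::move(ctx_factory)), thread_([this] { run(); }) {}

    ~SignalLoopSketch() { shutdown(); }

    // Appends callbacks to the buffer and wakes the signal thread.
    void notify(std::vector<Signal> signals) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            for (auto& s : signals) {
                buffer_.push_back(std::move(s));
            }
        }
        cv_.notify_one();
    }

    // Closes the buffer, lets the thread drain pending signals, then joins.
    void shutdown() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        cv_.notify_one();
        if (thread_.joinable()) {
            thread_.join();
        }
    }

private:
    void run() {
        for (;;) {
            std::vector<Signal> batch;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return closed_ || !buffer_.empty(); });
                if (buffer_.empty()) {
                    return;  // closed and fully drained
                }
                batch.swap(buffer_);
            }
            auto ctx = ctx_factory_();  // context stays alive for this batch
            for (auto& s : batch) {
                s();
            }
        }
    }

    std::mutex                             mutex_;
    std::condition_variable                cv_;
    std::vector<Signal>                    buffer_;
    bool                                   closed_ = false;
    std::function<std::shared_ptr<void>()> ctx_factory_;
    std::thread                            thread_;  // declared last: starts after the other members exist
};
```

Running the callbacks outside the lock means a slow callback does not block threads that call notify() in the meantime.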

Usage

Created by the TurboMind top-level class and shared with all Engine instances. External callers submit requests via push(), cancel them via cancel(), or terminate sessions via kill(). Engine instances call pop() to retrieve requests for processing.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/engine/gateway.h
File: src/turbomind/engine/gateway.cc
Lines: gateway.h 1-111, gateway.cc 1-168

Signature

class SequenceBinding {
public:
    int find(uint64_t seq_id);
    void bind(const std::vector<uint64_t>& seq_ids, int rank);
    void unbind(const std::vector<uint64_t>& seq_ids, int rank);
};

class Gateway {
public:
    Gateway(int size, std::function<std::shared_ptr<void>()> ctx_factory);

    void shutdown();

    void push(std::shared_ptr<Request> r);

    void pop(std::vector<std::shared_ptr<Request>>& infer_reqs,
             std::vector<std::shared_ptr<Request>>& kill_reqs,
             unsigned                               max_infer,
             bool                                   blocking,
             bool&                                  abort,
             comm::HostComm&                        dp_group,
             int                                    qid);

    void cancel(std::shared_ptr<Request> r);

    void kill(std::shared_ptr<Request> r);

    void notify(std::vector<Signal> signals, bool pred = true);

    void set_threshold(int value);
};

Import

#include "src/turbomind/engine/gateway.h"

I/O Contract

Inputs

Name	Type	Required	Description
size	int	Yes	Number of queues (engine instances) to create
ctx_factory	std::function<std::shared_ptr<void>()>	Yes	Factory for creating context objects used by the signal thread
r (push/cancel/kill)	std::shared_ptr<Request>	Yes	The request to route, cancel, or kill
max_infer (pop)	unsigned	Yes	Maximum number of inference requests to dequeue at once
blocking (pop)	bool	Yes	Whether pop should block waiting for requests
dp_group (pop)	comm::HostComm&	Yes	Data-parallel communication group for multi-rank coordination
qid (pop)	int	Yes	Queue index to pop from

Outputs

Name	Type	Description
infer_reqs (pop)	std::vector<std::shared_ptr<Request>>&	Inference requests dequeued from the specified queue
kill_reqs (pop)	std::vector<std::shared_ptr<Request>>&	Kill requests dequeued from the specified queue
abort (pop)	bool&	Set to true if the queue is closed

Usage Examples

// Create a gateway with 2 queues (SomeContext is a placeholder context type)
Gateway gateway(2, []() { return std::make_shared<SomeContext>(); });

// Submit a request
auto request = std::make_shared<Request>();
request->session.start_flag = true;
gateway.push(request);

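// Pop requests in the engine loop (dp_group is a comm::HostComm supplied by the calling engine)
std::vector<std::shared_ptr<Request>> infer_reqs, kill_reqs;
bool abort = false;
gateway.pop(infer_reqs, kill_reqs, /*max_infer=*/32, /*blocking=*/true, abort, dp_group, /*qid=*/0);
if (abort) {
    // The queue has been closed; leave the engine loop
}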

// Cancel a pending request
gateway.cancel(request);

// Shutdown the gateway
gateway.shutdown();

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime
