Implementation:Turboderp org Exllamav2 ThreadPool

From Leeroopedia
Knowledge Sources
Domains Concurrency, C_Extension
Last Updated 2026-02-15 00:00 GMT

Overview

Header-only C++ library providing a ThreadPool class for asynchronous task execution and a Barrier class for thread synchronization, used throughout ExLlamaV2's C++ extension layer.

Description

threadpool.h defines two concurrency primitives:

ThreadPool implements a classic thread pool pattern with:

  • A configurable number of worker threads created at construction time via ThreadPool(size_t threads).
  • A task queue protected by a std::mutex and signaled via a std::condition_variable.
  • A templated enqueue() method that accepts any callable with arguments and returns a std::future for the result, allowing callers to submit work and retrieve results asynchronously.
  • Worker threads that loop, waiting on the condition variable for new tasks. A worker exits only once stop is true and the queue is empty, so tasks already queued at shutdown still run.
  • A destructor that sets the stop flag under the mutex, notifies all workers, and joins every thread to ensure clean shutdown.
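
The pattern in the list above can be sketched in a few dozen lines. This is a minimal illustration, not the contents of threadpool.h: the class name SketchPool, its member names, and the helper sum_of_squares_on_pool are hypothetical, and the return type is deduced with decltype rather than the std::result_of form shown in the signature below.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the pattern described above; not the code in threadpool.h.
class SketchPool
{
public:
    explicit SketchPool(std::size_t threads)
    {
        for (std::size_t i = 0; i < threads; ++i)
            workers.emplace_back([this] { worker_loop(); });
    }

    ~SketchPool()
    {
        {
            std::lock_guard<std::mutex> lock(mtx);
            stop = true;                       // set under the mutex so no worker misses it
        }
        cv.notify_all();
        for (auto& w : workers) w.join();      // workers drain the queue before exiting
    }

    template <class F, class... Args>
    auto enqueue(F&& f, Args&&... args) -> std::future<decltype(f(args...))>
    {
        using R = decltype(f(args...));
        // packaged_task is move-only, so it is held by shared_ptr to fit std::function.
        auto task = std::make_shared<std::packaged_task<R()>>(
            std::bind(std::forward<F>(f), std::forward<Args>(args)...));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mtx);
            tasks.emplace([task] { (*task)(); });
        }
        cv.notify_one();
        return result;
    }

private:
    void worker_loop()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mtx);
                cv.wait(lock, [this] { return stop || !tasks.empty(); });
                if (stop && tasks.empty()) return;
                job = std::move(tasks.front());
                tasks.pop();
            }
            job();                             // run the task outside the lock
        }
    }

    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex mtx;
    std::condition_variable cv;
    bool stop = false;
};

// Squares 0..7 on a 4-thread pool and returns the sum of the results.
int sum_of_squares_on_pool()
{
    SketchPool pool(4);
    std::vector<std::future<int>> results;
    for (int i = 0; i < 8; ++i)
        results.push_back(pool.enqueue([](int x) { return x * x; }, i));
    int sum = 0;
    for (auto& r : results) sum += r.get();
    return sum;
}
```

Running each task outside the lock keeps the mutex held only for queue operations, so a long task does not block other workers or callers of enqueue().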

Barrier implements a reusable synchronization barrier with:

  • arrive_and_wait() -- Each thread increments a counter; when the counter reaches num_threads, the generation is advanced and all waiting threads are released via cv.notify_all(). Threads that arrive early wait on a condition variable gated by the generation counter, so a spurious wakeup cannot release a thread before the generation has advanced.
  • reset(int new_num_threads) -- Dynamically changes the thread count, resets the counter, advances the generation to unblock any currently waiting threads, and notifies all.
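
A generation-counter barrier of the kind described above might look like the following. This is a sketch under stated assumptions rather than the source itself: SketchBarrier, its member names, and the two_phase_exchange helper are hypothetical, and the counter is assumed to be reset to zero when the last thread arrives so the barrier can be reused.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the generation-counter barrier described above.
class SketchBarrier
{
public:
    explicit SketchBarrier(int n) : num_threads(n) {}

    void arrive_and_wait()
    {
        std::unique_lock<std::mutex> lock(mtx);
        const int my_generation = generation;
        if (++count == num_threads)
        {
            // Last arrival: re-arm the counter and open the barrier.
            count = 0;
            ++generation;
            cv.notify_all();
        }
        else
        {
            // A spurious wakeup re-checks the predicate and goes back to sleep.
            cv.wait(lock, [&] { return generation != my_generation; });
        }
    }

    void reset(int new_num_threads)
    {
        std::lock_guard<std::mutex> lock(mtx);
        num_threads = new_num_threads;
        count = 0;
        ++generation;                          // releases any thread still waiting
        cv.notify_all();
    }

private:
    std::mutex mtx;
    std::condition_variable cv;
    int num_threads;
    int count = 0;
    int generation = 0;
};

// Two-phase exchange: each thread publishes a value, waits at the barrier,
// then sums every thread's value. Returns true if every thread saw all values.
bool two_phase_exchange(int n)
{
    SketchBarrier barrier(n);
    std::vector<int> partial(n, 0), total(n, 0);
    std::vector<std::thread> threads;
    for (int t = 0; t < n; ++t)
        threads.emplace_back([&, t] {
            partial[t] = t + 1;                   // phase 1: write own slot
            barrier.arrive_and_wait();            // every slot is written past this point
            for (int v : partial) total[t] += v;  // phase 2: read all slots
            barrier.arrive_and_wait();            // second use of the same barrier
        });
    for (auto& th : threads) th.join();
    const int expected = n * (n + 1) / 2;
    for (int v : total)
        if (v != expected) return false;
    return true;
}
```

Each waiter compares against the generation it observed on arrival, which is what lets the same object serve consecutive phases without a fast thread from the next phase being mistaken for a late arrival of the current one.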

Usage

The ThreadPool is used by ExtTPContext (tensor parallelism context) to dispatch parallel operations across multiple GPU devices. The Barrier is used for cross-device synchronization points during tensor-parallel inference, ensuring all devices have completed a phase before proceeding.

Code Reference

Source Location

Signature

class ThreadPool
{
public:
    ThreadPool(size_t threads);
    ~ThreadPool();

    template<class F, class... Args>
    auto enqueue(F&& f, Args&&... args)
        -> std::future<typename std::result_of<F(Args...)>::type>;
};

class Barrier
{
public:
    Barrier(int num_threads);
    void arrive_and_wait();
    void reset(int new_num_threads);
};

Import

#include "threadpool.h"

I/O Contract

Class | Method | Input | Output | Description
ThreadPool | constructor | size_t threads | ThreadPool instance | Creates a pool with the specified number of worker threads
ThreadPool | enqueue(f, args...) | Callable + arguments | std::future<return_type> | Submits a task and returns a future for asynchronous result retrieval
ThreadPool | destructor | -- | -- | Sets the stop flag, notifies all workers, joins all threads
Barrier | constructor | int num_threads | Barrier instance | Creates a barrier for the specified number of participating threads
Barrier | arrive_and_wait() | -- | -- | Blocks until all threads have arrived; a generation counter keeps spurious wakeups from releasing a thread early
Barrier | reset(new_num_threads) | int new_num_threads | -- | Resets the barrier for a new thread count and unblocks any waiting threads

Usage Examples

#include <future>
#include <vector>
#include "threadpool.h"

int main()
{
    // Create a pool with 4 worker threads
    ThreadPool pool(4);

    // Submit tasks and collect futures
    std::vector<std::future<int>> results;
    for (int i = 0; i < 8; i++) {
        results.push_back(pool.enqueue([i] {
            // perform work on device i % 4
            return i * i;
        }));
    }

    // Collect results; get() blocks until the corresponding task has finished
    for (auto& f : results) {
        int result = f.get();
        (void) result;  // use the value here
    }

    // Barrier usage for synchronizing 4 threads
    Barrier barrier(4);
    // Each of the 4 participating threads calls:
    //     barrier.arrive_and_wait();  // blocks until all 4 arrive
    // Calling it from this thread alone would block forever.
    return 0;
}
