Heuristic:Google deepmind Mujoco Thread Pool Configuration

Knowledge Sources	MuJoCo testspeed.cc
Domains	Optimization, Physics_Simulation
Last Updated	2026-02-15 05:00 GMT

Overview

Use engine-internal thread pools (`npoolthread > 1`) only for complex models with many contacts or constraints. For simple models, single-threaded execution is faster because thread synchronization costs more than the parallel work saves.

Description

MuJoCo supports two levels of threading: external rollout parallelism (multiple `mjData` instances on separate threads) and engine-internal thread pools (via `mju_threadPoolCreate`). The thread pool enables parallel constraint solving within a single simulation step. However, the overhead of thread synchronization means that for simple models with few contacts, single-threaded execution is faster. The `testspeed` benchmark tool exposes both threading modes through its `nthread` and `npoolthread` parameters.
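
The trade-off can be made concrete with an idealized cost model. This is an illustrative assumption, not a measurement of MuJoCo: let $T_s$ be the serial time spent on parallelizable work in one step, $n$ the number of pool threads, and $c$ a fixed per-step synchronization cost. All three symbols are introduced here only for illustration, and the model assumes the work splits perfectly across threads.

```latex
T_p = \frac{T_s}{n} + c,
\qquad
T_p < T_s \iff T_s > \frac{n}{n-1}\,c \quad (n > 1).
```

With $n = 4$, the pool pays off only once the parallelizable work exceeds $4c/3$. Real solvers parallelize imperfectly, so the practical threshold is higher than this bound suggests.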

Usage

Use this heuristic when tuning simulation throughput for the native C API, deciding between rollout parallelism vs step parallelism, or configuring benchmarks with testspeed.

The Insight (Rule of Thumb)

Action: For most robotics models (< 100 contacts), keep the thread pool disabled by leaving `npoolthread` at 0 or 1. testspeed clamps the value to at least 1 and creates a pool only when `npoolthread > 1`, so both settings run single-threaded.
Value: Enable the thread pool (`npoolthread > 1`) only when `d->nefc` (the constraint count) is consistently high (hundreds or more).
Trade-off: The thread pool adds synchronization overhead to every step. For simple models, this overhead exceeds the parallelism benefit.
External threading: Use `nthread > 1` to run independent rollouts in parallel. Each thread owns its `mjData`, so the rollouts need no per-step synchronization.
Maximum: testspeed clamps both `nthread` and `npoolthread` to the range [1, 512] (`maxthread`).
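
The rule of thumb above can be sketched as a small selection helper. This is a minimal sketch, not part of MuJoCo: `suggestPoolThreads`, `clampThreadCount`, and the default threshold of 300 constraints are hypothetical names and values chosen to match the "hundreds or more" guidance. Only the clamp expression mirrors what testspeed does.

```cpp
#include <algorithm>
#include <cassert>

// Upper bound testspeed applies to both nthread and npoolthread.
constexpr int kMaxThread = 512;

// Mirrors testspeed's clamp: mjMAX(1, mjMIN(maxthread, n)).
int clampThreadCount(int n) {
  return std::max(1, std::min(kMaxThread, n));
}

// Hypothetical helper: choose npoolthread from the mean constraint count.
// nefc_threshold is an assumed tuning knob, not a MuJoCo constant.
int suggestPoolThreads(double mean_nefc, int hardware_threads,
                       double nefc_threshold = 300.0) {
  assert(mean_nefc >= 0.0);
  if (mean_nefc < nefc_threshold) {
    return 1;  // pool stays off: it is only created when npoolthread > 1
  }
  return clampThreadCount(hardware_threads);
}
```

In practice, `mean_nefc` would come from averaging `d->nefc` over a representative run, and the threshold should be tuned by benchmarking the actual model rather than taken from this sketch.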

Reasoning

The island-based solver in MuJoCo can partition constraints into independent groups (islands) and solve them in parallel. This is beneficial when:

The model has multiple disjoint contact groups (e.g., many separate objects).
The constraint count per step is high enough that parallel solving saves more time than synchronization costs.
The solver runs more than one iteration per step, so each island carries more work per synchronization point.

For typical articulated robots (humanoid, arm), the constraint graph is usually a single connected component, making island parallelism less effective.
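
To illustrate why the shape of the constraint graph matters, the sketch below counts islands with a union-find structure. It is an assumption-laden toy, not MuJoCo's island discovery code: nodes stand in for whatever units the solver couples, each constraint is an edge between two nodes, and `DisjointSet` and `countIslands` are hypothetical names.

```cpp
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

// Union-find over nodes; each constraint links the two nodes it couples.
class DisjointSet {
 public:
  explicit DisjointSet(int n) : parent_(n) {
    std::iota(parent_.begin(), parent_.end(), 0);
  }
  int find(int x) {
    while (parent_[x] != x) {
      parent_[x] = parent_[parent_[x]];  // path halving
      x = parent_[x];
    }
    return x;
  }
  void unite(int a, int b) { parent_[find(a)] = find(b); }

 private:
  std::vector<int> parent_;
};

// Counts connected components that contain at least one constraint.
int countIslands(int nnode,
                 const std::vector<std::pair<int, int>>& constraints) {
  DisjointSet set(nnode);
  for (const auto& c : constraints) {
    assert(c.first >= 0 && c.first < nnode);
    assert(c.second >= 0 && c.second < nnode);
    set.unite(c.first, c.second);
  }
  std::vector<bool> seen(nnode, false);
  int islands = 0;
  for (const auto& c : constraints) {
    int root = set.find(c.first);
    if (!seen[root]) {
      seen[root] = true;
      ++islands;
    }
  }
  return islands;
}
```

A chain of constraints such as 0-1, 1-2, 2-3 collapses into one island and leaves nothing to distribute across threads, while disjoint pairs such as 0-1, 2-3, 4-5 yield three islands that could be solved independently.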

External rollout parallelism (separate `mjData` per thread) needs no per-step synchronization because the threads share no mutable simulation state. It typically scales well with core count and is the first choice whenever you need multiple trajectories.
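
The rollout pattern can be sketched without MuJoCo by giving each thread its own state object. In this minimal sketch, `RolloutState` stands in for the role `mjData` plays, and `stepRollout` is a hypothetical toy update, not a physics step. The point is only the ownership structure: every thread reads and writes its own state and its own result slot.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in for per-thread simulation state (the role mjData plays).
struct RolloutState {
  std::uint64_t rng;
  double x;
};

// Hypothetical toy step: damped scalar driven by deterministic pseudo-noise.
void stepRollout(RolloutState& s) {
  s.rng = s.rng * 6364136223846793005ULL + 1442695040888963407ULL;  // LCG
  double unit = static_cast<double>(s.rng >> 40) /
                static_cast<double>(1ULL << 24);  // value in [0, 1)
  s.x = 0.99 * s.x + 0.01 * (unit - 0.5);
}

double runRollout(std::uint64_t seed, int nstep) {
  RolloutState s{seed, 0.0};
  for (int i = 0; i < nstep; ++i) stepRollout(s);
  return s.x;
}

// One thread per rollout; each thread writes only to its own result slot.
std::vector<double> runRolloutsParallel(int nthread, int nstep) {
  assert(nthread >= 1);
  std::vector<double> result(nthread, 0.0);
  std::vector<std::thread> workers;
  workers.reserve(nthread);
  for (int id = 0; id < nthread; ++id) {
    workers.emplace_back([&result, id, nstep] {
      result[id] = runRollout(static_cast<std::uint64_t>(id) + 1, nstep);
    });
  }
  for (auto& w : workers) w.join();
  return result;
}
```

Because no state is shared, the only synchronization is the final `join`, and each parallel rollout produces the same value it would produce when run sequentially with the same seed.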

Code Evidence

Thread pool creation from `sample/testspeed.cc:219-223`:

```cpp
// make and bind threadpool
if (npoolthread > 1) {
    mjThreadPool* threadpool = mju_threadPoolCreate(npoolthread);
    mju_bindThreadPool(d[id], threadpool);
}
```

Thread count clamping from `sample/testspeed.cc:29,187-188`:

```cpp
const int maxthread = 512;

nthread = mjMAX(1, mjMIN(maxthread, nthread));
npoolthread = mjMAX(1, mjMIN(maxthread, npoolthread));
```

Island-based iteration averaging from `sample/testspeed.cc:123-132`:

```cpp
int nisland = mjMAX(1, mjMIN(d[id]->nisland, mjNISLAND));
if (nisland == 1 || nisland == 0) {
    iterations[id] += d[id]->solver_niter[0];
} else {
    mjtNum niter = 0;
    for (int j=0; j < nisland; j++) {
        niter += d[id]->solver_niter[j];
    }
    iterations[id] += niter / nisland;
}
```
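
The averaging in this excerpt can be reproduced in isolation. The sketch below is a standalone restatement with plain standard-library types: `meanIterationsPerIsland` is a hypothetical name, and `island_cap` plays the role of `mjNISLAND`, with its value supplied by the caller rather than taken from MuJoCo.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Mean solver iterations over the islands of a single step.
// island_cap stands in for mjNISLAND; the caller chooses its value.
double meanIterationsPerIsland(const std::vector<int>& solver_niter,
                               int nisland, int island_cap) {
  assert(island_cap >= 1);
  // Same clamp as the excerpt: at least one slot, at most island_cap.
  int n = std::max(1, std::min(nisland, island_cap));
  assert(static_cast<int>(solver_niter.size()) >= n);
  double total = 0.0;
  for (int j = 0; j < n; ++j) total += solver_niter[j];
  return total / n;
}
```

Averaging per island keeps the reported iteration count comparable between models with one island and models with many, since a plain sum would grow with the island count even when each island converges equally fast.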

Control noise generation for benchmarking from `sample/testspeed.cc:64-69`:

```cpp
// convert rate and scale to discrete time (Ornstein-Uhlenbeck)
mjtNum rate = mju_exp(-m->opt.timestep / ctrl_noise_rate);
mjtNum scale = ctrl_noise_std * mju_sqrt(1-rate*rate);
```
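
The choice of `scale` makes the noise keep a fixed standard deviation regardless of the timestep. The sketch below checks this under a stated assumption: the control is updated as `x <- rate * x + scale * w` with `w` drawn from a standard normal, which is the usual discrete Ornstein-Uhlenbeck update but is not shown in the excerpt. `NoiseCoefficients`, `discretizeNoise`, and `varianceAfter` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

struct NoiseCoefficients {
  double rate;   // per-step decay factor in (0, 1)
  double scale;  // per-step standard deviation of the injected noise
};

// Same formulas as the testspeed excerpt, written with plain doubles.
NoiseCoefficients discretizeNoise(double timestep, double noise_rate,
                                  double noise_std) {
  assert(timestep > 0.0 && noise_rate > 0.0 && noise_std >= 0.0);
  double rate = std::exp(-timestep / noise_rate);
  double scale = noise_std * std::sqrt(1.0 - rate * rate);
  return {rate, scale};
}

// Variance of x after nstep updates x <- rate * x + scale * w,
// with w ~ N(0, 1) independent per step and x starting at 0.
double varianceAfter(const NoiseCoefficients& c, int nstep) {
  double variance = 0.0;
  for (int i = 0; i < nstep; ++i) {
    variance = c.rate * c.rate * variance + c.scale * c.scale;
  }
  return variance;
}
```

The variance recursion has the fixed point `scale^2 / (1 - rate^2)`, which equals `noise_std^2` by construction, so the stationary noise level matches `ctrl_noise_std` whether the model uses a 1 ms or a 10 ms timestep.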
