Heuristic: Google DeepMind MuJoCo Thread Pool Configuration
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Physics_Simulation |
| Last Updated | 2026-02-15 05:00 GMT |
Overview
Use engine-internal thread pools (`npoolthread > 1`) only for complex models with many contacts/constraints; single-threaded execution is faster for simple models due to overhead.
Description
MuJoCo supports two levels of threading: external rollout parallelism (multiple `mjData` instances on separate threads) and engine-internal thread pools (via `mju_threadPoolCreate`). The thread pool enables parallel constraint solving within a single simulation step. However, the overhead of thread synchronization means that for simple models with few contacts, single-threaded execution is faster. The `testspeed` benchmark tool exposes both threading modes through its `nthread` and `npoolthread` parameters.
Usage
Use this heuristic when tuning simulation throughput for the native C API, deciding between rollout parallelism vs step parallelism, or configuring benchmarks with testspeed.
The Insight (Rule of Thumb)
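- Action: For most robotics models (< 100 contacts), leave the thread pool disabled by keeping `npoolthread` at 0 or 1. `testspeed` clamps values below 1 up to 1 and creates a pool only when `npoolthread > 1`.
- Value: Enable the thread pool (`npoolthread > 1`) only when `d->nefc` (the constraint count) is consistently in the hundreds or more.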
- Trade-off: Thread pool adds synchronization overhead per step. For simple models, this overhead exceeds the parallelism benefit.
- External threading: Use `nthread > 1` to run independent rollouts in parallel. Each thread owns its own `mjData`, so there is no shared state and no per-step synchronization.
- Maximum: `testspeed` clamps both `nthread` and `npoolthread` to the range [1, 512] (`maxthread = 512`).
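The rule of thumb above can be sketched as a small decision helper. This is only an illustration: `SuggestPoolThreads` is a hypothetical function, not part of MuJoCo; the threshold of 300 constraints is one assumed reading of "hundreds"; and capping the pool at one worker per island is an assumption made for this sketch, not documented MuJoCo behavior. The inputs are meant to be per-step averages of `d->nefc` and `d->nisland` that you have measured yourself.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper, not part of MuJoCo: suggests a value for the
// testspeed `npoolthread` argument from measured per-step averages.
// Both thresholds are illustrative assumptions, not MuJoCo constants.
int SuggestPoolThreads(double avg_nefc, double avg_nisland, int hw_threads) {
  assert(hw_threads >= 1);
  const double kMinConstraints = 300.0;  // assumed reading of "hundreds"
  const double kMinIslands = 2.0;        // disjoint groups are needed to parallelize

  // Few constraints, a single island, or a single core: the per-step
  // synchronization cost is likely to outweigh any parallel speedup.
  if (hw_threads < 2 || avg_nefc < kMinConstraints || avg_nisland < kMinIslands) {
    return 1;  // testspeed creates no pool unless npoolthread > 1
  }

  // Assumption: at most one worker per island, capped by the hardware.
  const int by_islands = static_cast<int>(avg_nisland);
  return std::max(2, std::min(hw_threads, by_islands));
}
```

Treat the result as a starting point for a `testspeed` comparison rather than a final setting; the measured steps per second decide.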
Reasoning
The island-based solver in MuJoCo can partition constraints into independent groups (islands) and solve them in parallel. This is beneficial when:
- The model has multiple disjoint contact groups (e.g., many separate objects).
- The constraint count per step is high enough that parallel solving saves more time than synchronization costs.
- The number of solver iterations is > 1, amplifying the benefit.
For typical articulated robots (humanoid, arm), the constraint graph is usually a single connected component, making island parallelism less effective.
External rollout parallelism (a separate `mjData` per thread) shares no simulation state between threads and therefore needs no synchronization inside a step. This makes it the natural first choice whenever you need multiple independent trajectories.
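The rollout pattern can be illustrated without MuJoCo at all. In the sketch below, `ToyState` and `ToyStep` are hypothetical stand-ins for `mjData` and a simulation step; only the threading structure is the point. Each worker touches its own state, and the only synchronization is the final `join`.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Toy stand-in for mjData: each rollout owns one, nothing is shared.
struct ToyState {
  double x = 0.0;
  double v = 1.0;
};

// Toy stand-in for a simulation step (explicit Euler with linear damping).
void ToyStep(ToyState& s, double dt) {
  s.x += dt * s.v;
  s.v -= dt * 0.5 * s.v;
}

// Runs `nthread` independent rollouts of `nstep` steps, one per thread,
// and returns the final position of each rollout.
std::vector<double> RunRollouts(int nthread, int nstep, double dt) {
  assert(nthread >= 1 && nstep >= 0);
  std::vector<ToyState> states(nthread);  // one private state per thread
  std::vector<std::thread> workers;
  workers.reserve(nthread);
  for (int id = 0; id < nthread; ++id) {
    // Each worker reads and writes only states[id]; no locks are needed.
    workers.emplace_back([&states, id, nstep, dt] {
      for (int i = 0; i < nstep; ++i) ToyStep(states[id], dt);
    });
  }
  for (std::thread& w : workers) w.join();  // the only synchronization point

  std::vector<double> final_x;
  final_x.reserve(states.size());
  for (const ToyState& s : states) final_x.push_back(s.x);
  return final_x;
}
```

Because the rollouts never interact, each thread produces the same trajectory it would produce when run alone.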
Code Evidence
Thread pool creation from `sample/testspeed.cc:219-223`:
// make and bind threadpool
if (npoolthread > 1) {
  mjThreadPool* threadpool = mju_threadPoolCreate(npoolthread);
  mju_bindThreadPool(d[id], threadpool);
}
Thread count clamping from `sample/testspeed.cc:29,187-188`:
const int maxthread = 512;
nthread = mjMAX(1, mjMIN(maxthread, nthread));
npoolthread = mjMAX(1, mjMIN(maxthread, npoolthread));
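The effect of this clamp on the pool-creation check can be reproduced in a few lines. The sketch below rewrites the `mjMAX`/`mjMIN` expression with the standard library; `ClampThreadCount` and `WouldCreatePool` are hypothetical names introduced here for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Mirrors the `maxthread` constant in the excerpt above.
constexpr int kMaxThread = 512;

// Same clamp as mjMAX(1, mjMIN(maxthread, n)), written with the standard library.
int ClampThreadCount(int requested) {
  const int n = std::max(1, std::min(kMaxThread, requested));
  assert(n >= 1 && n <= kMaxThread);  // the result always lies in [1, 512]
  return n;
}

// True when the clamped value would pass the `npoolthread > 1` check
// that guards thread pool creation in the excerpt above.
bool WouldCreatePool(int requested_npoolthread) {
  return ClampThreadCount(requested_npoolthread) > 1;
}
```

Under this reading, requesting 0 or 1 pool threads has the same outcome: both end up as 1 and no pool is created.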
Island-based iteration averaging from `sample/testspeed.cc:123-132`:
int nisland = mjMAX(1, mjMIN(d[id]->nisland, mjNISLAND));
if (nisland == 1 || nisland == 0) {
  iterations[id] += d[id]->solver_niter[0];
} else {
  mjtNum niter = 0;
  for (int j=0; j < nisland; j++) {
    niter += d[id]->solver_niter[j];
  }
  iterations[id] += niter / nisland;
}
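The averaging logic can be checked in isolation. In the sketch below, `solver_niter` is a plain vector standing in for `d->solver_niter`, and `island_cap` stands in for `mjNISLAND`, whose value is deliberately left as a parameter; `MeanIslandIterations` is a hypothetical name. A single loop covers both branches of the excerpt, because with one island the mean is just entry 0.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Average solver iterations over the islands of one step.
// `solver_niter` stands in for d->solver_niter, `island_cap` for mjNISLAND.
double MeanIslandIterations(const std::vector<int>& solver_niter,
                            int nisland, int island_cap) {
  assert(island_cap >= 1);
  // Clamp to [1, island_cap]: a step with zero islands still reads entry 0.
  const int n = std::max(1, std::min(nisland, island_cap));
  assert(static_cast<int>(solver_niter.size()) >= n);

  double total = 0.0;  // floating point, so the division is not truncated
  for (int j = 0; j < n; ++j) total += solver_niter[j];
  return total / n;
}
```

Accumulating in floating point matters here: the excerpt uses `mjtNum` for `niter`, so `niter / nisland` keeps the fractional part of the average.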
Control noise generation for benchmarking from `sample/testspeed.cc:64-69`:
// convert rate and scale to discrete time (Ornstein-Uhlenbeck)
mjtNum rate = mju_exp(-m->opt.timestep / ctrl_noise_rate);
mjtNum scale = ctrl_noise_std * mju_sqrt(1-rate*rate);
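A short check shows why the `sqrt(1-rate*rate)` factor appears. The sketch assumes the control is updated each step as `x <- rate * x + scale * n` with `n` drawn from a standard normal; that update is not part of the excerpt above and is an assumption here. Under it, the stationary variance is `scale^2 / (1 - rate^2)`, which equals `ctrl_noise_std^2` regardless of the timestep. `OuCoefficients`, `DiscretizeOu`, and `StationaryVariance` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

struct OuCoefficients {
  double rate;   // per-step decay factor, in (0, 1)
  double scale;  // standard deviation of the per-step innovation
};

// Same two formulas as the excerpt above, with std:: math functions.
OuCoefficients DiscretizeOu(double timestep, double noise_rate, double noise_std) {
  assert(timestep > 0.0 && noise_rate > 0.0 && noise_std >= 0.0);
  const double rate = std::exp(-timestep / noise_rate);
  const double scale = noise_std * std::sqrt(1.0 - rate * rate);
  return {rate, scale};
}

// Stationary variance of x <- rate * x + scale * n with n ~ N(0, 1):
// solving var = rate^2 * var + scale^2 gives scale^2 / (1 - rate^2).
double StationaryVariance(const OuCoefficients& c) {
  return c.scale * c.scale / (1.0 - c.rate * c.rate);
}
```

For example, `DiscretizeOu(0.002, 0.1, 0.5)` yields a stationary variance of 0.25 up to rounding, and halving the timestep leaves that value unchanged, so the noise level does not depend on the step size.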