GPU-accelerated simulation at scale

How we built infrastructure to run thousands of physics simulations in parallel.

Training robust robot policies requires massive amounts of simulation data. A single manipulation task might need billions of environment steps before the policy converges. Running these simulations sequentially would take years.

At Rebel Labs, we've built infrastructure that runs thousands of physics simulations in parallel on GPUs. This post explains how we achieved this scale and the engineering challenges we solved along the way.

The challenge of parallel simulation

Traditional physics simulators like MuJoCo were designed primarily for single-threaded CPU execution. They are fast for an individual simulation, but they don't scale well across many cores or GPUs.

The naive approach—running many simulator instances in separate processes—introduces significant overhead from inter-process communication and memory duplication. For robotics workloads where we need tight integration between simulation and learning, this overhead becomes a bottleneck.

Our approach

We built a custom simulation backend that runs entirely on the GPU. The key insight is that physics simulation is inherently parallel: computing contact forces, integrating dynamics, and rendering observations can all be vectorized across thousands of environments.
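To make the vectorization idea concrete, here is a minimal sketch (not our actual backend) using JAX: a single-environment step function over toy point-mass dynamics is mapped across thousands of environments with `vmap` and compiled with `jit`, so one GPU kernel launch advances every environment at once. The dynamics and all names here are illustrative.

```python
import jax
import jax.numpy as jnp

def step(state, action, dt=0.01):
    # Toy point-mass dynamics: state = (position, velocity).
    pos, vel = state
    vel = vel + action * dt   # integrate acceleration
    pos = pos + vel * dt      # integrate velocity
    return (pos, vel)

# Vectorize one environment step across a batch of environments,
# then compile the batched step into a single fused GPU program.
batched_step = jax.jit(jax.vmap(step))

n_envs = 4096
pos = jnp.zeros((n_envs, 3))
vel = jnp.zeros((n_envs, 3))
actions = jnp.ones((n_envs, 3))

pos, vel = batched_step((pos, vel), actions)
print(pos.shape)  # (4096, 3)
```

A real backend additionally vectorizes contact solving and rendering, but the pattern is the same: write the per-environment computation once, and let the batching transform spread it across the accelerator.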

Our system achieves:

Domain randomization at scale

Parallel simulation also enables aggressive domain randomization. We can vary physics parameters, lighting conditions, and sensor noise across all environments simultaneously. This produces policies that transfer better to real hardware.
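In a vectorized simulator, randomization is just another batched input: sample a parameter per environment and pass it alongside the state. The sketch below (again illustrative, not our production code) randomizes mass per environment in JAX; the ranges and dynamics are assumptions for the example.

```python
import jax
import jax.numpy as jnp

def step(state, action, mass, dt=0.01):
    # Toy dynamics where per-environment mass changes the response:
    # heavier environments accelerate less for the same action.
    pos, vel = state
    vel = vel + (action / mass) * dt
    pos = pos + vel * dt
    return (pos, vel)

# Batch over state, action, and the randomized mass parameter.
batched_step = jax.jit(jax.vmap(step, in_axes=(0, 0, 0)))

n_envs = 1024
key = jax.random.PRNGKey(0)
# Draw a different mass for every environment, e.g. uniform in [0.5, 2.0].
mass = jax.random.uniform(key, (n_envs,), minval=0.5, maxval=2.0)

pos = jnp.zeros((n_envs,))
vel = jnp.zeros((n_envs,))
actions = jnp.ones((n_envs,))
pos, vel = batched_step((pos, vel), actions, mass)
```

The same mechanism extends to friction coefficients, sensor noise scales, or lighting parameters: each is one more per-environment array, resampled as often as desired at negligible cost.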

In our experiments, policies trained with GPU-accelerated domain randomization show 3x better sim-to-real transfer compared to policies trained with fixed simulation parameters.

What this means for you

With RebelAI, you get access to this infrastructure out of the box. Define your robot and task, and the platform automatically parallelizes your simulations across available GPUs. No infrastructure setup required.

We're continuing to push the boundaries of simulation scale. Stay tuned for more updates on our research.