Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SPiDR: A Simple Approach for Zero-Shot Safety in Sim-to-Real Transfer

Authors: Yarden As, Chengrui (Ray) Qu, Benjamin Unger, Dongho Kang, Max van der Hart, Laixi Shi, Stelian Coros, Adam Wierman, Andreas Krause

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments on sim-to-sim benchmarks and two distinct real-world robotic platforms, we demonstrate that SPiDR effectively ensures safety despite the sim-to-real gap while maintaining strong performance.
Researcher Affiliation | Academia | Yarden As (ETH Zurich), Chengrui Qu (Caltech), Benjamin Unger (ETH Zurich), Dongho Kang (ETH Zurich), Max van der Hart (ETH Zurich), Laixi Shi (Johns Hopkins University), Stelian Coros (ETH Zurich), Adam Wierman (Caltech), Andreas Krause (ETH Zurich)
Pseudocode | Yes | Algorithm 1 SPiDR: Safe Sim-to-Real via Pessimistic Domain Randomization
Open Source Code | Yes | A URL with our anonymized open-source code is provided in https://anonymous.4open.science/r/safe-sim2real-1EAC. The link includes specific installation instructions. Furthermore, for the Unitree Go1 experiments, we provide the policies we deployed on the real platform in https://anonymous.4open.science/r/safe-sim2real-1EAC.
Open Datasets | Yes | Finally, we extensively evaluate SPiDR on well established simulated continuous control benchmarks, including the RWRL benchmark (Dulac-Arnold et al., 2020), Safety Gym (Ray et al., 2019) and Race Car environments (Kabzan et al., 2020), where SPiDR consistently satisfies safety constraints while achieving strong task performance.
Dataset Splits | Yes | Simulating the sim-to-sim gap. We simulate the sim-to-sim gap as follows. In PointGoal2 and the RWRL tasks, we follow a similar approach to Queeney and Benosman (2024) and introduce in evaluation an additional dynamics parameter (e.g., mass or motor gains) that is not encountered during training. In the Race Car environment, the agent is trained using a simplified bicycle model, but evaluated on a more realistic variant that incorporates tire forces and detailed motor dynamics. Further details on the tasks and their sim-to-sim gap design can be found in Appendices J to L. Table 1: Domain randomization parameters and ranges used during training and evaluation. + and denote additive and multiplicative terms respectively. By fixing the damping and mass parameters in training, but not in evaluation, we simulate both lack of knowledge of ξ, but also modeling mismatch due to imperfect simulation. Table 2: Domain randomization parameters and ranges used during training and evaluation across tasks from RWRL. + and denote additive and multiplicative terms respectively.
Hardware Specification | Yes | All experiments are run using MuJoCo XLA (Freeman et al., 2021), enabling us to train SAC for 5M and PPO for 200M environment steps in under an hour on a single NVIDIA RTX 4090 GPU.
Software Dependencies | No | The paper mentions specific software frameworks and algorithms like MuJoCo XLA, SAC, CRPO, and PPO, but does not provide explicit version numbers for these software components or other dependencies such as programming languages or libraries.
Experiment Setup | Yes | Unless otherwise specified, all experiments use SAC (Haarnoja et al., 2019; Nauman et al., 2024) in combination with either CRPO (Xu et al., 2021) or a simple primal-dual constrained optimization method (Bertsekas, 2016). Each policy is trained for roughly one hour on an NVIDIA RTX4090 GPU. For both SAC and PPO, we train policies with a primal-dual solver, using different values of λ to penalize the uncertainty. Each λ value is trained across five different random seeds. Target commands are uniformly sampled from the ranges [ 0.45, 0.2, 1.3], corresponding to the forward velocity, lateral velocity, and yaw rate, respectively. Each sampled command is applied for a fixed number of 1000 control steps. All training details for the experiments are provided in Appendices G, H and J to L.
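The sim-to-sim gap described under Dataset Splits (an extra dynamics parameter perturbed in evaluation but held fixed in training) can be illustrated with a minimal sketch. This is not the authors' code: the parameter names, nominal values, and ranges below are hypothetical placeholders, and the actual ranges are those in the paper's Tables 1 and 2.

```python
import random

# Hypothetical nominal dynamics parameters; names and values are illustrative only.
NOMINAL = {"friction": 1.0, "motor_gain": 1.0, "mass": 1.0}

# Each entry is (kind, low, high): "+" adds the sampled term, "*" scales by it.
# These ranges are assumptions, not the values reported in the paper.
TRAIN_RANGES = {"friction": ("*", 0.8, 1.2), "motor_gain": ("+", -0.1, 0.1)}

# Evaluation reuses the training ranges and additionally perturbs a parameter
# (here, mass) that stays at its nominal value throughout training.
EVAL_RANGES = {**TRAIN_RANGES, "mass": ("*", 0.5, 1.5)}


def sample_params(ranges, rng):
    """Draw one set of dynamics parameters from the given randomization ranges."""
    params = dict(NOMINAL)
    for name, (kind, low, high) in ranges.items():
        term = rng.uniform(low, high)
        if kind == "+":
            params[name] = params[name] + term
        else:
            params[name] = params[name] * term
    return params


rng = random.Random(0)
train_params = sample_params(TRAIN_RANGES, rng)  # mass stays nominal
eval_params = sample_params(EVAL_RANGES, rng)    # mass is perturbed as well
```

Under these assumptions, a policy trained only on draws from TRAIN_RANGES never sees a non-nominal mass, so evaluating on draws from EVAL_RANGES probes robustness to an unmodeled parameter.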