Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Solving Minimum-Cost Reach Avoid using Reinforcement Learning
Authors: Oswin So, Cheng Ge, Chuchu Fan
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results demonstrate that RC-PPO learns policies with comparable goal-reaching rates while achieving up to 57% lower cumulative costs compared to existing methods on a suite of minimum-cost reach-avoid benchmarks on the MuJoCo simulator. |
| Researcher Affiliation | Academia | Oswin So* Department of Aeronautics and Astronautics MIT EMAIL Cheng Ge* Department of Aeronautics and Astronautics MIT EMAIL Chuchu Fan Department of Aeronautics and Astronautics MIT EMAIL |
| Pseudocode | Yes | Algorithm 1 RC-PPO (Actor Critic) |
| Open Source Code | Yes | The project page can be found at https://oswinso.xyz/rcppo/. (...) Yes, the code used for generating the results in the paper has been provided. |
| Open Datasets | Yes | We compare RC-PPO with baseline methods on several minimum-cost reach-avoid environments. We consider an inverted pendulum (Pendulum), an environment from Safety Gym [69] (Point Goal) and two custom environments from MuJoCo [70] (Safety Hopper, Safety HalfCheetah) with added hazard regions and goal regions. We also consider a 3D quadrotor navigation task in a simulated wind field for an urban environment [71, 72] (Wind Field) and a Fixed-Wing avoid task from [59] with an additional goal region (Fixed Wing). |
| Dataset Splits | No | The paper focuses on reinforcement learning environments and does not describe traditional dataset splits for training, validation, or testing. |
| Hardware Specification | Yes | We run all our experiments on a computer with CPU AMD Ryzen Threadripper 3970X 32-Core Processor and with 4 GPUs of RTX3090. |
| Software Dependencies | No | Also, we implement all the environments in Jax [76] for better scalability and parallelization. The specific version number for Jax is not provided. |
| Experiment Setup | Yes | Table 1 ("Hyperparameter Settings for On-policy Algorithms") lists specific values: MLP units per hidden layer (256), number of hidden layers (2), discount factor γ (0.99), clip ratio (0.2), and learning rates. Section F.2 provides details on the X_threshold, β, and C_fail values. |
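
The hyperparameter values quoted in the Experiment Setup row could be recorded in a small configuration object, as in the minimal sketch below. The class name `OnPolicyHyperparams` and its field names are hypothetical and are not taken from the authors' code. Learning rates and the Section F.2 values are left out because this summary does not state them.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class OnPolicyHyperparams:
    """Hypothetical container for the values quoted from Table 1 of the paper."""

    hidden_units: int = 256        # MLP units per hidden layer
    num_hidden_layers: int = 2     # number of hidden layers
    discount_gamma: float = 0.99   # discount factor (gamma)
    clip_ratio: float = 0.2        # clip ratio

    def layer_sizes(self) -> list[int]:
        """Return the hidden-layer widths, e.g. for constructing an MLP."""
        return [self.hidden_units] * self.num_hidden_layers


config = OnPolicyHyperparams()
print(asdict(config))
print(config.layer_sizes())
```

Freezing the dataclass keeps the recorded settings immutable, which may help when logging them alongside experiment results.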