Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Efficient Safe Meta-Reinforcement Learning: Provable Near-Optimality and Anytime Safety

Authors: Siyuan Xu, Minghui Zhu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, our algorithm achieves superior optimality, strict safety compliance, and substantial computational gains up to 70% faster training and 50% faster testing across diverse locomotion and navigation benchmarks. We conduct experiments on seven scenarios including navigation tasks with collision avoidance and locomotion tasks to verify these advantages of the proposed algorithms.
Researcher Affiliation | Academia | Siyuan Xu & Minghui Zhu, School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA 16801, EMAIL
Pseudocode | Yes | Algorithm 1: Safe meta-policy training algorithm. Algorithm 2 states the algorithm for the safe policy adaptation. Algorithm 3: Safe policy adaptation algorithm with the first-order approximation. Algorithm 4: An alternative algorithm of meta-training.
Open Source Code | Yes | We provide open access to the data and code with sufficient instructions in the supplemental material.
Open Datasets | Yes | We conduct experiments on four high-dimensional locomotion scenarios, including Half-Cheetah, Humanoid, Hopper, Swimmer, and three navigation scenarios with collision avoidance, including Point-Circle, Car-Circle-Hazard, and Point-Button in Gym and Safety-Gymnasium libraries [10, 23].
Dataset Splits | Yes | In each iteration, we sample 10 tasks from the task distribution. Therefore, for each meta-training iteration, the number of the sampled state-action pairs is 50k or 80k. The models are trained for up to 300 meta-iterations in the meta-training. Therefore, the overall number of sampled state-action pairs is from 15M or 24M. The meta-policy is tested on 20 tasks and is adapted by 20 iterations for each task in the meta-test.
Hardware Specification | Yes | All experiments are executed on a computer with a 5.20 GHz Intel Core i12 CPU.
Software Dependencies | No | The paper mentions "Gym library [10]", "Safety-Gymnasium library [23]", and "TRPO [44]" but does not provide specific version numbers for these software components. Therefore, the information is insufficient for full reproducibility regarding software dependencies.
Experiment Setup | Yes | The neural network policy has two hidden layers of size 64, with tanh nonlinearities. The horizon is 200, with 40 rollouts per policy adaptation step for all problems in the high-dimensional locomotion scenarios. The horizon is 500, with 10 rollouts per policy adaptation step for all problems in the navigation scenarios. The discount factor γ = 0.99. In each iteration, we sample 10 tasks from the task distribution. ... For the TRPO in meta-parameter optimization, we use the KL-divergence constraint as δ = 1e-3. We set λ = λc1 in the safe policy adaptation As in problem (1). Table 2 shows the setting of λ and dτ in As for each scenario.
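
The sample counts quoted in the Dataset Splits row (50k or 80k per meta-iteration, 15M or 24M overall) can be cross-checked against the horizon and rollout settings in the Experiment Setup row. The sketch below is only an arithmetic check, not the authors' code. The class name `ScenarioSetup` and its fields are hypothetical, and it assumes one batch of rollouts per sampled task in each meta-iteration, which is the reading under which the quoted figures line up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioSetup:
    """Sampling settings quoted in the table (names here are illustrative)."""
    horizon: int                 # time steps per rollout
    rollouts_per_step: int       # rollouts per policy adaptation step
    tasks_per_iteration: int = 10
    meta_iterations: int = 300   # the source says "up to 300", so this is an upper bound

    def samples_per_iteration(self) -> int:
        # Assumption: one batch of rollouts per sampled task per meta-iteration.
        return self.horizon * self.rollouts_per_step * self.tasks_per_iteration

    def total_samples(self) -> int:
        return self.samples_per_iteration() * self.meta_iterations


locomotion = ScenarioSetup(horizon=200, rollouts_per_step=40)
navigation = ScenarioSetup(horizon=500, rollouts_per_step=10)

for name, setup in (("locomotion", locomotion), ("navigation", navigation)):
    print(f"{name}: {setup.samples_per_iteration():,} per iteration, "
          f"{setup.total_samples():,} overall")
# locomotion: 80,000 per iteration, 24,000,000 overall
# navigation: 50,000 per iteration, 15,000,000 overall
```

Under that assumption, the 80k and 24M figures correspond to the locomotion scenarios and the 50k and 15M figures to the navigation scenarios.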
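
For readers who want a concrete picture of the policy architecture named in the Experiment Setup row (two hidden layers of size 64 with tanh nonlinearities), here is a minimal standard-library sketch. It is not the authors' implementation. The function names, the uniform weight initialization, the linear output layer, and the observation and action dimensions (8 and 2) are all assumptions made for illustration.

```python
import math
import random


def init_mlp(obs_dim, act_dim, hidden=(64, 64), seed=0):
    """Build (weights, biases) pairs for a tanh MLP.

    The hidden sizes follow the table; the initialization is an assumption.
    """
    rng = random.Random(seed)
    sizes = [obs_dim, *hidden, act_dim]
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, obs):
    """Map one observation to the network output (e.g. an action mean)."""
    x = list(obs)
    last = len(layers) - 1
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < last:
            # tanh on hidden layers only; a linear output layer is an assumption.
            x = [math.tanh(v) for v in x]
    return x


# Hypothetical dimensions, chosen only to exercise the code.
policy = init_mlp(obs_dim=8, act_dim=2)
action_mean = forward(policy, [0.1] * 8)
print(len(policy), len(action_mean))
```

A real setup would take the observation and action dimensions from each Gym or Safety-Gymnasium environment and would typically use a tensor library rather than plain lists.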