reproducibilityindex.ai

Excluding the Irrelevant: Focusing Reinforcement Learning through Continuous Action Masking

Authors: Roland Stolz, Hanna Krasowski, Jakob Thumm, Michael Eichelbeck, Philipp Gassert, Matthias Althoff

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Using proximal policy optimization (PPO), we evaluate our methods on four control tasks, where the relevant action set is computed based on the system dynamics and a relevant state set. Our experiments show that the three action masking methods achieve higher final rewards and converge faster than the baseline without action masking.
Researcher Affiliation	Academia	1Technical University of Munich, 2University of California, Berkeley, 3Munich Center for Machine Learning
Pseudocode	No	The paper describes methods and processes but does not include explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	We attached the code for the experiments including a detailed readme in the supplementary material to this submission.
Open Datasets	Yes	We evaluate our methods on four control tasks... MuJoCo Walker2D environment [39]. We extend the environment with a termination criterion, which ends an episode when the action violates the constraint ‖a‖₂ ≤ α_p... We use the standard parameters of the gymnasium implementation.
Dataset Splits	No	The paper mentions hyperparameter optimization but does not explicitly provide details about train/validation/test dataset splits (e.g., percentages or counts) or the methodology for such splits.
Hardware Specification	Yes	All experiments are run on a machine with an Intel(R) Xeon(R) Platinum 8380 2.30 GHz processor and 2 TB RAM.
Software Dependencies	No	For the experiments, we extend the Stable-Baselines3 [29] implementation of proximal policy optimization (PPO) [33] with our masking methods. We conduct a hyperparameter optimization with 50 trials for each masking method and environment. The resulting hyperparameters are reported in Appendix A.9. All experiments are run on a machine with an Intel(R) Xeon(R) Platinum 8380 2.30 GHz processor and 2 TB RAM. (No specific software versions are given, such as "Stable-Baselines3 vX.Y.Z".)
Experiment Setup	Yes	We conduct a hyperparameter optimization with 50 trials for each masking method and environment. The resulting hyperparameters are reported in Appendix A.9.
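
The termination criterion quoted in the Open Datasets row could be sketched roughly as follows. This is not the authors' code: the class name `NormConstraintTermination`, the parameter `alpha_p`, the `constraint_violated` info key, and the stand-in `DummyEnv` are all hypothetical. The sketch assumes, for illustration, that the constraint bounds the Euclidean norm of the action by a threshold, and that the wrapped environment follows the Gymnasium five-tuple `step` convention.

```python
import math
from typing import Any, Sequence, Tuple


class NormConstraintTermination:
    """Hypothetical wrapper: ends the episode when ||action||_2 > alpha_p."""

    def __init__(self, env: Any, alpha_p: float) -> None:
        self.env = env          # any object with Gymnasium-style reset/step
        self.alpha_p = alpha_p  # assumed norm threshold, not a value from the paper

    def reset(self, **kwargs: Any):
        return self.env.reset(**kwargs)

    def step(self, action: Sequence[float]) -> Tuple[Any, float, bool, bool, dict]:
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Euclidean norm of the action, computed without external dependencies.
        norm = math.sqrt(sum(float(a) ** 2 for a in action))
        violated = norm > self.alpha_p
        # Record the violation and fold it into the termination flag.
        info = dict(info, constraint_violated=violated)
        return obs, reward, terminated or violated, truncated, info


class DummyEnv:
    """Stand-in environment so the sketch runs without MuJoCo installed."""

    def reset(self, **kwargs: Any):
        return [0.0], {}

    def step(self, action: Sequence[float]):
        return [0.0], 1.0, False, False, {}


env = NormConstraintTermination(DummyEnv(), alpha_p=1.0)
env.reset()
_, _, ok_terminated, _, _ = env.step([0.6, 0.6])        # norm below the threshold
_, _, bad_terminated, _, bad_info = env.step([1.0, 1.0])  # norm above the threshold
```

With a real task, `DummyEnv()` would be replaced by the actual Walker2D environment. Whether the check runs before or after the underlying `step` call is a design choice that the quoted text does not specify.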