Excluding the Irrelevant: Focusing Reinforcement Learning through Continuous Action Masking

Authors: Roland Stolz, Hanna Krasowski, Jakob Thumm, Michael Eichelbeck, Philipp Gassert, Matthias Althoff

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using proximal policy optimization (PPO), we evaluate our methods on four control tasks, where the relevant action set is computed based on the system dynamics and a relevant state set. Our experiments show that the three action masking methods achieve higher final rewards and converge faster than the baseline without action masking.
Researcher Affiliation | Academia | 1 Technical University of Munich, 2 University of California, Berkeley, 3 Munich Center for Machine Learning
Pseudocode | No | The paper describes methods and processes but does not include explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We attached the code for the experiments including a detailed readme in the supplementary material to this submission.
Open Datasets | Yes | We evaluate our methods on four control tasks... MuJoCo Walker2D environment [39]. We extend the environment with a termination criterion, which ends an episode when the action violates the constraint a 2 αp... We use the standard parameters of the gymnasium implementation.
Dataset Splits | No | The paper mentions hyperparameter optimization but does not explicitly provide details about train/validation/test dataset splits (e.g., percentages or counts) or the methodology for such splits.
Hardware Specification | Yes | All experiments are run on a machine with an Intel(R) Xeon(R) Platinum 8380 2.30 GHz processor and 2 TB RAM.
Software Dependencies | No | For the experiments, we extend the Stable-Baselines3 [29] implementation of proximal policy optimization (PPO) [33] by our masking methods. We conduct a hyperparameter optimization with 50 trials for each masking method and environment. The resulting hyperparameters are reported in Appendix A.9. All experiments are run on a machine with an Intel(R) Xeon(R) Platinum 8380 2.30 GHz processor and 2 TB RAM. (No specific software versions are given, e.g. "Stable-Baselines3 vX.Y.Z".)
Experiment Setup | Yes | We conduct a hyperparameter optimization with 50 trials for each masking method and environment. The resulting hyperparameters are reported in Appendix A.9.
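
To give a rough sense of what restricting a policy to a relevant action set can look like, here is a minimal sketch. It is not taken from the paper and is not claimed to match any of its three masking methods. It assumes the relevant action set is an axis-aligned box and that the policy emits raw actions in [-1, 1] per dimension. The function name `mask_action_to_interval` and its parameters are hypothetical.

```python
from typing import List, Sequence


def mask_action_to_interval(
    raw_action: Sequence[float],
    relevant_low: Sequence[float],
    relevant_high: Sequence[float],
) -> List[float]:
    """Affinely map a raw action in [-1, 1]^n into the box [relevant_low, relevant_high].

    Assumption (not from the paper): the relevant action set is a box, so each
    dimension can be rescaled independently.
    """
    if not (len(raw_action) == len(relevant_low) == len(relevant_high)):
        raise ValueError("all inputs must have the same length")

    masked = []
    for a, lo, hi in zip(raw_action, relevant_low, relevant_high):
        if lo > hi:
            raise ValueError("relevant_low must not exceed relevant_high")
        # Clip the raw action to [-1, 1] before rescaling.
        a = max(-1.0, min(1.0, a))
        # -1 maps to lo, +1 maps to hi, values in between are interpolated linearly.
        masked.append(lo + 0.5 * (a + 1.0) * (hi - lo))
    return masked
```

With this mapping every action the agent can emit lands inside the relevant box, so no exploration is spent on excluded actions. Sets that are not boxes would need a different mapping.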