Excluding the Irrelevant: Focusing Reinforcement Learning through Continuous Action Masking
Authors: Roland Stolz, Hanna Krasowski, Jakob Thumm, Michael Eichelbeck, Philipp Gassert, Matthias Althoff
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using proximal policy optimization (PPO), we evaluate our methods on four control tasks, where the relevant action set is computed based on the system dynamics and a relevant state set. Our experiments show that the three action masking methods achieve higher final rewards and converge faster than the baseline without action masking. |
| Researcher Affiliation | Academia | ¹Technical University of Munich, ²University of California, Berkeley, ³Munich Center for Machine Learning |
| Pseudocode | No | The paper describes methods and processes but does not include explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We attached the code for the experiments including a detailed readme in the supplementary material to this submission. |
| Open Datasets | Yes | We evaluate our methods on four control tasks... Mujoco Walker2D environment [39]. We extend the environment with a termination criterion, which ends an episode when the action violates the constraint ‖a‖₂ ≤ α_p... We use the standard parameters of the gymnasium implementation. |
| Dataset Splits | No | The paper mentions hyperparameter optimization but does not explicitly provide details about train/validation/test dataset splits (e.g., percentages or counts) or the methodology for such splits. |
| Hardware Specification | Yes | All experiments are run on a machine with an Intel(R) Xeon(R) Platinum 8380 2.30 GHz processor and 2 TB RAM. |
| Software Dependencies | No | For the experiments, we extend the Stable-Baselines3 [29] implementation of proximal policy optimization (PPO) [33] by our masking methods. (The paper names the library but gives no version numbers for it or for any other dependency.) |
| Experiment Setup | Yes | We conduct a hyperparameter optimization with 50 trials for each masking method and environment. The resulting hyperparameters are reported in Appendix A.9. |
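
The termination criterion quoted in the Open Datasets row can be illustrated with a small wrapper. This is a minimal sketch, not the authors' code: it assumes the constraint bounds the Euclidean norm of the action by a threshold, and the names `violates_constraint`, `TerminateOnViolation`, `alpha_p`, and `_ToyEnv` are hypothetical. The wrapper only relies on the Gymnasium-style `step` return value `(obs, reward, terminated, truncated, info)`.

```python
import math
from typing import Any, Dict, Sequence, Tuple


def violates_constraint(action: Sequence[float], alpha_p: float) -> bool:
    """Return True if the Euclidean norm of the action exceeds alpha_p.

    The norm-bound form of the constraint is an assumption of this sketch.
    """
    return math.sqrt(sum(a * a for a in action)) > alpha_p


class TerminateOnViolation:
    """Wrap an object exposing Gymnasium-style reset()/step() methods and
    end the episode whenever the chosen action violates the constraint."""

    def __init__(self, env: Any, alpha_p: float) -> None:
        self.env = env
        self.alpha_p = alpha_p

    def reset(self, **kwargs: Any) -> Tuple[Any, Dict[str, Any]]:
        return self.env.reset(**kwargs)

    def step(self, action: Sequence[float]):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if violates_constraint(action, self.alpha_p):
            # Flag the violation and force termination of the episode.
            terminated = True
            info = {**info, "constraint_violated": True}
        return obs, reward, terminated, truncated, info


class _ToyEnv:
    """Hypothetical stand-in environment, used only to exercise the wrapper."""

    def reset(self, **kwargs: Any):
        return [0.0], {}

    def step(self, action: Sequence[float]):
        return [0.0], 0.0, False, False, {}


env = TerminateOnViolation(_ToyEnv(), alpha_p=1.0)
inside = env.step([0.3, 0.4])   # norm 0.5, within the bound
outside = env.step([3.0, 4.0])  # norm 5.0, outside the bound
```

A baseline without action masking can select actions outside the relevant set, which is why a termination rule of this kind is needed there; with masking, the policy's output is restricted to that set before it reaches the environment.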