Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Excluding the Irrelevant: Focusing Reinforcement Learning through Continuous Action Masking
Authors: Roland Stolz, Hanna Krasowski, Jakob Thumm, Michael Eichelbeck, Philipp Gassert, Matthias Althoff
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using proximal policy optimization (PPO), we evaluate our methods on four control tasks, where the relevant action set is computed based on the system dynamics and a relevant state set. Our experiments show that the three action masking methods achieve higher final rewards and converge faster than the baseline without action masking. |
| Researcher Affiliation | Academia | 1Technical University of Munich, 2University of California, Berkeley, 3Munich Center for Machine Learning |
| Pseudocode | No | The paper describes methods and processes but does not include explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We attached the code for the experiments including a detailed readme in the supplementary material to this submission. |
| Open Datasets | Yes | We evaluate our methods on four control tasks... Mujoco Walker2D environment [39]. We extend the environment with a termination criterion, which ends an episode, when the action violates the constraint ‖a‖₂ ≤ α_p... We use the standard parameters of the gymnasium implementation. |
| Dataset Splits | No | The paper mentions hyperparameter optimization but does not explicitly provide details about train/validation/test dataset splits (e.g., percentages or counts) or the methodology for such splits. |
| Hardware Specification | Yes | All experiments are run on a machine with an Intel(R) Xeon(R) Platinum 8380 2.30 GHz processor and 2 TB RAM. |
| Software Dependencies | No | For the experiments, we extend the stable-baseline3 [29] implementation of proximal policy optimization (PPO) [33] by our masking methods. We conduct a hyperparameter optimization with 50 trials for each masking method and environment. The resulting hyperparameters are reported in Appendix A.9. All experiments are run on a machine with an Intel(R) Xeon(R) Platinum 8380 2.30 GHz processor and 2 TB RAM. (No specific software versions are given, e.g. "stable-baseline3 vX.Y.Z".) |
| Experiment Setup | Yes | We conduct a hyperparameter optimization with 50 trials for each masking method and environment. The resulting hyperparameters are reported in Appendix A.9. |
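
For readers unfamiliar with the idea summarized in the table, the following is a minimal sketch of what restricting a continuous action to a relevant set can look like. It is not one of the three masking methods the paper evaluates: it assumes, purely for illustration, that the relevant set is an axis-aligned box and that the policy emits actions normalized to [-1, 1]. The function name `mask_action` and its parameters are hypothetical.

```python
from typing import List, Sequence


def mask_action(
    raw_action: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
) -> List[float]:
    """Map a normalized action in [-1, 1]^n into the box [lower, upper].

    Illustrative assumption: the relevant action set is an axis-aligned box.
    The paper's own methods handle more general relevant sets.
    """
    if not (len(raw_action) == len(lower) == len(upper)):
        raise ValueError("raw_action, lower, and upper must have equal length")

    masked = []
    for a, lo, hi in zip(raw_action, lower, upper):
        if lo > hi:
            raise ValueError("each lower bound must not exceed its upper bound")
        # Clip to the normalized range so the output cannot leave the box.
        a = max(-1.0, min(1.0, a))
        # Affine rescaling: -1 maps to lo, +1 maps to hi.
        masked.append(lo + 0.5 * (a + 1.0) * (hi - lo))
    return masked


if __name__ == "__main__":
    # The midpoint of the normalized range lands at the centre of the box.
    print(mask_action([-1.0, 0.0, 1.0], [0.0, 0.0, 0.0], [2.0, 2.0, 2.0]))
```

Because every policy output is mapped into the box, the agent never spends samples on actions outside it, which is the intuition behind focusing exploration on relevant actions.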