Explicable Reward Design for Reinforcement Learning Agents

Authors: Rati Devidze, Goran Radanovic, Parameswaran Kamalaruban, Adish Singla

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on two navigation tasks demonstrate the effectiveness of EXPRD in designing explicable reward functions.
Researcher Affiliation | Academia | Rati Devidze (1), Goran Radanovic (1), Parameswaran Kamalaruban (2), Adish Singla (1); (1) Max Planck Institute for Software Systems (MPI-SWS), Saarbrücken, Germany; (2) The Alan Turing Institute, London, UK
Pseudocode | Yes | Algorithm 1: Iterative Greedy Algorithm for EXPRD (a generic greedy-selection sketch is given below, after the table).
Open Source Code | Yes | GitHub repo: https://github.com/adishs/neurips2021_explicable-reward-design_code
Open Datasets | No | The paper describes custom-built simulation environments (ROOMSNAVENV and LINEKEYNAVENV) rather than pre-existing public datasets, and it does not provide dataset-style access information for these environments.
Dataset Splits | Yes | All results are reported as averages over 40 runs, and convergence plots show the mean with standard error bars.
Hardware Specification | No | The paper states that hardware details are provided in the Appendix of the supplementary material, which is not part of the provided text for analysis.
Software Dependencies | No | The paper states that software dependency details are provided in the Appendix of the supplementary material, which is not part of the provided text for analysis. It mentions using a "standard Q-learning method" but gives no specific software versions.
Experiment Setup | Yes | We use a standard Q-learning method for the agent with a learning rate of 0.5 and an exploration factor of 0.1 [7]. During training, the agent receives rewards based on the designed reward function R̂; however, it is evaluated based on the original reward function R. A training episode ends when the maximum number of steps (set to 50) is reached or the agent's action terminates the episode. All results are reported as averages over 40 runs, and convergence plots show the mean with standard error bars. (An illustrative training-loop sketch follows below.)
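
To give a concrete picture of what an "iterative greedy" procedure looks like, here is a minimal Python sketch of generic greedy subset selection: at each step, add the candidate whose inclusion most improves a scoring function. The function names (`greedy_support_selection`, `evaluate_support`) and the toy objective are hypothetical illustrations; this is not the paper's Algorithm 1, whose actual objective and inner optimization over reward values are defined in the paper itself.

```python
# Minimal sketch of iterative greedy subset selection (illustrative only;
# the paper's Algorithm 1 for EXPRD uses its own objective and constraints).

def greedy_support_selection(candidates, budget, evaluate_support):
    """Greedily grow a support set, adding the best-scoring candidate each round."""
    support = set()
    for _ in range(budget):
        best_candidate, best_score = None, float("-inf")
        for s in candidates:
            if s in support:
                continue
            score = evaluate_support(support | {s})  # score of the enlarged support
            if score > best_score:
                best_candidate, best_score = s, score
        if best_candidate is None:
            break
        support.add(best_candidate)
    return support


if __name__ == "__main__":
    # Toy usage with a made-up additive scoring function (purely illustrative).
    states = list(range(10))
    toy_value = {s: (s % 3) + 0.1 * s for s in states}
    score = lambda sup: sum(toy_value[s] for s in sup)  # placeholder objective
    print(greedy_support_selection(states, budget=3, evaluate_support=score))
```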
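The sketch below illustrates the described training protocol under stated assumptions: tabular Q-learning with learning rate 0.5 and epsilon-greedy exploration 0.1, episodes capped at 50 steps, training on a designed reward while tracking return under the original reward, averaged over 40 runs. The `ToyChainEnv`, `designed_reward`, and `original_reward` objects, as well as the episode count and discount factor, are placeholders for illustration, not the paper's ROOMSNAVENV / LINEKEYNAVENV environments or its designed reward functions.

```python
# Minimal sketch of the reported protocol: Q-learning trained on a designed
# reward R_hat, evaluated under the original reward R, averaged over 40 runs.
import numpy as np


class ToyChainEnv:
    """Placeholder chain MDP: move left/right on a line, goal at the right end."""

    def __init__(self, n_states=10, max_steps=50):
        self.n_states, self.n_actions, self.max_steps = n_states, 2, max_steps

    def reset(self):
        self.state, self.t = 0, 0
        return self.state

    def step(self, action):
        self.state = max(0, min(self.n_states - 1, self.state + (1 if action == 1 else -1)))
        self.t += 1
        done = self.state == self.n_states - 1 or self.t >= self.max_steps
        return self.state, done


def original_reward(state, env):   # R: sparse goal reward (placeholder)
    return 1.0 if state == env.n_states - 1 else 0.0


def designed_reward(state, env):   # R_hat: denser shaped reward (placeholder)
    return state / (env.n_states - 1)


def q_learning_run(env, episodes=200, alpha=0.5, epsilon=0.1, gamma=0.95, seed=0):
    # episode count, discount, and terminal handling are arbitrary for this sketch
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))
    eval_returns = []
    for _ in range(episodes):
        s, done, ret = env.reset(), False, 0.0
        while not done:
            a = rng.integers(env.n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, done = env.step(a)
            # Train on the designed reward R_hat ...
            target = designed_reward(s_next, env) + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            # ... but track performance under the original reward R.
            ret += original_reward(s_next, env)
            s = s_next
        eval_returns.append(ret)
    return np.array(eval_returns)


if __name__ == "__main__":
    runs = np.stack([q_learning_run(ToyChainEnv(), seed=i) for i in range(40)])
    mean = runs.mean(axis=0)
    stderr = runs.std(axis=0, ddof=1) / np.sqrt(runs.shape[0])  # standard error bars
    print(f"final return: {mean[-1]:.3f} +/- {stderr[-1]:.3f}")
```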