Explicable Reward Design for Reinforcement Learning Agents
Authors: Rati Devidze, Goran Radanovic, Parameswaran Kamalaruban, Adish Singla
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on two navigation tasks demonstrate the effectiveness of EXPRD in designing explicable reward functions. |
| Researcher Affiliation | Academia | Rati Devidze¹, Goran Radanovic¹, Parameswaran Kamalaruban², Adish Singla¹. ¹Max Planck Institute for Software Systems (MPI-SWS), Saarbrücken, Germany; ²The Alan Turing Institute, London, UK |
| Pseudocode | Yes | Algorithm 1 Iterative Greedy Algorithm for EXPRD |
| Open Source Code | Yes | GitHub repo: https://github.com/adishs/neurips2021_explicable-reward-design_code |
| Open Datasets | No | The paper describes custom-built simulation environments (ROOMSNAVENV and LINEKEYNAVENV) rather than using pre-existing public datasets. It does not provide access information for these environments as datasets. |
| Dataset Splits | Yes | All the results are reported as average over 40 runs and convergence plots show mean with standard error bars. |
| Hardware Specification | No | The paper states that hardware details are provided in the Appendix of the supplementary material, which is not part of the provided text for analysis. |
| Software Dependencies | No | The paper states that software dependency details are provided in the Appendix of the supplementary material, which is not part of the provided text for analysis. It mentions using a "standard Q-learning method" but gives no specific software versions. |
| Experiment Setup | Yes | We use standard Q-learning method for the agent with a learning rate 0.5 and exploration factor 0.1 [7]. During training, the agent receives rewards based on $\widehat{R}$, however, is evaluated based on $R$. A training episode ends when the maximum steps (set to 50) is reached or an agent's action terminates the episode. All the results are reported as average over 40 runs and convergence plots show mean with standard error bars. |
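
The experiment setup quoted in the table can be illustrated with a minimal sketch: tabular Q-learning with learning rate 0.5, exploration factor 0.1, and a 50-step episode cap, trained on a designed reward and evaluated on the original reward, with the mean and standard error taken over 40 runs. Everything else here is an assumption made for illustration and is not taken from the paper or its repository: the 6-state chain environment (it is not ROOMSNAVENV or LINEKEYNAVENV), the discount factor of 0.95, the number of training episodes, and the hand-written `designed_reward`, which is a simple stand-in and not an output of EXPRD.

```python
import math
import random
import statistics

# Values taken from the reported setup.
ALPHA = 0.5       # learning rate
EPSILON = 0.1     # exploration factor
MAX_STEPS = 50    # maximum steps per episode
N_RUNS = 40       # number of independent runs to average over

# Assumptions for this sketch only (not stated in the excerpt).
GAMMA = 0.95      # discount factor
N_STATES = 6      # hypothetical chain: states 0..5, goal at state 5
GOAL = N_STATES - 1
ACTIONS = (0, 1)  # 0 = left, 1 = right


def step(state, action):
    """Deterministic chain dynamics; returns (next_state, done)."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    return nxt, nxt == GOAL


def original_reward(state, action, nxt):
    """Original reward R: 1 on reaching the goal, 0 otherwise."""
    return 1.0 if nxt == GOAL else 0.0


def designed_reward(state, action, nxt):
    """Hand-written stand-in for a designed reward: R minus a penalty for moving left."""
    return original_reward(state, action, nxt) - (0.1 if action == 0 else 0.0)


def greedy(q, state):
    """Greedy action; ties resolve to the first action in ACTIONS."""
    return max(ACTIONS, key=lambda a: q[state][a])


def train(rng, episodes=300):
    """Epsilon-greedy Q-learning driven by the designed reward."""
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0
        for _t in range(MAX_STEPS):
            explore = rng.random() < EPSILON
            action = rng.choice(ACTIONS) if explore else greedy(q, state)
            nxt, done = step(state, action)
            target = designed_reward(state, action, nxt)
            if not done:
                target += GAMMA * max(q[nxt])  # bootstrap only from non-terminal states
            q[state][action] += ALPHA * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


def evaluate(q):
    """Greedy rollout scored with the original reward R (undiscounted sum)."""
    state, total = 0, 0.0
    for _t in range(MAX_STEPS):
        action = greedy(q, state)
        nxt, done = step(state, action)
        total += original_reward(state, action, nxt)
        state = nxt
        if done:
            break
    return total


def run_experiment(seed=0):
    """Mean and standard error of the evaluation return over N_RUNS seeded runs."""
    returns = [evaluate(train(random.Random(seed + i))) for i in range(N_RUNS)]
    mean = statistics.mean(returns)
    stderr = statistics.stdev(returns) / math.sqrt(N_RUNS)
    return mean, stderr
```

Calling `run_experiment()` returns a `(mean, standard_error)` pair for the final evaluation return. Producing convergence plots like those described would additionally require calling `evaluate` at intervals during training and aggregating the per-interval values across runs in the same way.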