Learning to Shape Rewards Using a Game of Two Partners

Authors: David Mguni, Taher Jafferjee, Jianhong Wang, Nicolas Perez-Nieves, Wenbin Song, Feifei Tong, Matthew Taylor, Tianpei Yang, Zipeng Dai, Hui Chen, Jiangcheng Zhu, Kun Shao, Jun Wang, Yaodong Yang

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate ROSA's properties in three didactic experiments and show its superior performance against state-of-the-art RS algorithms in challenging sparse reward environments. The paper includes sections such as Experiments, Didactic Examples, Learning Performance, and Ablation Studies with comparative figures and analysis.
Researcher Affiliation | Collaboration | (1) Huawei R&D; (2) University of Manchester, UK; (3) Imperial College London, UK; (4) University of Alberta, Edmonton, Canada; (5) Alberta Machine Intelligence Institute, Edmonton, Canada; (6) ShanghaiTech University, China; (7) University College London, UK; (8) Peking University, Beijing, China
Pseudocode | No | The paper refers to 'The full code is in Sec. ?? of the Appendix.' but does not include any pseudocode or algorithm blocks in the main body of the provided text.
Open Source Code | No | The paper states 'The full code is in Sec. ?? of the Appendix.' but does not provide a link to an open-source code repository, and the main paper contains no explicit statement that the code is publicly released and accessible.
Open Datasets | No | The paper mentions using well-known environments such as 'Super Mario', 'Cartpole', 'Gravitar', 'Solaris', and various 'Maze' environments, but it does not provide specific access information (links, DOIs, or formal citations) for any public datasets used to train models.
Dataset Splits | No | The paper does not provide specific details on training, validation, or test dataset splits (e.g., exact percentages, sample counts, or citations to predefined splits) used in its experiments.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory specifications, or processor types used for running its experiments.
Software Dependencies | No | The paper mentions using 'proximal policy optimization (PPO)' and refers to 'RND (Burda et al. 2018)' but does not specify version numbers for any software dependencies or libraries.
Experiment Setup | No | The paper states that PPO was used as the learning algorithm and describes the Shaper's action set and policy structure, but it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training configurations in the main text. It refers the reader to 'Precise details are in the Supplementary Material, Section 8.' for more information.
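
For readers attempting a reproduction, the sketch below shows the kind of training configuration the paper leaves unspecified and that a full report would need to pin down. It is a minimal sketch only: it assumes Gymnasium and Stable-Baselines3 (neither library is named in the paper), uses the Cartpole environment mentioned above, does not implement ROSA's Shaper, and every hyperparameter value is an illustrative placeholder rather than one of the authors' settings.

```python
# Minimal reproduction stub: a baseline PPO run on CartPole with the
# hyperparameters a reproducibility report would need to state explicitly.
# All values are illustrative placeholders, NOT the paper's settings; the
# authors defer their exact configuration to Supplementary Material, Section 8.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")

model = PPO(
    policy="MlpPolicy",   # assumption: a simple MLP policy network
    env=env,
    learning_rate=3e-4,   # placeholder
    n_steps=2048,         # rollout length per update, placeholder
    batch_size=64,        # minibatch size, placeholder
    n_epochs=10,          # PPO epochs per update, placeholder
    gamma=0.99,           # discount factor, placeholder
    seed=0,               # fixing a seed aids reproducibility
    verbose=1,
)

model.learn(total_timesteps=100_000)  # placeholder training budget
```

Recording each of these values (plus seeds and library versions) alongside the results is what the missing 'Experiment Setup' and 'Software Dependencies' entries above would require.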