Learning to Shape Rewards Using a Game of Two Partners

Authors: David Mguni, Taher Jafferjee, Jianhong Wang, Nicolas Perez-Nieves, Wenbin Song, Feifei Tong, Matthew Taylor, Tianpei Yang, Zipeng Dai, Hui Chen, Jiangcheng Zhu, Kun Shao, Jun Wang, Yaodong Yang

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate ROSA's properties in three didactic experiments and show its superior performance against state-of-the-art RS algorithms in challenging sparse reward environments. The paper includes sections such as Experiments, Didactic Examples, Learning Performance, and Ablation Studies with comparative figures and analysis.
Researcher Affiliation | Collaboration | (1) Huawei R&D; (2) University of Manchester, UK; (3) Imperial College London, UK; (4) University of Alberta, Edmonton, Canada; (5) Alberta Machine Intelligence Institute, Edmonton, Canada; (6) ShanghaiTech University, China; (7) University College London, UK; (8) Peking University, Beijing, China
Pseudocode | No | The paper refers to 'The full code is in Sec. ?? of the Appendix.' but does not include any pseudocode or algorithm blocks in the main body of the provided text.
Open Source Code | No | The paper states 'The full code is in Sec. ?? of the Appendix.' but does not provide a link to an open-source code repository, and the main paper contains no explicit statement that the code is publicly released and accessible.
Open Datasets | No | The paper mentions using well-known environments such as 'Super Mario', 'Cartpole', 'Gravitar', 'Solaris', and various 'Maze' environments, but it does not provide specific access information (links, DOIs, or formal citations) for any public datasets used to train models.
Dataset Splits | No | The paper does not provide specific details on training, validation, or test dataset splits (e.g., exact percentages, sample counts, or citations to predefined splits) used in its experiments.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory specifications, or processor types used for running its experiments.
Software Dependencies | No | The paper mentions using 'proximal policy optimization (PPO)' and refers to 'RND (Burda et al. 2018)' but does not specify version numbers for any software dependencies or libraries.
Experiment Setup | No | The paper states that PPO was used as the learning algorithm and describes the Shaper's action set and policy structure, but it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training configurations in the main text. It refers the reader to 'Precise details are in the Supplementary Material, Section 8.' for more information.
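
For readers attempting a reproduction, the sketch below shows the kind of training configuration the paper leaves unspecified and that a full report would need to pin down. It is a minimal sketch only: it assumes Gymnasium and Stable-Baselines3 (neither library is named in the paper), uses the Cartpole environment mentioned above, does not implement ROSA's Shaper, and every hyperparameter value is an illustrative placeholder rather than one of the authors' settings.

```python
# Minimal reproduction stub: a baseline PPO run on CartPole with the
# hyperparameters a reproducibility report would need to state explicitly.
# All values are illustrative placeholders, NOT the paper's settings; the
# authors defer their exact configuration to Supplementary Material, Section 8.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")

model = PPO(
    policy="MlpPolicy",   # assumption: a simple MLP policy network
    env=env,
    learning_rate=3e-4,   # placeholder
    n_steps=2048,         # rollout length per update, placeholder
    batch_size=64,        # minibatch size, placeholder
    n_epochs=10,          # PPO epochs per update, placeholder
    gamma=0.99,           # discount factor, placeholder
    seed=0,               # fixing a seed aids reproducibility
    verbose=1,
)

model.learn(total_timesteps=100_000)  # placeholder training budget
```

Recording each of these values (plus seeds and library versions) alongside the results is what the missing 'Experiment Setup' and 'Software Dependencies' entries above would require.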