Keeping Your Distance: Solving Sparse Reward Tasks Using Self-Balancing Shaped Rewards

Authors: Alexander Trott, Stephan Zheng, Caiming Xiong, Richard Socher

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate that our method successfully solves a variety of hard-exploration tasks (including maze navigation and 3D construction in a Minecraft environment), where naive distance-based reward shaping otherwise fails, and intrinsic curiosity and reward relabeling strategies exhibit poor performance. ... To demonstrate the effectiveness of our method, we apply it to a variety of goal-reaching tasks. We focus on settings where local optima interfere with learning from naive distance-to-goal shaped rewards. We compare this baseline to results using our approach as well as to results using curiosity and reward-relabeling in order to learn from sparse rewards."
Researcher Affiliation | Industry | Alexander Trott, Salesforce Research (atrott@salesforce.com); Stephan Zheng, Salesforce Research (stephan.zheng@salesforce.com); Caiming Xiong, Salesforce Research (cxiong@salesforce.com); Richard Socher, Salesforce Research (rsocher@salesforce.com)
Pseudocode | Yes | Algorithm 1: Sibling Rivalry (see the illustrative sketch after the table)
Open Source Code | Yes | Reference implementation available at https://github.com/salesforce/sibling-rivalry
Open Datasets | No | The paper describes custom environments and tasks (e.g., '2D Point Maze', 'U-Maze task with a Mujoco ant agent', '2D bitmap manipulation', '3D construction task in Minecraft') but does not provide public dataset access information or cite well-known public datasets.
Dataset Splits | No | The paper discusses 'evaluation checkpoints' and 'averaging over 5 experiments' but does not detail how data was split into training, validation, and test sets, nor does it reference standard splits.
Hardware Specification | No | The paper mentions the platforms and frameworks used (e.g., 'Mujoco', 'Malmo platform', 'IMPALA framework') but does not provide hardware details such as CPU/GPU models or memory specifications for the experimental setup.
Software Dependencies | No | The paper mentions various algorithms and frameworks (e.g., 'Proximal Policy Optimization', 'Hindsight Experience Replay', 'DDPG', 'ICM', 'IMPALA') but does not provide version numbers for any software dependencies or libraries.
Experiment Setup | No | The paper states episode durations for the maze environments ('Episodes have a maximum duration of 50 and 500 environment steps for the 2D Point Maze and Ant Maze, respectively.') and defers to Appendix F for 'detailed descriptions of the environments, tasks, and implementation choices,' but the main text does not give specific hyperparameter values or a comprehensive training configuration (the stated episode limits are collected in the configuration stub below).
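
The Pseudocode entry references Algorithm 1 (Sibling Rivalry). As a rough illustration of the self-balancing idea named in the paper's title, the sketch below contrasts a naive distance-to-goal shaped reward with a variant that is additionally repelled from an 'anti-goal' (for example, the terminal state reached by a sibling rollout). The function names, the Euclidean distance metric, and the toy states are illustrative assumptions; the exact Algorithm 1, including its inclusion criteria and hyperparameters, is given in the paper and the linked repository.

```python
import numpy as np

def naive_shaped_reward(state, goal):
    # Naive distance-to-goal shaping; prone to local optima
    # (e.g., the near side of a maze wall).
    return -np.linalg.norm(state - goal)

def self_balancing_reward(state, goal, anti_goal):
    # Illustrative self-balancing variant (an assumption, not the paper's
    # exact Algorithm 1): attraction toward the goal is balanced by
    # repulsion from an anti-goal, taken here to be the terminal state
    # reached by a sibling rollout with the same start and goal.
    return -np.linalg.norm(state - goal) + np.linalg.norm(state - anti_goal)

# Toy usage: two sibling rollouts relabel each other's rewards using
# their final states as anti-goals.
goal = np.array([1.0, 1.0])
final_a, final_b = np.array([0.2, 0.9]), np.array([0.8, 0.1])
state = np.array([0.5, 0.5])
reward_for_rollout_a = self_balancing_reward(state, goal, anti_goal=final_b)
reward_for_rollout_b = self_balancing_reward(state, goal, anti_goal=final_a)
```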
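
The only concrete setup values quoted from the main text are the episode limits for the two maze tasks; a minimal configuration stub collecting them could look like the following (key names are hypothetical; all other hyperparameters are deferred to Appendix F and the reference implementation).

```python
# Episode length limits stated in the paper's main text; every other
# hyperparameter lives in Appendix F / the reference implementation,
# so only these two values are reproduced here. Key names are
# hypothetical identifiers chosen for this stub.
MAX_EPISODE_STEPS = {
    "2d_point_maze": 50,  # 2D Point Maze
    "ant_maze": 500,      # U-Maze with a Mujoco ant agent
}
```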