Keeping Your Distance: Solving Sparse Reward Tasks Using Self-Balancing Shaped Rewards

Authors: Alexander Trott, Stephan Zheng, Caiming Xiong, Richard Socher

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate that our method successfully solves a variety of hard-exploration tasks (including maze navigation and 3D construction in a Minecraft environment), where naive distance-based reward shaping otherwise fails, and intrinsic curiosity and reward relabeling strategies exhibit poor performance. ... To demonstrate the effectiveness of our method, we apply it to a variety of goal-reaching tasks. We focus on settings where local optima interfere with learning from naive distance-to-goal shaped rewards. We compare this baseline to results using our approach as well as to results using curiosity and reward-relabeling in order to learn from sparse rewards."
Researcher Affiliation | Industry | Alexander Trott, Salesforce Research (atrott@salesforce.com); Stephan Zheng, Salesforce Research (stephan.zheng@salesforce.com); Caiming Xiong, Salesforce Research (cxiong@salesforce.com); Richard Socher, Salesforce Research (rsocher@salesforce.com)
Pseudocode | Yes | Algorithm 1: Sibling Rivalry (see the illustrative sketch after the table)
Open Source Code | Yes | Reference implementation available at https://github.com/salesforce/sibling-rivalry
Open Datasets | No | The paper describes custom environments and tasks (e.g., '2D Point Maze', 'U-Maze task with a Mujoco ant agent', '2D bitmap manipulation', '3D construction task in Minecraft') but does not provide public dataset access information or cite well-known public datasets.
Dataset Splits | No | The paper discusses 'evaluation checkpoints' and 'averaging over 5 experiments' but does not detail how data was split into training, validation, and test sets, nor does it reference standard splits.
Hardware Specification | No | The paper mentions the platforms and frameworks used (e.g., 'Mujoco', 'Malmo platform', 'IMPALA framework') but does not provide hardware details such as CPU/GPU models or memory specifications for the experimental setup.
Software Dependencies | No | The paper mentions various algorithms and frameworks (e.g., 'Proximal Policy Optimization', 'Hindsight Experience Replay', 'DDPG', 'ICM', 'IMPALA') but does not provide version numbers for any software dependencies or libraries.
Experiment Setup | No | The paper states episode durations for the maze environments ('Episodes have a maximum duration of 50 and 500 environment steps for the 2D Point Maze and Ant Maze, respectively.') and defers to Appendix F for 'detailed descriptions of the environments, tasks, and implementation choices,' but the main text does not give specific hyperparameter values or a comprehensive training configuration (the stated episode limits are collected in the configuration stub below).
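
The Pseudocode entry references Algorithm 1 (Sibling Rivalry). As a rough illustration of the self-balancing idea named in the paper's title, the sketch below contrasts a naive distance-to-goal shaped reward with a variant that is additionally repelled from an 'anti-goal' (for example, the terminal state reached by a sibling rollout). The function names, the Euclidean distance metric, and the toy states are illustrative assumptions; the exact Algorithm 1, including its inclusion criteria and hyperparameters, is given in the paper and the linked repository.

```python
import numpy as np

def naive_shaped_reward(state, goal):
    # Naive distance-to-goal shaping; prone to local optima
    # (e.g., the near side of a maze wall).
    return -np.linalg.norm(state - goal)

def self_balancing_reward(state, goal, anti_goal):
    # Illustrative self-balancing variant (an assumption, not the paper's
    # exact Algorithm 1): attraction toward the goal is balanced by
    # repulsion from an anti-goal, taken here to be the terminal state
    # reached by a sibling rollout with the same start and goal.
    return -np.linalg.norm(state - goal) + np.linalg.norm(state - anti_goal)

# Toy usage: two sibling rollouts relabel each other's rewards using
# their final states as anti-goals.
goal = np.array([1.0, 1.0])
final_a, final_b = np.array([0.2, 0.9]), np.array([0.8, 0.1])
state = np.array([0.5, 0.5])
reward_for_rollout_a = self_balancing_reward(state, goal, anti_goal=final_b)
reward_for_rollout_b = self_balancing_reward(state, goal, anti_goal=final_a)
```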
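
The only concrete setup values quoted from the main text are the episode limits for the two maze tasks; a minimal configuration stub collecting them could look like the following (key names are hypothetical; all other hyperparameters are deferred to Appendix F and the reference implementation).

```python
# Episode length limits stated in the paper's main text; every other
# hyperparameter lives in Appendix F / the reference implementation,
# so only these two values are reproduced here. Key names are
# hypothetical identifiers chosen for this stub.
MAX_EPISODE_STEPS = {
    "2d_point_maze": 50,  # 2D Point Maze
    "ant_maze": 500,      # U-Maze with a Mujoco ant agent
}
```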