Keeping Your Distance: Solving Sparse Reward Tasks Using Self-Balancing Shaped Rewards
Authors: Alexander Trott, Stephan Zheng, Caiming Xiong, Richard Socher
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that our method successfully solves a variety of hard-exploration tasks (including maze navigation and 3D construction in a Minecraft environment), where naive distance-based reward shaping otherwise fails, and intrinsic curiosity and reward relabeling strategies exhibit poor performance. ... To demonstrate the effectiveness of our method, we apply it to a variety of goal-reaching tasks. We focus on settings where local optima interfere with learning from naive distance-to-goal shaped rewards. We compare this baseline to results using our approach as well as to results using curiosity and reward-relabeling in order to learn from sparse rewards. |
| Researcher Affiliation | Industry | Alexander Trott (atrott@salesforce.com), Stephan Zheng (stephan.zheng@salesforce.com), Caiming Xiong (cxiong@salesforce.com), Richard Socher (rsocher@salesforce.com); all Salesforce Research |
| Pseudocode | Yes | Algorithm 1: Sibling Rivalry (see the sketch after this table) |
| Open Source Code | Yes | Reference implementation available at https://github.com/salesforce/sibling-rivalry |
| Open Datasets | No | The paper describes custom environments and tasks (e.g., '2D Point Maze', 'U-Maze task with a Mujoco ant agent', '2D bitmap manipulation', '3D construction task in Minecraft') without providing specific public dataset access information or citing well-known public datasets. |
| Dataset Splits | No | The paper discusses 'evaluation checkpoints' and 'averaging over 5 experiments', but does not provide specific details on how the data was split into training, validation, and test sets, or reference standard splits. |
| Hardware Specification | No | The paper mentions platforms and frameworks used (e.g., 'Mujoco', 'Malmo platform', 'IMPALA framework') but does not provide any specific hardware details such as CPU/GPU models or memory specifications for the experimental setup. |
| Software Dependencies | No | The paper mentions various algorithms and frameworks (e.g., 'Proximal Policy Optimization', 'Hindsight Experience Replay', 'DDPG', 'ICM', 'IMPALA') but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | No | The paper mentions episode durations for different environments (e.g., 'Episodes have a maximum duration of 50 and 500 environment steps for the 2D Point Maze and Ant Maze, respectively.') and refers to Appendix F for 'detailed descriptions of the environments, tasks, and implementation choices,' but it does not provide specific hyperparameter values or comprehensive training configurations in the main text. |
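
For context on the pseudocode row above, here is a minimal sketch of the self-balancing shaped reward at the core of Sibling Rivalry, assuming it simply combines a distance-to-goal penalty with a distance-from-anti-goal bonus, where the anti-goal is the terminal state reached by a sibling rollout that shares the same start and goal. The function name, the Euclidean-distance default, and the exact form of the reward are illustrative assumptions; the paper may gate or clip these terms, so the reference implementation at https://github.com/salesforce/sibling-rivalry should be treated as authoritative.

```python
import numpy as np

def self_balancing_reward(state, goal, antigoal, distance=None):
    """Sketch of a self-balancing shaped reward: pull the agent toward
    the goal while pushing it away from the anti-goal (the terminal
    state reached by the sibling rollout)."""
    if distance is None:
        # Assumed metric for illustration; the tasks in the paper define
        # their own task-specific distance functions.
        distance = lambda a, b: np.linalg.norm(np.asarray(a) - np.asarray(b))
    return -distance(state, goal) + distance(state, antigoal)

# Toy usage: two sibling rollouts share a start and goal; each rollout's
# terminal state serves as the anti-goal for the other.
goal = [10.0, 0.0]
terminal_a = [4.0, 1.0]   # where sibling A ended up
terminal_b = [6.0, -2.0]  # where sibling B ended up
r_a = self_balancing_reward(terminal_a, goal, antigoal=terminal_b)
r_b = self_balancing_reward(terminal_b, goal, antigoal=terminal_a)
```

Because both terms use the same distance metric, there is no extra weighting hyperparameter to tune between them, which appears to be the "self-balancing" property the paper's title refers to.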