Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Keeping Your Distance: Solving Sparse Reward Tasks Using Self-Balancing Shaped Rewards
Authors: Alexander Trott, Stephan Zheng, Caiming Xiong, Richard Socher
NeurIPS 2019 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that our method successfully solves a variety of hard-exploration tasks (including maze navigation and 3D construction in a Minecraft environment), where naive distance-based reward shaping otherwise fails, and intrinsic curiosity and reward relabeling strategies exhibit poor performance. ... To demonstrate the effectiveness of our method, we apply it to a variety of goal-reaching tasks. We focus on settings where local optima interfere with learning from naive distance-to-goal shaped rewards. We compare this baseline to results using our approach as well as to results using curiosity and reward relabeling in order to learn from sparse rewards. |
| Researcher Affiliation | Industry | Alexander Trott (Salesforce Research), Stephan Zheng (Salesforce Research), Caiming Xiong (Salesforce Research), Richard Socher (Salesforce Research) |
| Pseudocode | Yes | Algorithm 1: Sibling Rivalry |
| Open Source Code | Yes | Reference implementation available at https://github.com/salesforce/sibling-rivalry |
| Open Datasets | No | The paper describes custom environments and tasks (e.g., '2D Point Maze', 'U-Maze task with a Mujoco ant agent', '2D bitmap manipulation', '3D construction task in Minecraft') without providing specific public dataset access information or citing well-known public datasets. |
| Dataset Splits | No | The paper discusses 'evaluation checkpoints' and 'averaging over 5 experiments', but does not describe any split of data into training, validation, and test sets, nor does it reference standard splits. |
| Hardware Specification | No | The paper mentions platforms and frameworks used (e.g., 'Mujoco', 'Malmo platform', 'IMPALA framework') but does not provide any specific hardware details such as CPU/GPU models or memory specifications for the experimental setup. |
| Software Dependencies | No | The paper mentions various algorithms and frameworks (e.g., 'Proximal Policy Optimization', 'Hindsight Experience Replay', 'DDPG', 'ICM', 'IMPALA') but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | No | The paper states episode durations for some environments (e.g., 'Episodes have a maximum duration of 50 and 500 environment steps for the 2D Point Maze and Ant Maze, respectively.') and refers to Appendix F for 'detailed descriptions of the environments, tasks, and implementation choices', but the main text does not provide specific hyperparameter values or a comprehensive training configuration. |