Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Keeping Your Distance: Solving Sparse Reward Tasks Using Self-Balancing Shaped Rewards

Authors: Alexander Trott, Stephan Zheng, Caiming Xiong, Richard Socher

NeurIPS 2019

Research Type: Experimental
LLM Response: "We demonstrate that our method successfully solves a variety of hard-exploration tasks (including maze navigation and 3D construction in a Minecraft environment), where naive distance-based reward shaping otherwise fails, and intrinsic curiosity and reward relabeling strategies exhibit poor performance. ... To demonstrate the effectiveness of our method, we apply it to a variety of goal-reaching tasks. We focus on settings where local optima interfere with learning from naive distance-to-goal shaped rewards. We compare this baseline to results using our approach as well as to results using curiosity and reward-relabeling in order to learn from sparse rewards."

Researcher Affiliation: Industry
LLM Response: "Alexander Trott (Salesforce Research, EMAIL); Stephan Zheng (Salesforce Research, EMAIL); Caiming Xiong (Salesforce Research, EMAIL); Richard Socher (Salesforce Research, EMAIL)"

Pseudocode: Yes
LLM Response: "Algorithm 1: Sibling Rivalry"

Open Source Code: Yes
LLM Response: "Reference implementation available at https://github.com/salesforce/sibling-rivalry"

Open Datasets: No
LLM Response: "The paper describes custom environments and tasks (e.g., '2D Point Maze', 'U-Maze task with a Mujoco ant agent', '2D bitmap manipulation', '3D construction task in Minecraft') without providing specific public dataset access information or citing well-known public datasets."

Dataset Splits: No
LLM Response: "The paper discusses 'evaluation checkpoints' and 'averaging over 5 experiments', but does not provide specific details on how the data was split into training, validation, and test sets, or reference standard splits."

Hardware Specification: No
LLM Response: "The paper mentions platforms and frameworks used (e.g., 'Mujoco', 'Malmo platform', 'IMPALA framework') but does not provide any specific hardware details such as CPU/GPU models or memory specifications for the experimental setup."

Software Dependencies: No
LLM Response: "The paper mentions various algorithms and frameworks (e.g., 'Proximal Policy Optimization', 'Hindsight Experience Replay', 'DDPG', 'ICM', 'IMPALA') but does not provide specific version numbers for any software dependencies or libraries."

Experiment Setup: No
LLM Response: "The paper mentions episode durations for different environments (e.g., 'Episodes have a maximum duration of 50 and 500 environment steps for the 2D Point Maze and Ant Maze, respectively.') and refers to Appendix F for 'detailed descriptions of the environments, tasks, and implementation choices,' but it does not provide specific hyperparameter values or comprehensive training configurations in the main text."
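To make the pseudocode entry above concrete, the core idea behind Sibling Rivalry's self-balancing shaped reward can be sketched in a few lines. This is a hedged illustration, not the authors' implementation (see Algorithm 1 and the linked repository for the actual method): it assumes a Euclidean distance metric and interprets the self-balancing reward as distance-to-anti-goal minus distance-to-goal, where each of two "sibling" rollouts sharing the same start and goal uses the other's achieved terminal state as its anti-goal.

```python
import math

def self_balancing_reward(state, goal, anti_goal):
    """Illustrative self-balancing shaped reward (a sketch, not the
    paper's exact formulation): proximity to the goal is rewarded,
    while proximity to the anti-goal (the sibling rollout's terminal
    state) is penalized, discouraging collapse into local optima."""
    return math.dist(state, anti_goal) - math.dist(state, goal)

# Two sibling rollouts share the same start and goal; each treats the
# other's final state as its anti-goal.
goal = (1.0, 1.0)
terminal_a = (0.2, 0.9)  # where rollout A ended up
terminal_b = (0.9, 0.1)  # where rollout B ended up

r_a = self_balancing_reward(terminal_a, goal, anti_goal=terminal_b)
r_b = self_balancing_reward(terminal_b, goal, anti_goal=terminal_a)
```

Algorithm 1 in the paper additionally specifies which sibling's experience is used for the policy update; that selection logic is omitted from this sketch.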