Intrinsic Motivation for Encouraging Synergistic Behavior

Authors: Rohan Chitnis, Shubham Tulsiani, Saurabh Gupta, Abhinav Gupta

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our approach in robotic bimanual manipulation and multi-agent locomotion tasks with sparse rewards; we find that our approach yields more efficient learning than both 1) training with only the sparse reward and 2) using the typical surprise-based formulation of intrinsic motivation, which does not bias toward synergistic behavior. Videos are available on the project webpage: https://sites.google.com/view/iclr2020-synergistic.
Researcher Affiliation | Collaboration | MIT Computer Science and Artificial Intelligence Laboratory; Facebook Artificial Intelligence Research. ronuchit@mit.edu, shubhtuls@fb.com, saurabhg@illinois.edu, gabhinav@fb.com
Pseudocode | Yes | Full pseudocode is provided in Appendix A.
Open Source Code | No | The paper does not explicitly state that the source code for its methodology is released, nor does it provide a direct link to a code repository. It only links to a project webpage for videos and mentions using third-party libraries such as stable baselines.
Open Datasets | No | The paper uses custom simulated robotic and multi-agent locomotion tasks (e.g., bottle opening, ant push, soccer) built in MuJoCo. These are custom environments rather than publicly available datasets, and no link or access information for pre-generated data is provided.
Dataset Splits | No | The paper evaluates performance through interaction with simulated environments, generating data dynamically; it does not specify traditional train/validation/test splits with percentages or sample counts.
Hardware Specification | No | The paper states "For all tasks, training is parallelized across 50 workers" but does not specify hardware details such as GPU models, CPU models, or memory.
Software Dependencies | No | The paper mentions software such as MuJoCo, the Surreal Robotics Suite, and stable baselines, but does not provide version numbers for any of them.
Experiment Setup | Yes | We set the trade-off coefficient λ = 10 (see Appendix D). We use the stable baselines (Hill et al., 2018) implementation of PPO (Schulman et al., 2017) as our policy gradient algorithm. We use clipping parameter 0.2, entropy loss coefficient 0.01, value loss function coefficient 0.5, gradient clip threshold 0.5, number of steps 10, number of minibatches per update 4, number of optimization epochs per update 4, and Adam (Kingma & Ba, 2015) with learning rate 0.001.
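
The Research Type row contrasts the paper's synergy-biased intrinsic motivation with the typical surprise-based formulation, and the Experiment Setup row reports a trade-off coefficient λ = 10. The sketch below illustrates how such intrinsic bonuses might be computed and folded into the sparse task reward; the function names, the composed-prediction form of the synergy bonus, and the additive λ-weighted combination are illustrative assumptions rather than details confirmed by the rows above.

```python
import numpy as np

# Trade-off coefficient reported in the Experiment Setup row (Appendix D of the paper).
LAMBDA = 10.0

def surprise_bonus(next_state, joint_prediction):
    # Standard surprise-based intrinsic reward: prediction error of a joint
    # forward model. This is the baseline the abstract says does not bias
    # exploration toward synergistic behavior.
    return float(np.linalg.norm(next_state - joint_prediction))

def synergy_bonus(next_state, composed_prediction):
    # Assumed synergy-biased variant: prediction error of a model that composes
    # per-agent effects, so outcomes the agents could achieve independently earn
    # little bonus. The exact formulation is not given in this summary; this is
    # an illustrative stand-in.
    return float(np.linalg.norm(next_state - composed_prediction))

def shaped_reward(sparse_reward, intrinsic_bonus, lam=LAMBDA):
    # Assumed combination: sparse task reward plus a lambda-weighted intrinsic
    # bonus. The paper reports lambda = 10; the additive form is an assumption.
    return sparse_reward + lam * intrinsic_bonus
```

Under this assumed form, the surprise bonus rewards any unpredictable outcome, while the synergy-style bonus rewards only outcomes that a composition of independent per-agent effects fails to explain, which is what would bias exploration toward coordinated behavior.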
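
The Experiment Setup and Hardware Specification rows together read as a Stable Baselines PPO configuration trained with 50 parallel workers. A minimal configuration sketch under those assumptions follows; PPO2, SubprocVecEnv, MlpPolicy, and the stand-in Gym environment are additions for illustration, since the paper's custom MuJoCo tasks and training code are not released.

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import SubprocVecEnv

def make_env():
    # Placeholder environment: the paper uses custom MuJoCo bimanual
    # manipulation and multi-agent locomotion tasks that are not released.
    return gym.make("CartPole-v1")

if __name__ == "__main__":
    # "Training is parallelized across 50 workers" -- modeled here as 50
    # subprocess environments (an assumption about how parallelism was done).
    env = SubprocVecEnv([make_env for _ in range(50)])

    # Hyperparameters as listed in the Experiment Setup row.
    model = PPO2(
        MlpPolicy,
        env,
        cliprange=0.2,        # PPO clipping parameter
        ent_coef=0.01,        # entropy loss coefficient
        vf_coef=0.5,          # value loss function coefficient
        max_grad_norm=0.5,    # gradient clip threshold
        n_steps=10,           # steps per worker per update
        nminibatches=4,       # minibatches per update
        noptepochs=4,         # optimization epochs per update
        learning_rate=1e-3,   # Adam learning rate (Adam is PPO2's default optimizer)
        verbose=1,
    )
    model.learn(total_timesteps=100_000)  # placeholder training budget
```

With 50 workers and 10 steps per worker, each update in this sketch would collect a batch of 500 transitions split into 4 minibatches, consistent with the reported settings.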