Reward Propagation Using Graph Convolutional Networks

Authors: Martin Klissarov, Doina Precup

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify empirically that our approach can achieve considerable improvements in both small and high-dimensional control problems. We first evaluate our approach in tabular domains, where we achieve performance similar to potential-based reward shaping built on the forward-backward algorithm. Unlike hand-engineered potential functions, our method scales naturally to more complex environments; we illustrate this on navigation-based vision tasks from the MiniWorld environment [Chevalier-Boisvert, 2018], on a variety of games from the Atari 2600 benchmark [Bellemare et al., 2012], and on a set of continuous control environments based on MuJoCo [Todorov et al., 2012], where our method fares significantly better than actor-critic algorithms [Sutton et al., 1999a, Schulman et al., 2017] and additional baselines.
Researcher Affiliation | Collaboration | Martin Klissarov (Mila, McGill University, martin.klissarov@mail.mcgill.ca); Doina Precup (Mila, McGill University and DeepMind, dprecup@cs.mcgill.ca)
Pseudocode | Yes | Algorithm 1: Reward shaping using GCNs (a minimal sketch of the shaping step appears after this table).
Open Source Code | No | The paper does not explicitly state that its own code for the described methodology is open source or provide a link to it; it only references third-party open-source implementations used as baselines.
Open Datasets | Yes | MiniWorld environment [Chevalier-Boisvert, 2018], Atari 2600 benchmark [Bellemare et al., 2012], MuJoCo [Todorov et al., 2012], CIFAR-10 images [Krizhevsky et al.]
Dataset Splits | No | The paper mentions 'Validation accuracy on the Cora dataset' in Figure 2b, but it does not provide specific training/validation/test splits (percentages, counts, or references to predefined splits) for the main experimental environments (MiniWorld, Atari, MuJoCo).
Hardware Specification | Yes | We did these evaluations on a single V100 GPU, 8 CPUs, and 40 GB of RAM.
Software Dependencies | No | The paper mentions software such as 'PyTorch' but does not provide specific version numbers for any libraries, frameworks, or solvers used in the experiments.
Experiment Setup | Yes | All details about hyperparameters and network architectures are provided in Appendix A.2. An important hyperparameter in our approach is α, which trades off between the reward-shaped return and the default return (a hedged illustration of this trade-off follows below). We also investigate the role of η, the hyperparameter trading off between the two losses of the GCN, in Appendix A.4.
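
Notes on Algorithm 1 (reward shaping using GCNs): the paper's core idea is to learn a potential function Φ over states with a graph convolutional network and then apply standard potential-based shaping, r + γΦ(s') − Φ(s). Since the table records that the authors' code is not released, the sketch below is only a minimal illustration, not their implementation: rewards are propagated over a toy state graph using the usual GCN normalization D^{-1/2}(A + I)D^{-1/2}, and the function names, the fixed number of propagation steps, and the toy chain environment are all assumptions.

```python
# Illustrative sketch of potential-based reward shaping with a
# graph-convolution-style propagation step. This is NOT the authors'
# released code; the one-step linear propagation, the step count, and
# all names are assumptions made for illustration.
import numpy as np

GAMMA = 0.99  # discount factor

def normalized_adjacency(A):
    """Symmetric normalization used by GCNs: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def propagate_potential(A, rewards, steps=10):
    """Propagate observed rewards over the state graph to obtain a
    potential Phi(s) for every state (a crude stand-in for the learned
    GCN potential described in the paper)."""
    A_norm = normalized_adjacency(A)
    phi = rewards.astype(float).copy()
    for _ in range(steps):
        phi = A_norm @ phi                  # one graph-convolution pass
    return phi

def shaped_reward(r, phi_s, phi_next):
    """Standard potential-based shaping: r + gamma*Phi(s') - Phi(s)."""
    return r + GAMMA * phi_next - phi_s

# Toy 4-state chain with reward only in the terminal state.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rewards = np.array([0.0, 0.0, 0.0, 1.0])
phi = propagate_potential(A, rewards)
print(shaped_reward(0.0, phi[0], phi[1]))   # shaping signal for a 0 -> 1 step
```

Because the shaping term is a potential difference, it densifies the sparse terminal reward along the chain without changing the optimal policy, which is the property that makes potential-based shaping safe to apply.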
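On the α trade-off from the Experiment Setup row: the exact formulation lives in the paper's Appendix A.2, which is not reproduced here. One plausible hedged reading, used below, weights the shaping term by α so that α = 0 recovers the default (unshaped) reward; the helper name and this additive form are assumptions for illustration.

```python
# Hypothetical reading of the alpha trade-off between the reward-shaped
# return and the default return. The exact form is specified in the
# paper's Appendix A.2; this additive weighting is an assumption.
def alpha_mixed_reward(r, phi_s, phi_next, alpha, gamma=0.99):
    shaping = gamma * phi_next - phi_s   # potential-based shaping term
    return r + alpha * shaping           # alpha = 0 -> default reward

if __name__ == "__main__":
    # Example: zero environment reward, potentials 0.2 -> 0.5
    print(alpha_mixed_reward(0.0, 0.2, 0.5, alpha=0.5))  # 0.5*(0.99*0.5 - 0.2)
```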