Hindsight Credit Assignment

Authors: Anna Harutyunyan, Will Dabney, Thomas Mesnard, Mohammad Gheshlaghi Azar, Bilal Piot, Nicolas Heess, Hado P. van Hasselt, Gregory Wayne, Satinder Singh, Doina Precup, Remi Munos

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We study the properties of these algorithms, and empirically show that they successfully address important credit assignment challenges, through a set of illustrative tasks."; section heading "5 Experiments".
Researcher Affiliation | Industry | "Anna Harutyunyan, Will Dabney, Thomas Mesnard, Nicolas Heess, Mohammad G. Azar, Bilal Piot, Hado van Hasselt, Satinder Singh, Greg Wayne, Doina Precup, Rémi Munos; DeepMind; {harutyunyan, wdabney, munos}@google.com"
Pseudocode | Yes | "See Algorithm 1 in appendix for the detailed pseudocode."; "See Algorithm 2 in appendix for the detailed pseudocode."
Open Source Code | No | The paper does not provide any statement or link regarding the availability of open-source code for the described methodology.
Open Datasets | No | "To empirically validate our proposal in a controlled way, we devised a set of diagnostic tasks that highlight issues 1-4, while also being representative of what occurs in practice (Fig. 2)." The paper uses custom-devised environments/tasks rather than existing publicly available datasets, and does not provide access information for them.
Dataset Splits | No | The experiments take place in custom reinforcement-learning environments where agents learn from interaction rather than from static datasets, so explicit train/validation/test splits in the supervised-learning sense do not apply to this work.
Hardware Specification | No | The paper does not report hardware details such as GPU/CPU models, processor types, or memory used to run the experiments.
Software Dependencies | No | The paper does not list software dependencies such as programming languages, libraries, or solvers with version numbers needed to replicate the experiments.
Experiment Setup | Yes | "For simplicity we take γ = 1 in all of the tasks."; "All the results are an average of 100 independent runs, with the plots depicting means and standard deviations."; "The policy between long and short paths initialized uniformly."; "Middle: Using full Monte Carlo returns (for n = 3) overcomes partial observability, but is prone to noise. The plot depicts learning curves for the setting with added white noise of σ = 2."; "Ambiguous bandit with Gaussian rewards of means 1, 2, and standard deviation 1.5."; "The algorithm proceeds by training V(X_s) to predict the usual return Z_s... and r̂(X_s, A_s) to predict R_s (square loss), the hindsight distribution h_β(a|X_s, X_t) to predict A_s (cross entropy loss), and finally by updating the policy logits..."
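
The last quoted passage outlines the supervised targets the algorithm trains: a value function on Monte Carlo returns, a reward model on immediate rewards, and a hindsight distribution on the actions actually taken. Below is a minimal tabular NumPy sketch of those three updates only, under assumed names and shapes (`hca_supervised_updates`, `alpha`, the table sizes); it is an illustration, not the paper's code, and the policy-logit update that the quote elides with "..." is left as a placeholder pointing to Algorithm 1 in the appendix.

```python
import numpy as np

# Hypothetical tabular setup (shapes and learning rate are assumptions).
n_states, n_actions = 5, 2
alpha = 0.1

V = np.zeros(n_states)                                # V(X_s): trained to predict the return Z_s
r_hat = np.zeros((n_states, n_actions))               # r̂(X_s, A_s): trained to predict R_s (square loss)
h_logits = np.zeros((n_states, n_states, n_actions))  # h_β(a | X_s, X_t): trained on A_s (cross-entropy)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def hca_supervised_updates(traj):
    """traj: list of (state, action, reward) tuples for one episode; γ = 1 as in the paper's tasks."""
    xs = [x for x, _, _ in traj]
    acts = [a for _, a, _ in traj]
    rews = [r for _, _, r in traj]
    T = len(traj)
    returns = np.cumsum(rews[::-1])[::-1]             # Z_s = R_s + R_{s+1} + ... (undiscounted)

    for s in range(T):
        x_s, a_s, r_s, Z_s = xs[s], acts[s], rews[s], returns[s]
        # Value regression toward the Monte Carlo return (square loss).
        V[x_s] += alpha * (Z_s - V[x_s])
        # Reward model regression toward the immediate reward (square loss).
        r_hat[x_s, a_s] += alpha * (r_s - r_hat[x_s, a_s])
        # Hindsight distribution: cross-entropy toward the action taken at s,
        # conditioned on each later state X_t along the trajectory.
        for t in range(s + 1, T):
            p = softmax(h_logits[x_s, xs[t]])
            grad = p.copy()
            grad[a_s] -= 1.0                          # gradient of cross-entropy w.r.t. logits
            h_logits[x_s, xs[t]] -= alpha * grad

    # The policy-logit update is elided ("...") in the quoted text;
    # see Algorithm 1 in the paper's appendix for its exact form.

# Example: one short trajectory of (state, action, reward) triples.
hca_supervised_updates([(0, 1, 0.0), (2, 0, 1.0), (4, 1, 0.0)])
```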