Hindsight Credit Assignment
Authors: Anna Harutyunyan, Will Dabney, Thomas Mesnard, Mohammad Gheshlaghi Azar, Bilal Piot, Nicolas Heess, Hado P. van Hasselt, Gregory Wayne, Satinder Singh, Doina Precup, Remi Munos
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the properties of these algorithms, and empirically show that they successfully address important credit assignment challenges, through a set of illustrative tasks. and 5 Experiments |
| Researcher Affiliation | Industry | Anna Harutyunyan, Will Dabney, Thomas Mesnard, Nicolas Heess, Mohammad G. Azar, Bilal Piot, Hado van Hasselt, Satinder Singh, Greg Wayne, Doina Precup, Rémi Munos DeepMind {harutyunyan, wdabney, munos}@google.com |
| Pseudocode | Yes | See Algorithm 1 in appendix for the detailed pseudocode. and See Algorithm 2 in appendix for the detailed pseudocode. |
| Open Source Code | No | The paper does not provide any explicit statement or link regarding the availability of open-source code for the described methodology. |
| Open Datasets | No | To empirically validate our proposal in a controlled way, we devised a set of diagnostic tasks that highlight issues 1-4, while also being representative of what occurs in practice (Fig. 2). The paper uses custom-devised environments/tasks rather than existing publicly available datasets, and does not provide access information for them. |
| Dataset Splits | No | The experiments run in custom-devised reinforcement learning environments where agents learn through interaction rather than from static datasets, so explicit train/validation/test splits as used in supervised learning do not apply to this work. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU/CPU models, processor types, or memory used for running the experiments. |
| Software Dependencies | No | The paper does not provide any specific ancillary software details, such as programming languages, libraries, or solvers with version numbers, needed to replicate the experiments. |
| Experiment Setup | Yes | For simplicity we take γ = 1 in all of the tasks. and All the results are an average of 100 independent runs, with the plots depicting means and standard deviations. and The policy between long and short paths was initialized uniformly. and Middle: Using full Monte Carlo returns (for n = 3) overcomes partial observability, but is prone to noise. The plot depicts learning curves for the setting with added white noise of σ = 2. and Ambiguous bandit with Gaussian rewards of means 1, 2, and standard deviation 1.5. and The algorithm proceeds by training V(X_s) to predict the usual return Z_s...and r̂(X_s, A_s) to predict R_s (square loss), the hindsight distribution h_β(a|X_s, X_t) to predict A_s (cross entropy loss), and finally by updating the policy logits... |
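The final row quotes the paper's training procedure for state-conditional HCA: fit V(X_s) to the return, r̂(X_s, A_s) to the immediate reward, and the hindsight distribution h_β(a|X_s, X_t) to the taken action, then update the policy logits. Since no source code is available (per the table above), here is a minimal tabular sketch of those updates, assuming a toy finite MDP with γ = 1 as in the paper's tasks; all names (`ALPHA`, `hca_update`, the `softmax` helper) and the tiny problem sizes are illustrative, not from the paper:

```python
# Hedged sketch of state-conditional HCA updates on a toy tabular problem.
import math

N_STATES, N_ACTIONS = 3, 2
ALPHA = 0.1  # single learning rate for all tables (illustrative choice)

V = [0.0] * N_STATES                                     # V(x): predicts return Z
r_hat = [[0.0] * N_ACTIONS for _ in range(N_STATES)]     # r̂(x, a): predicts R (square loss)
h_logits = [[[0.0] * N_ACTIONS for _ in range(N_STATES)]
            for _ in range(N_STATES)]                    # logits of h(a | x_s, x_t)
pi_logits = [[0.0] * N_ACTIONS for _ in range(N_STATES)] # policy logits

def softmax(logits):
    m = max(logits)
    e = [math.exp(l - m) for l in logits]
    s = sum(e)
    return [x / s for x in e]

def hca_update(traj):
    """One pass over a trajectory of (state, action, reward) tuples, gamma = 1."""
    T = len(traj)
    # Monte Carlo returns Z_s = sum of rewards from step s onward.
    Z, acc = [0.0] * T, 0.0
    for s in reversed(range(T)):
        acc += traj[s][2]
        Z[s] = acc
    for s, (x, a, r) in enumerate(traj):
        # V(x_s) -> Z_s and r̂(x_s, a_s) -> R_s (gradient steps on square losses).
        V[x] += ALPHA * (Z[s] - V[x])
        r_hat[x][a] += ALPHA * (r - r_hat[x][a])
        # Hindsight distribution: cross-entropy step toward the taken action,
        # conditioned on each future state x_t.
        for t in range(s + 1, T):
            xt = traj[t][0]
            p = softmax(h_logits[x][xt])
            for b in range(N_ACTIONS):
                h_logits[x][xt][b] += ALPHA * ((1.0 if b == a else 0.0) - p[b])
        # Policy logits: all-actions update with hindsight action values
        # Q(x, a) ≈ r̂(x, a) + sum_{t>s} (h(a | x, x_t) / π(a | x)) * R_t.
        pi = softmax(pi_logits[x])
        Q = list(r_hat[x])
        for t in range(s + 1, T):
            xt, _, rt = traj[t]
            h = softmax(h_logits[x][xt])
            for b in range(N_ACTIONS):
                Q[b] += (h[b] / pi[b]) * rt
        baseline = sum(pi[b] * Q[b] for b in range(N_ACTIONS))
        for b in range(N_ACTIONS):
            pi_logits[x][b] += ALPHA * pi[b] * (Q[b] - baseline)
```

This mirrors the quoted recipe step by step (return regression, reward regression, hindsight cross-entropy, policy-logit update) but is a sketch, not the authors' implementation; the paper's Algorithms 1 and 2 in the appendix are the authoritative pseudocode.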