Deep Reinforcement and InfoMax Learning
Authors: Bogdan Mazoure, Rémi Tachet des Combes, Thang Long Doan, Philip Bachman, R Devon Hjelm
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test our approach in several synthetic settings, where it successfully learns representations that are predictive of the future. Finally, we augment C51, a strong RL baseline, with our temporal DIM objective and demonstrate improved performance on a continual learning task and on the recently introduced Procgen environment. ... In this section, we first show how our proposed objective can be used to estimate state similarity in single Markov chains. We then show that DRIML can capture dynamics in locally deterministic systems (Ising model), which is useful in domains with partially deterministic transitions. We then provide results on a continual version of the Ms. Pac-Man game where the DIM loss is shown to converge faster for more deterministic tasks, and to help in a continual learning setting. Finally, we provide results on Procgen [Cobbe et al., 2019], which show that DRIML performs well when trained on 500 levels with fixed order. |
| Researcher Affiliation | Collaboration | Bogdan Mazoure (McGill University, Mila); Rémi Tachet des Combes (Microsoft Research Montréal); Thang Doan (McGill University, Mila); Philip Bachman (Microsoft Research Montréal); R Devon Hjelm (Microsoft Research Montréal; Université de Montréal, Mila) |
| Pseudocode | Yes | Algorithm 1: Deep Reinforcement and InfoMax Learning (DRIML) ... Algorithm 2: Adaptive lookahead selection |
| Open Source Code | Yes | Compute L^{N_t}_{DIM} using Eq. 5 (see Appendix 8.5 for PyTorch code); (a minimal sketch of a contrastive objective in this spirit appears after the table) |
| Open Datasets | Yes | Ms. Pac-Man from the Arcade Learning Environment [ALE, Bellemare et al., 2013], and all 16 games from the Procgen suite [Cobbe et al., 2019]. |
| Dataset Splits | No | The paper does not explicitly provide details about training/validation/test splits, such as specific percentages or sample counts. It describes the total training frames (50M) and the use of an experience replay buffer, but no distinct validation split information. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for experiments, such as specific GPU or CPU models, or cloud computing instance types with detailed specifications. |
| Software Dependencies | No | The paper mentions 'PyTorch code' but does not specify version numbers for PyTorch or any other software dependencies, which are necessary for full reproducibility. |
| Experiment Setup | Yes | All experimental details can be found in Appendix 8.6. ... The optimizer is Adam [Kingma and Ba, 2014] with a learning rate of 2.5e-4. The epsilon value in the ε-greedy policy goes from 1.0 to 0.01 over 2.5M frames. We use a minibatch size of 32, a replay buffer size of 100k, a target network updated every 1k frames, a discount factor γ = 0.99, and the number of atoms set to 51 for the C51 agent. The number of steps in n-step Q-learning is 5. All algorithms are trained for 50M environment frames. (These values are collected into a config sketch after the table.) |
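
The paper's own DIM-loss implementation is in its Appendix 8.5 and is not reproduced here. As a rough orientation only, the following is a minimal sketch of a temporal InfoNCE-style contrastive objective of the kind the table references, assuming batch-wise negatives and normalized embeddings; the function name, tensor shapes, and temperature parameter are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of a temporal InfoNCE-style objective in the spirit of the
# paper's DIM loss (Eq. 5). All names and shapes here are assumptions for
# illustration; see the paper's Appendix 8.5 for the authors' PyTorch code.
import torch
import torch.nn.functional as F

def temporal_infonce_loss(z_t, z_tk, temperature=1.0):
    """Contrast each state embedding z_t[i] against the embedding of its own
    future state z_tk[i] (positive pair) versus the futures of the other
    transitions in the minibatch (negatives).

    z_t, z_tk: tensors of shape (batch, dim).
    """
    z_t = F.normalize(z_t, dim=1)
    z_tk = F.normalize(z_tk, dim=1)
    # (batch, batch) similarity matrix; diagonal entries are the positives.
    logits = z_t @ z_tk.t() / temperature
    targets = torch.arange(z_t.size(0), device=z_t.device)
    return F.cross_entropy(logits, targets)

# Usage on random features, purely to show the expected shapes:
if __name__ == "__main__":
    b, d = 32, 128
    loss = temporal_infonce_loss(torch.randn(b, d), torch.randn(b, d))
    print(float(loss))
```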
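The experiment-setup row lists the reported hyperparameters in prose. For convenience, the sketch below collects those reported values into a single Python dict; the dict itself and its key names are assumptions for illustration, and only the values come from the quoted text.

```python
# Hyperparameters quoted in the experiment-setup row, gathered into one config.
# Key names are illustrative assumptions; values are as reported in the paper.
C51_DRIML_CONFIG = {
    "optimizer": "Adam",              # Kingma and Ba, 2014
    "learning_rate": 2.5e-4,
    "epsilon_start": 1.0,             # epsilon-greedy exploration schedule
    "epsilon_end": 0.01,
    "epsilon_decay_frames": 2_500_000,
    "batch_size": 32,
    "replay_buffer_size": 100_000,
    "target_update_frames": 1_000,
    "discount_gamma": 0.99,
    "num_atoms": 51,                  # C51 distributional RL
    "n_step": 5,                      # n-step Q-learning
    "total_env_frames": 50_000_000,
}
```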