Deep Reinforcement and InfoMax Learning

Authors: Bogdan Mazoure, Rémi Tachet des Combes, Thang Long Doan, Philip Bachman, R Devon Hjelm

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We test our approach in several synthetic settings, where it successfully learns representations that are predictive of the future. Finally, we augment C51, a strong RL baseline, with our temporal DIM objective and demonstrate improved performance on a continual learning task and on the recently introduced Procgen environment. ... In this section, we first show how our proposed objective can be used to estimate state similarity in single Markov chains. We then show that DRIML can capture dynamics in locally deterministic systems (Ising model), which is useful in domains with partially deterministic transitions. We then provide results on a continual version of the Ms. Pac-Man game where the DIM loss is shown to converge faster for more deterministic tasks, and to help in a continual learning setting. Finally, we provide results on Procgen [Cobbe et al., 2019], which show that DRIML performs well when trained on 500 levels with fixed order.
Researcher Affiliation Collaboration Bogdan Mazoure (McGill University, Mila); Rémi Tachet des Combes (Microsoft Research Montréal); Thang Doan (McGill University, Mila); Philip Bachman (Microsoft Research Montréal); R Devon Hjelm (Microsoft Research Montréal; Université de Montréal, Mila)
Pseudocode Yes Algorithm 1: Deep Reinforcement and InfoMax Learning (DRIML) ... Algorithm 2: Adaptive lookahead selection
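As a hedged illustration of how Algorithm 1 couples the two objectives, the sketch below performs one joint update with an RL loss plus a weighted temporal-DIM term. The function names, batch layout, lookahead handling, and aux_weight coefficient are hypothetical stand-ins, not the authors' interfaces.

```python
import torch

def driml_training_step(batch, encoder, rl_loss_fn, dim_loss_fn,
                        optimizer, aux_weight: float = 1.0) -> float:
    """One gradient step on a joint objective in the spirit of Algorithm 1:
    an RL (e.g. C51) loss plus a weighted temporal-DIM contrastive term.
    `batch` is assumed to contain observations at time t and at time t+k
    (the lookahead k is assumed to be chosen elsewhere, e.g. by Algorithm 2)."""
    feats_t = encoder(batch["obs_t"])     # features of the current state
    feats_tk = encoder(batch["obs_tk"])   # features of the state k steps ahead
    loss = rl_loss_fn(batch) + aux_weight * dim_loss_fn(feats_t, feats_tk)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```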
Open Source Code Yes Compute L^{N_t}_{MDIM} using Eq. 5 (see Appendix 8.5 for PyTorch code);
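The PyTorch code from Appendix 8.5 is not reproduced in this report; as a rough, generic illustration of the kind of contrastive loss involved, the sketch below computes an InfoNCE-style objective between features at time t and time t+k, using the other minibatch elements as negatives. It is a sketch under those assumptions, not the paper's Eq. 5, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def infonce_dim_loss(feats_t: torch.Tensor, feats_tk: torch.Tensor) -> torch.Tensor:
    """Generic InfoNCE-style contrastive loss between two batches of features.

    feats_t:  (B, D) features of states at time t (anchor view).
    feats_tk: (B, D) features of states at time t+k (positive view).
    Negatives are the other samples in the minibatch (an assumed setup,
    not necessarily the paper's exact formulation).
    """
    # Score every (anchor, candidate) pair with a dot product.
    logits = feats_t @ feats_tk.t()  # (B, B)
    # The diagonal entries are the positive pairs; the rest act as negatives.
    targets = torch.arange(feats_t.size(0), device=feats_t.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random features.
if __name__ == "__main__":
    loss = infonce_dim_loss(torch.randn(32, 128), torch.randn(32, 128))
    print(loss.item())
```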
Open Datasets Yes Ms. Pac-Man from the Arcade Learning Environment [ALE, Bellemare et al., 2013], and all 16 games from the Procgen suite [Cobbe et al., 2019].
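For orientation, the snippet below shows one common way to instantiate these environments through the public gym and procgen packages; the specific game, level range, and distribution mode are illustrative choices, not the paper's exact configuration (it uses the pre-0.26 gym reset/step API).

```python
import gym  # assumes `gym`, an Atari backend (ale-py/atari-py), and `procgen` are installed

# Ms. Pac-Man through the Arcade Learning Environment wrapper.
atari_env = gym.make("MsPacmanNoFrameskip-v4")

# A Procgen game restricted to 500 training levels (illustrative settings).
procgen_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=500,
    start_level=0,
    distribution_mode="easy",
)

obs = procgen_env.reset()
obs, reward, done, info = procgen_env.step(procgen_env.action_space.sample())
```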
Dataset Splits No The paper does not explicitly provide details about training/validation/test splits, such as specific percentages or sample counts. It describes the total training frames (50M) and the use of an experience replay buffer, but gives no information about a distinct validation split.
Hardware Specification No The paper does not explicitly describe the hardware used for experiments, such as specific GPU or CPU models, or cloud computing instance types with detailed specifications.
Software Dependencies No The paper mentions 'PyTorch code' but does not specify version numbers for PyTorch or any other software dependencies, which are necessary for full reproducibility.
Experiment Setup Yes All experimental details can be found in Appendix 8.6. ... The optimizer is Adam [Kingma and Ba, 2014] with a learning rate of 2.5e-4. The epsilon value in the ε-greedy policy goes from 1.0 to 0.01 over 2.5M frames. We use a minibatch size of 32, a replay buffer size of 100k, a target network updated every 1k frames, a discount factor γ = 0.99 and the number of atoms = 51 for the C51 agent. The number of steps in n-step Q-learning is 5. All algorithms are trained for 50M environment frames.
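For convenience, the quoted hyperparameters can be gathered into a single configuration. The dictionary below restates the values above (key names are arbitrary), together with the linear ε annealing schedule they appear to imply; linearity is an assumption, as the excerpt only gives the endpoints and the horizon.

```python
# Hyperparameters as quoted from Appendix 8.6 (key names are arbitrary).
C51_DRIML_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 2.5e-4,
    "epsilon_start": 1.0,
    "epsilon_end": 0.01,
    "epsilon_decay_frames": 2_500_000,
    "batch_size": 32,
    "replay_buffer_size": 100_000,
    "target_update_frames": 1_000,
    "discount_gamma": 0.99,
    "num_atoms": 51,
    "n_step": 5,
    "total_env_frames": 50_000_000,
}

def epsilon_at(frame: int, cfg: dict = C51_DRIML_CONFIG) -> float:
    """Exploration rate at a given frame, assuming linear annealing
    between the quoted endpoints over the quoted horizon."""
    frac = min(frame / cfg["epsilon_decay_frames"], 1.0)
    return cfg["epsilon_start"] + frac * (cfg["epsilon_end"] - cfg["epsilon_start"])
```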