Bridging Imagination and Reality for Model-Based Deep Reinforcement Learning

Authors: Guangxiang Zhu, Minghao Zhang, Honglak Lee, Chongjie Zhang

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on challenging visual control benchmarks (DeepMind Control Suite with image inputs [30]) and the results demonstrate that BIRD achieves state-of-the-art performance in terms of sample efficiency. Our ablation study further verifies that the superiority of BIRD comes from mutual information maximization rather than from an increase in policy entropy.
Researcher Affiliation | Academia | Guangxiang Zhu (IIIS, Tsinghua University, guangxiangzhu@outlook.com); Minghao Zhang (School of Software, Tsinghua University, mehoozhang@gmail.com); Honglak Lee (EECS, University of Michigan, honglak@eecs.umich.edu); Chongjie Zhang (IIIS, Tsinghua University, chongjie@tsinghua.edu.cn)
Pseudocode | Yes | Algorithm 1 ("BIRD Algorithm") summarizes the entire procedure for jointly optimizing mutual information and the policy (a hedged sketch of this training loop is given after the table).
Open Source Code | No | No explicit statement about releasing code for the described method, and no link to a repository, was found. The paper mentions: "We implement Dreamer by its released codes (https://github.com/google-research/dreamer)", which refers to a baseline, not the authors' own implementation.
Open Datasets | Yes | We evaluate BIRD on the DeepMind Control Suite (https://github.com/deepmind/dm_control) [30], a standard benchmark for continuous control (a minimal loading example is sketched after the table).
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits, only stating: "Among all environments, observations are 64 × 64 × 3 images, rewards are scaled to 0 to 1, and the dimensions of action space vary from 1 to 12. Action repeat is fixed at 2 for all tasks." It also mentions "Buffersize is 100k", which describes replay-buffer usage but not specific splits.
Hardware Specification | Yes | We train BIRD with a single Nvidia 2080Ti and a single CPU, and it takes 8 hours to run 1 million samples.
Software Dependencies | No | The paper mentions: "Policy network, reward network, and value network are all implemented with multi-layer perceptrons (MLP) and they [are] respectively trained with Adam optimizer [65]." It also mentions CNN layers and a GRU [64], but it names no software libraries or version numbers (an illustrative sketch of these components appears after the table).
Experiment Setup | Yes | Among all environments, observations are 64 × 64 × 3 images, rewards are scaled to 0 to 1, and the dimensions of the action space vary from 1 to 12. Action repeat is fixed at 2 for all tasks. We implement Dreamer by its released codes (https://github.com/google-research/dreamer) and all hyper-parameters remain the same as reported... For all experiments, we select a discount factor of 0.99 and a mutual information coefficient of 1e-8. Buffersize is 100k. (These settings are consolidated into a config sketch after the table.)
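
The paper's Algorithm 1 is not reproduced in this report, but the Pseudocode row summarizes it as jointly optimizing mutual information and the policy. Below is a minimal, hypothetical Python skeleton of such a loop, assuming the caller supplies the environment, replay buffer, and three update routines (collect_episode, model_mi_update, imagination_policy_update); these names are placeholders, not the authors' code.

# Hypothetical skeleton of the training loop summarized by Algorithm 1:
# collect real experience, update the world model while maximizing the mutual
# information between real and imagined trajectories, then optimize the
# policy and value networks in imagination. All callables are placeholders.
def train_bird(env, buffer, collect_episode, model_mi_update,
               imagination_policy_update, total_env_steps=1_000_000,
               updates_per_episode=100, mi_coef=1e-8):
    env_steps = 0
    while env_steps < total_env_steps:
        # 1. Interact with the real environment using the current policy
        #    and store the trajectory in the replay buffer.
        episode = collect_episode(env)
        buffer.add(episode)
        env_steps += len(episode)

        for _ in range(updates_per_episode):
            batch = buffer.sample()
            # 2. Update the world model while maximizing the mutual
            #    information term, weighted by the reported 1e-8 coefficient.
            model_mi_update(batch, mi_coef=mi_coef)
            # 3. Optimize policy and value on rollouts imagined by the
            #    learned world model.
            imagination_policy_update(batch)

The number of gradient updates per collected episode is an assumption; the paper's exact schedule follows the released Dreamer hyper-parameters.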
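
As a rough illustration of the DeepMind Control Suite usage noted in the Open Datasets row, the sketch below loads a suite task, applies an action repeat of 2, and renders 64 × 64 × 3 image observations. The walker-walk task and the uniform random policy are placeholders, not the paper's configuration.

# Illustrative DeepMind Control Suite interaction loop with pixel observations.
import numpy as np
from dm_control import suite

env = suite.load(domain_name="walker", task_name="walk")  # placeholder task
spec = env.action_spec()
action_repeat = 2

time_step = env.reset()
for _ in range(1000):
    # Placeholder policy: sample uniformly within the action bounds.
    action = np.random.uniform(spec.minimum, spec.maximum, size=spec.shape)
    reward = 0.0
    for _ in range(action_repeat):
        time_step = env.step(action)
        reward += time_step.reward or 0.0
        if time_step.last():
            break
    # Render the 64x64x3 image observation used as model input.
    obs = env.physics.render(height=64, width=64, camera_id=0)
    if time_step.last():
        time_step = env.reset()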
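
The component types quoted in the Software Dependencies row (MLP heads trained with Adam, CNN layers, a GRU) could be instantiated roughly as below. PyTorch is used here only for brevity; the Dreamer baseline the paper builds on is implemented in TensorFlow, and all layer sizes, learning rates, and optimizer groupings are assumptions rather than values from the paper.

# Illustrative component definitions: CNN encoder, GRU-based recurrent state,
# and MLP heads for policy, reward, and value, each trained with Adam.
# Sizes, learning rates, and the optimizer grouping are placeholders.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=400, depth=3):
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ELU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

encoder = nn.Sequential(  # CNN encoder for 64x64x3 image observations
    nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
    nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),
    nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(),
    nn.Flatten(),
)
rnn = nn.GRUCell(input_size=1024, hidden_size=200)  # 1024 = flattened CNN features
action_dim = 6                                      # varies from 1 to 12 per task
policy_net = mlp(in_dim=200, out_dim=action_dim)
reward_net = mlp(in_dim=200, out_dim=1)
value_net = mlp(in_dim=200, out_dim=1)

# Separate Adam optimizers; the grouping and learning rates are assumptions.
opt_model = torch.optim.Adam(
    list(encoder.parameters()) + list(rnn.parameters()) + list(reward_net.parameters()),
    lr=1e-4,
)
opt_policy = torch.optim.Adam(policy_net.parameters(), lr=1e-4)
opt_value = torch.optim.Adam(value_net.parameters(), lr=1e-4)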
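
For quick reference, the numeric settings quoted in the Experiment Setup row can be gathered into one configuration object. The sketch below is a hypothetical consolidation; the field names are illustrative and do not come from the authors' or Dreamer's code.

# Hypothetical consolidation of the reported experiment settings.
from dataclasses import dataclass

@dataclass
class BIRDConfig:
    obs_shape: tuple = (64, 64, 3)    # image observations
    reward_range: tuple = (0.0, 1.0)  # rewards scaled to this range
    action_repeat: int = 2            # fixed for all tasks
    discount: float = 0.99            # discount factor
    mi_coef: float = 1e-8             # mutual information coefficient
    buffer_size: int = 100_000        # "Buffersize is 100k"

config = BIRDConfig()  # the action dimension (1 to 12) is task-specific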