Bridging Imagination and Reality for Model-Based Deep Reinforcement Learning
Authors: Guangxiang Zhu, Minghao Zhang, Honglak Lee, Chongjie Zhang
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on challenging visual control benchmarks (DeepMind Control Suite with image inputs [30]) and the results demonstrate that BIRD achieves state-of-the-art performance in terms of sample efficiency. Our ablation study further verifies that the superiority of BIRD benefits from mutual information maximization rather than from the increase of policy entropy. |
| Researcher Affiliation | Academia | Guangxiang Zhu, IIIS, Tsinghua University (guangxiangzhu@outlook.com); Minghao Zhang, School of Software, Tsinghua University (mehoozhang@gmail.com); Honglak Lee, EECS, University of Michigan (honglak@eecs.umich.edu); Chongjie Zhang, IIIS, Tsinghua University (chongjie@tsinghua.edu.cn) |
| Pseudocode | Yes | Algorithm 1 summarizes our entire algorithm of optimizing mutual information and policy. (Caption: "Algorithm 1 BIRD Algorithm") |
| Open Source Code | No | No explicit statement about releasing the code for the described method or a link to its repository was found. The paper mentions: "We implement Dreamer by its released codes (https://github.com/google-research/dreamer)", which refers to a baseline, not their own work. |
| Open Datasets | Yes | We evaluate BIRD on DeepMind Control Suite (https://github.com/deepmind/dm_control) [30], a standard benchmark for continuous control. (A minimal environment-loading sketch follows the table.) |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits, only stating: "Among all environments, observations are 64×64×3 images, rewards are scaled to 0 to 1, and the dimensions of action space vary from 1 to 12. Action repeat is fixed at 2 for all tasks." It also mentions "Buffer size is 100k", which implies data usage but not specific splits. |
| Hardware Specification | Yes | We train BIRD with a single Nvidia 2080Ti and a single CPU, and it takes 8 hours to run 1 million samples. |
| Software Dependencies | No | The paper mentions: "Policy network, reward network, and value network are all implemented with multi-layer perceptrons (MLP) and they [are] respectively trained with Adam optimizer [65]." It names the Adam optimizer, CNN layers, and a GRU [64], but does not report versions for any software libraries or frameworks. |
| Experiment Setup | Yes | Among all environments, observations are 64×64×3 images, rewards are scaled to 0 to 1, and the dimensions of action space vary from 1 to 12. Action repeat is fixed at 2 for all tasks. We implement Dreamer by its released codes (https://github.com/google-research/dreamer) and all hyper-parameters remain the same as reported... For all experiments, we select a discount factor of 0.99 and a mutual information coefficient of 1e-8. Buffer size is 100k. (The quoted settings are collected in a summary sketch after the table.) |
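
The Open Datasets row points to the dm_control repository. For readers checking the quoted setup, the following is a minimal sketch (not the authors' code) of loading a DeepMind Control Suite task and reproducing the reported observation format: 64×64×3 rendered images with an action repeat of 2. The cheetah/run task and the `step_with_repeat` helper are illustrative assumptions, not choices confirmed by the paper.

```python
# Minimal sketch (not the authors' code) of a DeepMind Control Suite task with
# 64x64x3 pixel observations and an action repeat of 2, as quoted above.
import numpy as np
from dm_control import suite

env = suite.load(domain_name="cheetah", task_name="run")  # illustrative task
action_spec = env.action_spec()

ACTION_REPEAT = 2  # "Action repeat is fixed at 2 for all tasks."


def step_with_repeat(env, action, repeat=ACTION_REPEAT):
    """Apply the same action `repeat` times and sum the intermediate rewards."""
    total_reward = 0.0
    for _ in range(repeat):
        time_step = env.step(action)
        total_reward += time_step.reward or 0.0
        if time_step.last():
            break
    # Render the 64x64x3 image observation used in place of low-level state.
    pixels = env.physics.render(height=64, width=64, camera_id=0)
    return pixels, total_reward, time_step.last()


env.reset()
action = np.random.uniform(action_spec.minimum, action_spec.maximum,
                           size=action_spec.shape)
obs, reward, done = step_with_repeat(env, action)
print(obs.shape, reward, done)  # (64, 64, 3), summed reward, episode-end flag
```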
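
Likewise, the settings quoted in the Dataset Splits, Hardware Specification, and Experiment Setup rows can be gathered in one place. The dictionary below is only a hedged summary of what the paper reports; the key names are illustrative and are not identifiers from the (unreleased) official BIRD implementation.

```python
# Hedged summary of the training settings the paper reports; collected here only
# to make the Experiment Setup row easier to scan. Key names are illustrative.
BIRD_REPORTED_SETTINGS = {
    "observation_shape": (64, 64, 3),        # image observations
    "reward_range": (0.0, 1.0),              # rewards scaled to 0..1
    "action_dims": "1 to 12 (task-dependent)",
    "action_repeat": 2,                      # fixed for all tasks
    "discount_factor": 0.99,
    "mutual_information_coefficient": 1e-8,
    "replay_buffer_size": 100_000,           # "Buffer size is 100k"
    "optimizer": "Adam",                     # no library version reported
    "hardware": "1x NVIDIA 2080 Ti GPU, 1 CPU (~8 h per 1M samples)",
}
```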