Bridging Imagination and Reality for Model-Based Deep Reinforcement Learning
Authors: Guangxiang Zhu, Minghao Zhang, Honglak Lee, Chongjie Zhang
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on challenging visual control benchmarks (DeepMind Control Suite with image inputs [30]) and the results demonstrate that BIRD achieves state-of-the-art performance in terms of sample efficiency. Our ablation study further verifies that the superiority of BIRD benefits from mutual information maximization rather than from the increase of policy entropy. |
| Researcher Affiliation | Academia | Guangxiang Zhu, IIIS, Tsinghua University (guangxiangzhu@outlook.com); Minghao Zhang, School of Software, Tsinghua University (mehoozhang@gmail.com); Honglak Lee, EECS, University of Michigan (honglak@eecs.umich.edu); Chongjie Zhang, IIIS, Tsinghua University (chongjie@tsinghua.edu.cn) |
| Pseudocode | Yes | Algorithm 1 summarizes our entire algorithm of optimizing mutual information and policy. (Caption: "Algorithm 1 BIRD Algorithm") |
| Open Source Code | No | No explicit statement about releasing the code for the described method or a link to its repository was found. The paper mentions: "We implement Dreamer by its released codes (https://github.com/google-research/dreamer)", which refers to a baseline, not their own work. |
| Open Datasets | Yes | We evaluate BIRD on DeepMind Control Suite (https://github.com/deepmind/dm_control) [30], a standard benchmark for continuous control. (A minimal environment-loading sketch follows the table.) |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits, only stating: "Among all environments, observations are 64×64×3 images, rewards are scaled to 0 to 1, and the dimensions of action space vary from 1 to 12. Action repeat is fixed at 2 for all tasks." It also mentions "Buffer size is 100k", which implies data usage but not specific splits. |
| Hardware Specification | Yes | We train BIRD with a single Nvidia 2080Ti and a single CPU, and it takes 8 hours to run 1 million samples. |
| Software Dependencies | No | The paper mentions: "Policy network, reward network, and value network are all implemented with multi-layer perceptrons (MLP) and they [are] respectively trained with Adam optimizer [65]." It names the Adam optimizer, CNN layers, and a GRU [64], but does not report versions for any software libraries or frameworks. |
| Experiment Setup | Yes | Among all environments, observations are 64×64×3 images, rewards are scaled to 0 to 1, and the dimensions of action space vary from 1 to 12. Action repeat is fixed at 2 for all tasks. We implement Dreamer by its released codes (https://github.com/google-research/dreamer) and all hyper-parameters remain the same as reported... For all experiments, we select a discount factor of 0.99 and a mutual information coefficient of 1e-8. Buffer size is 100k. (The quoted settings are collected in a summary sketch after the table.) |
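
The Open Datasets row points to the dm_control repository. For readers checking the quoted setup, the following is a minimal sketch (not the authors' code) of loading a DeepMind Control Suite task and reproducing the reported observation format: 64×64×3 rendered images with an action repeat of 2. The cheetah/run task and the `step_with_repeat` helper are illustrative assumptions, not choices confirmed by the paper.

```python
# Minimal sketch (not the authors' code) of a DeepMind Control Suite task with
# 64x64x3 pixel observations and an action repeat of 2, as quoted above.
import numpy as np
from dm_control import suite

env = suite.load(domain_name="cheetah", task_name="run")  # illustrative task
action_spec = env.action_spec()

ACTION_REPEAT = 2  # "Action repeat is fixed at 2 for all tasks."


def step_with_repeat(env, action, repeat=ACTION_REPEAT):
    """Apply the same action `repeat` times and sum the intermediate rewards."""
    total_reward = 0.0
    for _ in range(repeat):
        time_step = env.step(action)
        total_reward += time_step.reward or 0.0
        if time_step.last():
            break
    # Render the 64x64x3 image observation used in place of low-level state.
    pixels = env.physics.render(height=64, width=64, camera_id=0)
    return pixels, total_reward, time_step.last()


env.reset()
action = np.random.uniform(action_spec.minimum, action_spec.maximum,
                           size=action_spec.shape)
obs, reward, done = step_with_repeat(env, action)
print(obs.shape, reward, done)  # (64, 64, 3), summed reward, episode-end flag
```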
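
Likewise, the settings quoted in the Dataset Splits, Hardware Specification, and Experiment Setup rows can be gathered in one place. The dictionary below is only a hedged summary of what the paper reports; the key names are illustrative and are not identifiers from the (unreleased) official BIRD implementation.

```python
# Hedged summary of the training settings the paper reports; collected here only
# to make the Experiment Setup row easier to scan. Key names are illustrative.
BIRD_REPORTED_SETTINGS = {
    "observation_shape": (64, 64, 3),        # image observations
    "reward_range": (0.0, 1.0),              # rewards scaled to 0..1
    "action_dims": "1 to 12 (task-dependent)",
    "action_repeat": 2,                      # fixed for all tasks
    "discount_factor": 0.99,
    "mutual_information_coefficient": 1e-8,
    "replay_buffer_size": 100_000,           # "Buffer size is 100k"
    "optimizer": "Adam",                     # no library version reported
    "hardware": "1x NVIDIA 2080 Ti GPU, 1 CPU (~8 h per 1M samples)",
}
```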