Task-aware world model learning with meta weighting via bi-level optimization

Authors: Huining Yuan, Hongkun Dou, Xingyu Jiang, Yue Deng

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate TEMPO on a variety of continuous and discrete control tasks from the DeepMind Control Suite and Atari video games. Our results demonstrate that TEMPO achieves state-of-the-art performance regarding asymptotic performance, training stability, and convergence speed. Furthermore, we perform ablation studies to demonstrate the advantage of our proposed meta-weighting mechanism.
Researcher Affiliation | Academia | Huining Yuan, Hongkun Dou, Xingyu Jiang, Yue Deng; School of Astronautics, Beihang University, Beijing, China; {hnyuan, douhk, jxy33zrhd, ydeng}@buaa.edu.cn
Pseudocode | Yes | Algorithm 1: Task-aware Environment Modeling Pipeline with Bi-level Optimization (TEMPO). (A hedged sketch of the generic bi-level meta-weighting pattern appears below the table.)
Open Source Code | Yes | Sample code for TEMPO is available at https://github.com/deng-ai-lab/TEMPO.
Open Datasets | Yes | We use 9 continuous control tasks from the DeepMind Control (DMC) Suite (Tassa et al., 2018)... and 6 discrete control tasks from Atari video games (Bellemare et al., 2013)...
Dataset Splits | No | The paper does not explicitly specify train/validation/test splits (by percentages or sample counts) or cite predefined splits that would allow reproduction.
Hardware Specification | Yes | All experiments were run on a single NVIDIA RTX 3090 GPU with Python 3.7 and TensorFlow 2.6.
Software Dependencies | Yes | All experiments were run on a single NVIDIA RTX 3090 GPU with Python 3.7 and TensorFlow 2.6.
Experiment Setup | Yes | Table 1: Main hyperparameters of TEMPO. DMC stands for DeepMind Control tasks. We use the default DreamerV2 settings for the World model, Agent, and Common groups. Specifically, we build the meta weighter as a 5-layer dense network (MLP) with concatenated states as input (see Equation 5) and a scalar meta weight as output. All hidden dimensions of the network are set to 400. Batch normalization (Ioffe and Szegedy, 2015) and ELU activation (Clevert et al., 2015) are applied after each hidden layer. A sigmoid function and an additive bias are applied after the output layer so that the weights center around 1, i.e., weight = 0.5 · σ(output) + 0.75. An Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1e-4 is used to update the meta weighter. (A hedged Keras sketch of this meta weighter also appears below the table.)
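
For the Pseudocode row above: the following is a minimal, hypothetical sketch of the generic bi-level meta-weighting pattern that the name of Algorithm 1 suggests, not a reproduction of the paper's actual pipeline. The toy parameters `W` (standing in for the world model) and `V` (the meta weighter), the quadratic loss, and the inner learning rate are all illustrative assumptions; only the weight parameterization 0.5 · σ(·) + 0.75 and the outer Adam learning rate of 1e-4 are taken from the paper.

```python
import tensorflow as tf

# Toy stand-ins: W plays the world model, V the meta weighter.
# Shapes, losses, and the inner learning rate are illustrative assumptions.
W = tf.Variable(tf.random.normal([4, 1]))
V = tf.Variable(tf.random.normal([4, 1]))
inner_lr = 1e-2
outer_opt = tf.keras.optimizers.Adam(learning_rate=1e-4)  # lr from Table 1

def per_sample_loss(params, x, y):
    # Squared error of a linear "world model"; stands in for the
    # paper's model-learning loss.
    return tf.squeeze((x @ params - y) ** 2, axis=-1)

def meta_weights(x):
    # Weight parameterization from the paper: 0.5 * sigmoid(out) + 0.75,
    # so each weight lies in (0.75, 1.25) and centers around 1.
    return tf.squeeze(0.5 * tf.sigmoid(x @ V) + 0.75, axis=-1)

def bilevel_step(x_train, y_train, x_meta, y_meta):
    with tf.GradientTape() as outer_tape:
        with tf.GradientTape() as inner_tape:
            # Inner level: model loss reweighted by the meta weighter.
            inner_loss = tf.reduce_mean(
                meta_weights(x_train) * per_sample_loss(W, x_train, y_train))
        grad_W = inner_tape.gradient(inner_loss, W)
        # Virtual SGD step on the model, kept differentiable w.r.t. V.
        W_virtual = W - inner_lr * grad_W
        # Outer level: evaluate the updated model on a meta batch.
        outer_loss = tf.reduce_mean(per_sample_loss(W_virtual, x_meta, y_meta))
    grad_V = outer_tape.gradient(outer_loss, V)
    outer_opt.apply_gradients([(grad_V, V)])
    W.assign(W_virtual)  # commit the inner update

# Usage with random data, purely to show the call shape:
x = tf.random.normal([32, 4]); y = tf.random.normal([32, 1])
bilevel_step(x, y, x, y)
```

The nested GradientTape pattern is what makes this bi-level: the outer gradient flows through the inner update, so the meta weighter is trained on how its weights affect the model after a step, not on the current loss.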
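
To make the Experiment Setup row concrete, here is a minimal Keras sketch of the meta weighter as quoted: 400-unit hidden layers with batch normalization and ELU, a scalar output passed through 0.5 · σ(·) + 0.75, and Adam at 1e-4. Whether the "5-layer" count includes the output layer is ambiguous in the quote, and `state_dim` (the size of the concatenated states) is an assumption; this is a sketch under those assumptions, not the authors' implementation.

```python
import tensorflow as tf

def build_meta_weighter(state_dim, hidden_units=400, num_hidden=4):
    """Sketch of the Table 1 meta weighter. num_hidden=4 assumes the
    "5-layer" MLP counts its output layer; adjust if it does not."""
    inputs = tf.keras.Input(shape=(state_dim,))  # concatenated states (Eq. 5)
    x = inputs
    for _ in range(num_hidden):
        x = tf.keras.layers.Dense(hidden_units)(x)
        x = tf.keras.layers.BatchNormalization()(x)   # Ioffe & Szegedy, 2015
        x = tf.keras.layers.Activation("elu")(x)      # Clevert et al., 2015
    out = tf.keras.layers.Dense(1)(x)                 # scalar meta weight
    # Sigmoid plus additive bias: weights lie in (0.75, 1.25), centered at 1.
    weight = 0.5 * tf.keras.activations.sigmoid(out) + 0.75
    return tf.keras.Model(inputs, weight)

# state_dim=2048 is hypothetical; the real input size depends on Equation 5.
meta_weighter = build_meta_weighter(state_dim=2048)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)  # lr from Table 1
```

Bounding the weights to (0.75, 1.25) keeps every sample's loss contribution close to 1, so the meta weighter can only gently rescale the world-model loss rather than zero out or blow up individual samples.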