Task-aware world model learning with meta weighting via bi-level optimization

Authors: Huining Yuan, Hongkun Dou, Xingyu Jiang, Yue Deng

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate TEMPO on a variety of continuous and discrete control tasks from the DeepMind Control Suite and Atari video games. Our results demonstrate that TEMPO achieves state-of-the-art performance regarding asymptotic performance, training stability, and convergence speed. Furthermore, we perform ablation studies to demonstrate the advantage of our proposed meta-weighting mechanism.
Researcher Affiliation | Academia | Huining Yuan, Hongkun Dou, Xingyu Jiang, Yue Deng; School of Astronautics, Beihang University, Beijing, China; {hnyuan, douhk, jxy33zrhd, ydeng}@buaa.edu.cn
Pseudocode | Yes | Algorithm 1: Task-aware Environment Modeling Pipeline with Bi-level Optimization (TEMPO). (A hedged sketch of the generic bi-level meta-weighting pattern appears below the table.)
Open Source Code | Yes | Sample code for TEMPO is available at https://github.com/deng-ai-lab/TEMPO.
Open Datasets | Yes | We use 9 continuous control tasks from the DeepMind Control (DMC) Suite (Tassa et al., 2018)... and 6 discrete control tasks from Atari video games (Bellemare et al., 2013)...
Dataset Splits | No | The paper does not explicitly specify train/validation/test splits (by percentages or sample counts) or cite predefined splits that would allow reproduction.
Hardware Specification | Yes | All experiments were run on a single NVIDIA RTX 3090 GPU with Python 3.7 and TensorFlow 2.6.
Software Dependencies | Yes | All experiments were run on a single NVIDIA RTX 3090 GPU with Python 3.7 and TensorFlow 2.6.
Experiment Setup | Yes | Table 1: Main hyperparameters of TEMPO. DMC stands for DeepMind Control tasks. We use the default DreamerV2 settings for the World model, Agent, and Common groups. Specifically, we build the meta weighter as a 5-layer dense network (MLP) with concatenated states as input (see Equation 5) and a scalar meta weight as output. All hidden dimensions of the network are set to 400. Batch normalization (Ioffe and Szegedy, 2015) and ELU activation (Clevert et al., 2015) are applied after each hidden layer. A sigmoid function and an additive bias are applied after the output layer so that the weights center around 1, i.e., weight = 0.5 · σ(output) + 0.75. An Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1e-4 is used to update the meta weighter. (A hedged Keras sketch of this meta weighter also appears below the table.)
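
For the Pseudocode row above: the following is a minimal, hypothetical sketch of the generic bi-level meta-weighting pattern that the name of Algorithm 1 suggests, not a reproduction of the paper's actual pipeline. The toy parameters `W` (standing in for the world model) and `V` (the meta weighter), the quadratic loss, and the inner learning rate are all illustrative assumptions; only the weight parameterization 0.5 · σ(·) + 0.75 and the outer Adam learning rate of 1e-4 are taken from the paper.

```python
import tensorflow as tf

# Toy stand-ins: W plays the world model, V the meta weighter.
# Shapes, losses, and the inner learning rate are illustrative assumptions.
W = tf.Variable(tf.random.normal([4, 1]))
V = tf.Variable(tf.random.normal([4, 1]))
inner_lr = 1e-2
outer_opt = tf.keras.optimizers.Adam(learning_rate=1e-4)  # lr from Table 1

def per_sample_loss(params, x, y):
    # Squared error of a linear "world model"; stands in for the
    # paper's model-learning loss.
    return tf.squeeze((x @ params - y) ** 2, axis=-1)

def meta_weights(x):
    # Weight parameterization from the paper: 0.5 * sigmoid(out) + 0.75,
    # so each weight lies in (0.75, 1.25) and centers around 1.
    return tf.squeeze(0.5 * tf.sigmoid(x @ V) + 0.75, axis=-1)

def bilevel_step(x_train, y_train, x_meta, y_meta):
    with tf.GradientTape() as outer_tape:
        with tf.GradientTape() as inner_tape:
            # Inner level: model loss reweighted by the meta weighter.
            inner_loss = tf.reduce_mean(
                meta_weights(x_train) * per_sample_loss(W, x_train, y_train))
        grad_W = inner_tape.gradient(inner_loss, W)
        # Virtual SGD step on the model, kept differentiable w.r.t. V.
        W_virtual = W - inner_lr * grad_W
        # Outer level: evaluate the updated model on a meta batch.
        outer_loss = tf.reduce_mean(per_sample_loss(W_virtual, x_meta, y_meta))
    grad_V = outer_tape.gradient(outer_loss, V)
    outer_opt.apply_gradients([(grad_V, V)])
    W.assign(W_virtual)  # commit the inner update

# Usage with random data, purely to show the call shape:
x = tf.random.normal([32, 4]); y = tf.random.normal([32, 1])
bilevel_step(x, y, x, y)
```

The nested GradientTape pattern is what makes this bi-level: the outer gradient flows through the inner update, so the meta weighter is trained on how its weights affect the model after a step, not on the current loss.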
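
To make the Experiment Setup row concrete, here is a minimal Keras sketch of the meta weighter as quoted: 400-unit hidden layers with batch normalization and ELU, a scalar output passed through 0.5 · σ(·) + 0.75, and Adam at 1e-4. Whether the "5-layer" count includes the output layer is ambiguous in the quote, and `state_dim` (the size of the concatenated states) is an assumption; this is a sketch under those assumptions, not the authors' implementation.

```python
import tensorflow as tf

def build_meta_weighter(state_dim, hidden_units=400, num_hidden=4):
    """Sketch of the Table 1 meta weighter. num_hidden=4 assumes the
    "5-layer" MLP counts its output layer; adjust if it does not."""
    inputs = tf.keras.Input(shape=(state_dim,))  # concatenated states (Eq. 5)
    x = inputs
    for _ in range(num_hidden):
        x = tf.keras.layers.Dense(hidden_units)(x)
        x = tf.keras.layers.BatchNormalization()(x)   # Ioffe & Szegedy, 2015
        x = tf.keras.layers.Activation("elu")(x)      # Clevert et al., 2015
    out = tf.keras.layers.Dense(1)(x)                 # scalar meta weight
    # Sigmoid plus additive bias: weights lie in (0.75, 1.25), centered at 1.
    weight = 0.5 * tf.keras.activations.sigmoid(out) + 0.75
    return tf.keras.Model(inputs, weight)

# state_dim=2048 is hypothetical; the real input size depends on Equation 5.
meta_weighter = build_meta_weighter(state_dim=2048)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)  # lr from Table 1
```

Bounding the weights to (0.75, 1.25) keeps every sample's loss contribution close to 1, so the meta weighter can only gently rescale the world-model loss rather than zero out or blow up individual samples.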