Task-aware world model learning with meta weighting via bi-level optimization
Authors: Huining Yuan, Hongkun Dou, Xingyu Jiang, Yue Deng
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate TEMPO on a variety of continuous and discrete control tasks from the DeepMind Control Suite and Atari video games. Our results demonstrate that TEMPO achieves state-of-the-art asymptotic performance, training stability, and convergence speed. Furthermore, we perform ablation studies to demonstrate the advantage of our proposed meta-weighting mechanism. |
| Researcher Affiliation | Academia | Huining Yuan Hongkun Dou Xingyu Jiang Yue Deng School of Astronautics, Beihang University, Beijing, China {hnyuan, douhk, jxy33zrhd, ydeng}@buaa.edu.cn |
| Pseudocode | Yes | Algorithm 1: Task-aware Environment Modeling Pipeline with bi-level Optimization (TEMPO) |
| Open Source Code | Yes | A sample code of TEMPO is available at https://github.com/deng-ai-lab/TEMPO. |
| Open Datasets | Yes | We use 9 continuous control tasks from the DeepMind Control (DMC) Suite (Tassa et al., 2018)... and 6 discrete control tasks from Atari video games (Bellemare et al., 2013)... |
| Dataset Splits | No | The paper does not explicitly specify train/validation/test dataset splits (by percentages or sample counts), nor does it refer to predefined splits with citations for reproduction. |
| Hardware Specification | Yes | All experiments were run on a single Nvidia RTX 3090 GPU with Python 3.7 and TensorFlow 2.6. |
| Software Dependencies | Yes | All experiments were run on a single Nvidia RTX 3090 GPU with Python 3.7 and TensorFlow 2.6. |
| Experiment Setup | Yes | Table 1: Main hyperparameters of TEMPO. DMC stands for DeepMind Control tasks. We use the default settings of DreamerV2 in World model, Agent, and Common. Specifically, we build the meta weighter as a 5-layer dense network (MLP) with concatenated states as input (see Equation 5) and a scalar meta weight as output. All hidden dimensions of the network are set to 400. Batch normalization (Ioffe and Szegedy, 2015) and ELU activation (Clevert et al., 2015) are applied after each hidden layer. A sigmoid function and an additive bias are applied to the meta weights after the output layer so that the weights center around 1, i.e., weight = 0.5 · σ(output) + 0.75. An Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1e-4 is used to update the meta weighter. |
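
The meta-weighter description quoted above is concrete enough to sketch. Below is a minimal TensorFlow sketch, not the authors' implementation: the class name `MetaWeighter`, the reading of "5-layer" as four 400-unit hidden layers plus the scalar output layer, and the input handling are assumptions, while the hidden width, batch normalization, ELU activations, the output transform weight = 0.5 · σ(output) + 0.75, and the Adam learning rate of 1e-4 come directly from the quoted setup.

```python
import tensorflow as tf

class MetaWeighter(tf.keras.Model):
    """Hypothetical sketch of the meta weighter described in Table 1.

    Architecture per the quoted setup: dense hidden layers of width 400,
    each followed by batch normalization and ELU, then a scalar output
    squashed so the weight lies in (0.75, 1.25), centered around 1.
    Interpreting "5-layer" as 4 hidden layers + 1 output layer is an assumption.
    """

    def __init__(self, hidden_dim=400, num_hidden=4):
        super().__init__()
        self.hidden = [tf.keras.layers.Dense(hidden_dim) for _ in range(num_hidden)]
        self.norms = [tf.keras.layers.BatchNormalization() for _ in range(num_hidden)]
        self.out = tf.keras.layers.Dense(1)  # scalar meta weight per sample

    def call(self, states, training=False):
        # `states` is the concatenated state input referred to in Equation 5.
        x = states
        for dense, norm in zip(self.hidden, self.norms):
            x = tf.nn.elu(norm(dense(x), training=training))
        # Sigmoid plus additive bias: weight = 0.5 * sigmoid(output) + 0.75.
        return 0.5 * tf.sigmoid(self.out(x)) + 0.75

# Adam with learning rate 1e-4, as stated for the meta weighter.
meta_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
```

By construction the weights are confined to (0.75, 1.25) and center at 1, so a freshly initialized weighter leaves the world-model loss approximately unweighted, and training can only moderately up- or down-weight individual samples.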