Meta-DT: Offline Meta-RL as Conditional Sequence Modeling with World Model Disentanglement
Authors: Zhi Wang, Li Zhang, Wenhao Wu, Yuanheng Zhu, Dongbin Zhao, Chunlin Chen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on MuJoCo and Meta-World benchmarks across various dataset types show that Meta-DT exhibits superior few-shot and zero-shot generalization capacity compared to strong baselines, while being more practical with fewer prerequisites. |
| Researcher Affiliation | Academia | ¹Nanjing University, ²Institute of Automation, Chinese Academy of Sciences. {zhiwang, clchen}@nju.edu.cn; {lizhang, whao_wu}@smail.nju.edu.cn; {yuanheng.zhu, dongbin.zhao}@ia.ac.cn |
| Pseudocode | Yes | Appendix A. Algorithm Pseudocodes. Based on the implementations in Sec. 4, this section gives brief procedures for the method: Algorithm 1 presents the pretraining of the context-aware world model; Algorithm 2 shows the pipeline for training Meta-DT, with the sub-procedure for generating the complementary prompt given in Algorithm 3; Algorithms 4 and 5 show the few-shot and zero-shot evaluations on test tasks, respectively. (A minimal sketch of the Stage-1 world-model pretraining appears after the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/NJU-RL/Meta-DT. |
| Open Datasets | Yes | We evaluate all tested methods on three classical benchmarks in meta-RL: i) the 2D navigation environment Point-Robot [25]; ii) the multi-task MuJoCo control suite [55, 36], containing Cheetah-Vel, Cheetah-Dir, Ant-Dir, Hopper-Param, and Walker-Param; and iii) the Meta-World manipulation platform [56], including Reach, Sweep, and Door-Lock. |
| Dataset Splits | No | For each environment, we randomly sample a distribution of tasks and divide them into the training set T_train and test set T_test. ... For the Point-Robot and MuJoCo environments, we sample 45 tasks for training and another 5 held-out tasks for testing. For Meta-World environments, we sample 15 tasks for training and 5 held-out tasks for testing. (No explicit validation set is mentioned; only a train/test split over tasks. See the task-split sketch after the table.) |
| Hardware Specification | Yes | We train our models on one NVIDIA RTX 4080 GPU with an Intel Core i9-10900X CPU and 256 GB of RAM. |
| Software Dependencies | No | The paper mentions implementing Meta-DT on top of the official DT codebase and notes the optimizer (Adam) and other training parameters, but it does not specify versions for key software dependencies such as Python, PyTorch/TensorFlow, or CUDA. |
| Experiment Setup | Yes | Some common hyperparameters shared across all experiments are set as: optimizer Adam, weight decay 1e-4, 10,000 linear warmup steps for the learning-rate schedule, gradient norm clip 0.25, dropout 0.1, and batch size 128. Table 7 presents the detailed hyperparameters of Meta-DT trained on the Point-Robot and MuJoCo domains with the Medium, Expert, and Mixed datasets. Table 8 presents the detailed hyperparameters of Meta-DT trained on Meta-World environments with the Medium datasets. (See the optimizer/warmup sketch after the table.) |
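
To make the pseudocode row concrete, the sketch below illustrates the general shape of Stage-1 pretraining of a context-aware world model (cf. Algorithm 1). All module names, dimensions, and the training loop are hypothetical stand-ins for illustration, not the authors' implementation; see the official repository at https://github.com/NJU-RL/Meta-DT for the real code.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Maps a short history of (state, action, reward) transitions to a task embedding z."""
    def __init__(self, obs_dim, act_dim, context_dim, hidden=128):
        super().__init__()
        self.gru = nn.GRU(obs_dim + act_dim + 1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, context_dim)

    def forward(self, history):               # history: (B, H, obs_dim + act_dim + 1)
        out, _ = self.gru(history)
        return self.head(out[:, -1])          # z: (B, context_dim)

class WorldModel(nn.Module):
    """Predicts next state and reward from (s, a, z); training it jointly with the
    encoder pushes z to carry the task-specific dynamics/reward information."""
    def __init__(self, obs_dim, act_dim, context_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim + 1))    # outputs [next_state, reward]

    def forward(self, s, a, z):
        return self.net(torch.cat([s, a, z], dim=-1))

# Stage 1: pretrain encoder + world model on the offline multi-task dataset.
obs_dim, act_dim, context_dim, H, B = 8, 2, 16, 20, 32
encoder = ContextEncoder(obs_dim, act_dim, context_dim)
world_model = WorldModel(obs_dim, act_dim, context_dim)
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(world_model.parameters()), lr=3e-4)

for _ in range(3):                             # dummy batches for illustration only
    history = torch.randn(B, H, obs_dim + act_dim + 1)
    s, a = torch.randn(B, obs_dim), torch.randn(B, act_dim)
    target = torch.randn(B, obs_dim + 1)       # ground-truth [s', r] from the dataset
    z = encoder(history)
    loss = nn.functional.mse_loss(world_model(s, a, z), target)
    opt.zero_grad(); loss.backward(); opt.step()
```

In the paper's pipeline, the pretrained encoder then supplies the task representation that conditions the decision transformer in Stage 2 (Algorithms 2-3); that stage is omitted here for brevity.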
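The dataset-splits row reports a train/test split over sampled tasks with no validation set. The toy helper below shows what such a split looks like; `split_tasks` and the uniform 2-D goal sampling are hypothetical illustrations, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_tasks(tasks, n_train, n_test):
    """Randomly partition sampled task parameters into disjoint train/test sets."""
    idx = rng.permutation(len(tasks))
    train = [tasks[i] for i in idx[:n_train]]
    test = [tasks[i] for i in idx[n_train:n_train + n_test]]
    return train, test

# e.g., 50 sampled 2-D goal positions for Point-Robot, split 45/5 as in the paper;
# Meta-World environments would use split_tasks(tasks, n_train=15, n_test=5).
goals = [tuple(rng.uniform(-1.0, 1.0, size=2)) for _ in range(50)]
train_tasks, test_tasks = split_tasks(goals, n_train=45, n_test=5)
assert not set(train_tasks) & set(test_tasks)  # held-out tasks never seen in training
```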
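Finally, the shared hyperparameters in the experiment-setup row translate into a standard PyTorch training configuration. This is a minimal sketch: only the named constants come from the paper; the model, data, and base learning rate of 1e-4 are assumptions (per-domain values live in the paper's Tables 7 and 8).

```python
import torch
import torch.nn as nn

# Shared hyperparameters reported in the paper's experiment setup.
WEIGHT_DECAY = 1e-4
WARMUP_STEPS = 10_000
GRAD_NORM_CLIP = 0.25
DROPOUT = 0.1
BATCH_SIZE = 128
LR = 1e-4  # assumed base value; see Tables 7-8 for per-domain settings

# Stand-in for the Meta-DT transformer policy (hypothetical architecture).
model = nn.Sequential(
    nn.Linear(32, 256), nn.ReLU(), nn.Dropout(DROPOUT), nn.Linear(256, 8))

optimizer = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)
# Linear warmup: ramp the learning rate from ~0 to LR over the first WARMUP_STEPS updates.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min((step + 1) / WARMUP_STEPS, 1.0))

for step in range(3):                          # dummy loop with random data
    x = torch.randn(BATCH_SIZE, 32)
    target = torch.randn(BATCH_SIZE, 8)
    loss = nn.functional.mse_loss(model(x), target)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_NORM_CLIP)
    optimizer.step()
    scheduler.step()
```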