Masked Autoencoding for Scalable and Generalizable Decision Making
Authors: Fangchen Liu, Hao Liu, Aditya Grover, Pieter Abbeel
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our empirical study, we find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching, and it can zero-shot infer skills from a few example transitions. In addition, MaskDP transfers well to offline RL and shows promising scaling behavior w.r.t. model size. It is amenable to data-efficient finetuning, achieving competitive results with prior methods based on autoregressive pretraining. In our experiments, we evaluate transfer learning in downstream tasks using MaskDP. Section 4.1 introduces the environments, pretraining, and the baselines compared in experiments. Section 4.2 summarizes the results of MaskDP on goal reaching, skill prompting, and offline RL. Through further analysis in Section 4.3, we present an ablation study on various design choices of our model. |
| Researcher Affiliation | Academia | Fangchen Liu1*, Hao Liu1*, Aditya Grover2, Pieter Abbeel1; 1 Berkeley AI Research, UC Berkeley; 2 UCLA |
| Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | The implementation of MaskDP is available at https://github.com/FangchenLiu/MaskDP_public |
| Open Datasets | Yes | We adopt the environment setup used in ExORL [37], based on the DeepMind Control Suite [29], where a domain describes the type of agent (e.g., Walker) but tasks are specified by rewards (e.g., Walker Walk, Walker Run). We provide a 2M buffer of the data collected by Proto-RL [36], as ExORL [37] does. |
| Dataset Splits | Yes | Single-goal reaching: for every trajectory in the validation set, we randomly sample a start state and a future state T ∈ [15, 20) steps ahead as the goal. All methods are evaluated on the same set of 300 state-goal pairs with a budget of T + 3 steps. Multi-goal reaching: for every trajectory in the validation set, we randomly sample a start state and 5 goal states at random future timesteps in [12, 60). We evaluate the same set of 100 state-goal sequences and add an additional 5-timestep budget for each goal. (A sketch of this sampling procedure follows the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper does not list specific version numbers for software dependencies or libraries used in the experiments (e.g., Python version, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | We pretrain agents for 400K gradient steps. By default, MaskDP uses a 3-layer encoder and 2-layer decoder, and the baselines based on GPT use 5 attention layers. MaskDP and all the above models are comparable, with similar architecture design and size, and share the same training hyper-parameters. Details about the architecture and training of MaskDP and the above baselines can be found in Section A. (An illustrative architecture sketch follows the table.) |
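
As a concrete reading of the goal-reaching splits quoted above, the following is a minimal sketch of how start/goal pairs could be drawn from validation trajectories. It is not taken from the MaskDP codebase: the function names, the use of NumPy, and the assumption that each trajectory is an array of states are ours. Only the horizon ranges ([15, 20) for single-goal, [12, 60) for multi-goal), the 300/100 evaluation-set sizes, and the step budgets come from the quoted setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_single_goal(trajectory, horizon=(15, 20)):
    """Draw one (start, goal) pair with the goal T in [15, 20) steps ahead; budget is T + 3."""
    T = int(rng.integers(*horizon))
    start = int(rng.integers(0, len(trajectory) - horizon[1]))
    return trajectory[start], trajectory[start + T], T + 3  # start state, goal state, step budget

def sample_multi_goal(trajectory, num_goals=5, horizon=(12, 60)):
    """Draw a start state and 5 goal states at increasing future offsets in [12, 60)."""
    start = int(rng.integers(0, len(trajectory) - horizon[1]))
    offsets = np.sort(rng.choice(np.arange(*horizon), size=num_goals, replace=False))
    return trajectory[start], [trajectory[start + t] for t in offsets], offsets

# Hypothetical usage, mirroring the evaluation-set sizes in the quoted setup:
# single_goal_set = [sample_single_goal(traj) for traj in validation_trajectories[:300]]
# multi_goal_set  = [sample_multi_goal(traj) for traj in validation_trajectories[:100]]
```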
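The layer counts in the experiment setup can likewise be read as a bidirectional encoder/decoder transformer in the masked-autoencoder style. The PyTorch sketch below is an illustration under assumptions, not the authors' implementation: the 3-layer encoder / 2-layer decoder counts come from the quoted setup, while the embedding dimension, head count, and the choice of `nn.TransformerEncoder` blocks for both halves are placeholders.

```python
import torch.nn as nn

# Placeholder widths; only the 3-layer encoder / 2-layer decoder counts come from the paper.
EMBED_DIM, N_HEADS = 256, 4

def make_blocks(num_layers):
    """Stack of bidirectional self-attention blocks (no causal mask)."""
    layer = nn.TransformerEncoderLayer(
        d_model=EMBED_DIM, nhead=N_HEADS, dim_feedforward=4 * EMBED_DIM, batch_first=True
    )
    return nn.TransformerEncoder(layer, num_layers=num_layers)

encoder = make_blocks(num_layers=3)  # processes the visible (unmasked) state/action tokens
decoder = make_blocks(num_layers=2)  # reconstructs the masked tokens from encoder outputs
```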