Pre-Trained Multi-Goal Transformers with Prompt Optimization for Efficient Online Adaptation
Authors: Haoqi Yuan, Yuhui Fu, Feiyang Xie, Zongqing Lu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate MGPO across diverse domains, including maze navigation, the robotic simulation environment Kitchen, and the open-world game Crafter. Our results demonstrate that MGPO significantly surpasses prior methods in terms of sample efficiency, online adaptation performance, robustness, and interpretability. In this section, we present experimental results obtained across various domains to evaluate the efficacy of MGPO. |
| Researcher Affiliation | Academia | 1School of Computer Science, Peking University 2Yuanpei College, Peking University 3Beijing Academy of Artificial Intelligence |
| Pseudocode | Yes | Our prompt optimization method is detailed in Algorithm 1 in Appendix E.1, where the implementations of UCB and BPE are also provided. Algorithm 1: Prompt Optimization in MGPO-UCB and MGPO-BPE. (An illustrative UCB-style sketch of this prompt-selection loop is given after the table.) |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We have released the code. |
| Open Datasets | Yes | Kitchen [16] is a robotic environment... We verified that the Kitchen datasets provided in D4RL [13] do not contain diverse transitions between the five subtasks. To collect trajectories completing different sets of subtasks, we use PPO [46] to train a policy for each subtask using a shaped reward function and varied initial states from the Kitchen-mixed-v0 dataset [13]. Crafter [18]: A simplified benchmark of the open-world game Minecraft... The dataset is collected using policies from AD [23]. (A short snippet for loading the D4RL Kitchen dataset follows the table.) |
| Dataset Splits | No | The paper describes training and testing but does not explicitly provide details about a separate validation dataset split with specific percentages or sample counts for hyperparameter tuning or early stopping, beyond the online adaptation phase. |
| Hardware Specification | Yes | All models are trained on a lab machine with a single NVIDIA RTX 4090 GPU and Intel i9 CPUs. |
| Software Dependencies | No | The paper mentions the use of "GPT-2" as a backbone and describes the model architecture, but it does not specify software dependencies like programming language, deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries with their specific version numbers. |
| Experiment Setup | Yes | Table 6: Hyperparameters used in pre-training for all environments: Embedding dimension 128; Number of layers 3; Number of attention heads 1; Activation ReLU; Batch size 64; Learning rate 1e-4; Learning rate decay weight 1e-4; Dropout 0.1; Warmup steps 10000. In Maze Runner, we sample prompts from the agent's locations in the whole trajectory. To augment the diversity of task goals and trajectory lengths, we truncate the trajectory at a random timestep h for each sampled trajectory and use o_h to represent its task goal. (A hedged configuration sketch reflecting these hyperparameters follows the table.) |
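
The pseudocode row cites Algorithm 1 (MGPO-UCB / MGPO-BPE) but the report does not reproduce it. Purely as an illustration of the UCB variant, the sketch below treats a fixed pool of candidate prompts as bandit arms scored by mean episodic return plus a UCB exploration bonus; the names `UCBPromptSelector` and `run_episode_with_prompt` are hypothetical and are not taken from the paper's released code.

```python
import math


class UCBPromptSelector:
    """UCB-style bandit over a fixed pool of candidate prompts.

    Illustrative sketch only; MGPO's Algorithm 1 may structure
    prompt optimization differently (e.g., the BPE variant).
    """

    def __init__(self, num_prompts, exploration_coef=1.0):
        self.counts = [0] * num_prompts          # episodes run with each prompt
        self.mean_returns = [0.0] * num_prompts  # running mean episodic return
        self.c = exploration_coef

    def select(self):
        # Try every prompt once before applying the UCB rule.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        total = sum(self.counts)
        scores = [
            m + self.c * math.sqrt(math.log(total) / n)
            for m, n in zip(self.mean_returns, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, prompt_idx, episode_return):
        # Incremental update of the chosen prompt's mean return.
        self.counts[prompt_idx] += 1
        n = self.counts[prompt_idx]
        self.mean_returns[prompt_idx] += (episode_return - self.mean_returns[prompt_idx]) / n


# Hypothetical online-adaptation loop (environment interaction omitted):
# selector = UCBPromptSelector(num_prompts=len(candidate_prompts))
# for _ in range(num_episodes):
#     i = selector.select()
#     ret = run_episode_with_prompt(candidate_prompts[i])  # roll out the pre-trained transformer
#     selector.update(i, ret)
```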
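The datasets row notes that the authors inspected the D4RL Kitchen data before collecting their own trajectories with PPO. For readers who want to repeat that inspection, a minimal loading snippet is below; it assumes the standard `d4rl` package with the older `gym` API, neither of which the paper explicitly names.

```python
import gym
import d4rl  # noqa: F401  (registers kitchen-mixed-v0 and other D4RL environments)

# Load the offline Kitchen dataset referenced in the quote above.
env = gym.make("kitchen-mixed-v0")
dataset = env.get_dataset()  # dict with 'observations', 'actions', 'rewards', 'terminals', ...
print(dataset["observations"].shape, dataset["actions"].shape)
```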
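The paper reports backbone and optimization hyperparameters (Table 6) but, as noted in the software-dependencies row, not the framework. Purely as an assumption, the sketch below instantiates a GPT-2 backbone with those numbers via PyTorch and Hugging Face `transformers`, reads "Learning rate decay weight 1e-4" as AdamW weight decay, and picks an arbitrary context length since none is reported.

```python
# Assumed PyTorch / Hugging Face instantiation; the paper does not name its framework.
import torch
from transformers import GPT2Config, GPT2Model

config = GPT2Config(
    vocab_size=1,              # placeholder: continuous embeddings are fed, not tokens
    n_positions=1024,          # context length, not reported in Table 6 (assumption)
    n_embd=128,                # embedding dimension
    n_layer=3,                 # number of layers
    n_head=1,                  # number of attention heads
    activation_function="relu",
    resid_pdrop=0.1,           # dropout 0.1 applied to residual, embedding, and attention
    embd_pdrop=0.1,
    attn_pdrop=0.1,
)
backbone = GPT2Model(config)

# "Learning rate decay weight 1e-4" is interpreted here as AdamW weight decay (assumption);
# the reported batch size of 64 would be set in the training DataLoader.
optimizer = torch.optim.AdamW(backbone.parameters(), lr=1e-4, weight_decay=1e-4)

# Linear warmup over the reported 10,000 steps, constant afterwards.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min((step + 1) / 10_000, 1.0)
)
```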