CoMPS: Continual Meta Policy Search
Authors: Glen Berseth, Zhiwei Zhang, Grace Zhang, Chelsea Finn, Sergey Levine
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that as the agent experiences more tasks, learning time on new tasks decreases, indicating that meta-reinforcement learning performance increases asymptotically with the number of tasks. Our experiments aim to analyze the performance of CoMPS on both stationary and non-stationary task sequences in the continual meta-learning setting, where each task is observed once and never revisited during learning. |
| Researcher Affiliation | Academia | Anonymous authors. Paper under double-blind review. |
| Pseudocode | Yes | Algorithm 1: CoMPS Meta-Learning |
| Open Source Code | Yes | We have also provided the code used for the experiments. |
| Open Datasets | Yes | Last, we utilize the suite of robotic manipulation tasks from Meta-World (Yu et al., 2020b). |
| Dataset Splits | No | The paper mentions 'held-out tasks' but does not specify explicit training/validation/test dataset splits with percentages or counts. |
| Hardware Specification | No | The paper mentions 'virtual machines with 8 CPUs and 16 GiB of memory' and 'cloud computing technology (AWS/GCP/Azure/slurm)', but lacks specific CPU/GPU models or detailed cloud instance specifications. |
| Software Dependencies | No | The paper mentions software components like PPO but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For the RL part of CoMPS, GMPS+PPO, PPO+TL, and PNC we explored the best hyperparameters... For the meta-step (M) that CoMPS and GMPS+PPO share, we tuned those algorithms using a similar process. We searched for the best values of the learning rate, the number of training updates per batch of data, alpha, the batch size, and the number of total iterations run after each new task. We selected the parameters that achieved the highest stable rewards during training and used the same values for both CoMPS and GMPS+PPO. For PNC we also performed hyperparameter tuning over the learning rate, the number of consolidation/compress steps, and the EWC weight. For PEARL we performed hyperparameter tuning over the learning rate, batch size, and the number of gradient steps. Table 1: CoMPS Hyperparameters (see the sweep sketch below the table). |
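The Experiment Setup row describes tuning the learning rate, the number of training updates per batch, alpha, the batch size, and the total iterations, then keeping the values that gave the highest stable training reward. The snippet below is a minimal sketch of that kind of sweep, assuming a hypothetical `train_comps` entry point and illustrative search ranges; neither the function nor the ranges are taken from the paper or its released code.

```python
from itertools import product
from statistics import mean

def train_comps(lr, updates_per_batch, alpha, batch_size, iterations):
    """Stand-in for a CoMPS/GMPS+PPO training run with one configuration.
    Returns a dummy per-iteration reward curve so the sweep below is runnable;
    a real sweep would launch the authors' released training code here."""
    return [0.0 for _ in range(10)]  # dummy rewards, not real results

# Illustrative search ranges only; the paper does not list these values.
search_space = {
    "lr": [3e-4, 1e-4, 3e-5],
    "updates_per_batch": [8, 16, 32],
    "alpha": [0.01, 0.1, 0.5],
    "batch_size": [1024, 4096],
    "iterations": [250, 500],
}

best_config, best_score = None, float("-inf")
for values in product(*search_space.values()):
    config = dict(zip(search_space.keys(), values))
    rewards = train_comps(**config)
    # "Highest stable rewards": score a configuration by its mean reward over
    # the final iterations rather than by a single best iteration.
    score = mean(rewards[-10:])
    if score > best_score:
        best_config, best_score = config, score

print("selected hyperparameters:", best_config, "score:", best_score)
```

Scoring each configuration by the mean of its final rewards is one simple way to operationalize "highest stable rewards"; the paper does not specify the exact selection criterion it used.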