Provably Efficient Model-based Policy Adaptation
Authors: Yuda Song, Aditi Mavalankar, Wen Sun, Sicun Gao
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the benefits of our approach for policy adaptation in a diverse set of continuous control tasks, achieving the performance of state-of-the-art methods with much lower sample complexity. Our project website, including code, can be found at https://yudasong.github.io/PADA. |
| Researcher Affiliation | Academia | 1 Department of Computer Science and Engineering, University of California, San Diego, La Jolla, USA; 2 Department of Computer Science, Cornell University, Ithaca, USA. |
| Pseudocode | Yes | Algorithm 1 Policy Adaptation with Data Aggregation; Algorithm 2 Policy Adaptation with Data Aggregation via Deviation Model |
| Open Source Code | Yes | Our project website, including code, can be found at https://yudasong.github.io/PADA. |
| Open Datasets | Yes | We focus on standard OpenAI Gym (Brockman et al., 2016) and MuJoCo (Todorov et al., 2012) control environments such as HalfCheetah, Ant, and Reacher. |
| Dataset Splits | No | The paper describes training in a source environment and evaluating in modified target environments, but does not report training/validation/test splits (e.g., percentages or counts) of a fixed dataset, as would be expected in a supervised learning setting. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions OpenAI Gym and MuJoCo as environments but does not provide specific version numbers for these or any other software libraries or dependencies used in the experiments. |
| Experiment Setup | Yes | More details of task designs are in Appendix B.1. ... We further include a long-term version of Fig. 2 and the hyperparameters in the Appendix. (Appendix C, Hyperparameters: All policies for HalfCheetah and Ant are trained with Adam optimizer with learning rate 3e-4, batch size 64, and discount factor 0.99. For Reacher, we use Adam optimizer with learning rate 5e-4, batch size 128, and discount factor 0.95. The number of policy updates is 20 for HalfCheetah and Ant, and 10 for Reacher. We use a 2-layer neural network with 256 hidden units and ReLU activation for both policy and value networks. The entropy coefficient is 0.01 for all tasks.) |
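
The Pseudocode row names Algorithm 1, Policy Adaptation with Data Aggregation. The paper's own algorithm is not reproduced here; the sketch below is only a generic DAgger-style data-aggregation skeleton, with hypothetical `rollout` and `fit` callables, meant to illustrate the aggregation pattern the name refers to.

```python
# Generic DAgger-style data-aggregation skeleton (illustration only, not the
# paper's Algorithm 1): roll out the current policy, add the new transitions
# to an aggregated dataset, and refit on everything collected so far.
from typing import Callable, List, Tuple


def data_aggregation_loop(
    initial_policy: Callable,
    rollout: Callable[[Callable], List[Tuple]],  # hypothetical: collects transitions with a policy
    fit: Callable[[List[Tuple]], Callable],      # hypothetical: refits a policy on aggregated data
    n_iters: int = 10,
) -> Callable:
    dataset: List[Tuple] = []
    policy = initial_policy
    for _ in range(n_iters):
        dataset += rollout(policy)  # collect data with the current policy
        policy = fit(dataset)       # update the policy on all data gathered so far
    return policy
```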
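
The Experiment Setup row quotes the Appendix C hyperparameters. A minimal sketch of how those settings could be instantiated is shown below; only the hyperparameter values come from the quoted appendix, while the PyTorch construction and the HalfCheetah observation/action dimensions are assumptions.

```python
# Hypothetical instantiation (not the authors' code) of the Appendix C settings:
# 2-layer MLPs with 256 hidden units and ReLU for policy and value networks,
# Adam optimizer, and the HalfCheetah/Ant hyperparameters quoted above.
import torch
import torch.nn as nn


def make_mlp(in_dim: int, out_dim: int, hidden: int = 256) -> nn.Sequential:
    """2-layer MLP with ReLU activations, as described for the policy and value nets."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


# Assumed dimensions for HalfCheetah (17-dim observation, 6-dim action).
obs_dim, act_dim = 17, 6
policy_net = make_mlp(obs_dim, act_dim)
value_net = make_mlp(obs_dim, 1)

# HalfCheetah / Ant settings from Appendix C: Adam, lr 3e-4, batch 64, gamma 0.99,
# 20 policy updates, entropy coefficient 0.01.
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=3e-4)
value_opt = torch.optim.Adam(value_net.parameters(), lr=3e-4)
batch_size, gamma, entropy_coef, policy_updates = 64, 0.99, 0.01, 20

# Reacher would instead use lr=5e-4, batch_size=128, gamma=0.95, and 10 policy updates.
```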