Provably Efficient Model-based Policy Adaptation

Authors: Yuda Song, Aditi Mavalankar, Wen Sun, Sicun Gao

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the benefits of our approach for policy adaptation in a diverse set of continuous control tasks, achieving the performance of state-of-the-art methods with much lower sample complexity. Our project website, including code, can be found at https://yudasong.github.io/PADA.
Researcher Affiliation | Academia | 1) Department of Computer Science and Engineering, University of California, San Diego, La Jolla, USA; 2) Department of Computer Science, Cornell University, Ithaca, USA.
Pseudocode | Yes | Algorithm 1: Policy Adaptation with Data Aggregation; Algorithm 2: Policy Adaptation with Data Aggregation via Deviation Model.
Open Source Code | Yes | Our project website, including code, can be found at https://yudasong.github.io/PADA.
Open Datasets | Yes | We focus on standard OpenAI Gym (Brockman et al., 2016) and MuJoCo (Todorov et al., 2012) control environments such as HalfCheetah, Ant, and Reacher.
Dataset Splits | No | The paper describes training and testing in different environments but does not provide explicit training/validation/test splits (e.g., percentages or counts) of a fixed dataset, as would be expected in a supervised-learning setting.
Hardware Specification | No | The paper does not report the hardware (e.g., CPU/GPU models, memory, or cloud instance types) used to run the experiments.
Software Dependencies | No | The paper names OpenAI Gym and MuJoCo as environments but does not give version numbers for them or for any other software libraries or dependencies used in the experiments.
Experiment Setup | Yes | More details of task designs are in Appendix B.1. ... We further include a long-term version of Fig 2 and the hyperparameters in the Appendix. (Appendix C, Hyperparameters: All policies for HalfCheetah and Ant are trained with Adam optimizer with learning rate 3e-4, batch size 64, and discount factor 0.99. For Reacher, we use Adam optimizer with learning rate 5e-4, batch size 128, and discount factor 0.95. The number of policy updates is 20 for HalfCheetah and Ant, and 10 for Reacher. We use a 2-layer neural network with 256 hidden units and ReLU activation for both policy and value networks. The entropy coefficient is 0.01 for all tasks.)
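For concreteness, below is a minimal Python sketch of how the quoted environment and hyperparameter setup might look with Gym and PyTorch. It is a reconstruction for illustration only, not the authors' released code; the environment ID/version, module names, and variable names are assumptions.

```python
# Illustrative sketch of the reported setup: an OpenAI Gym MuJoCo environment and
# the quoted HalfCheetah/Ant hyperparameters (Adam, lr 3e-4, batch size 64,
# discount 0.99, entropy coefficient 0.01, 2-layer 256-unit ReLU MLPs).
# This is NOT the authors' code; see https://yudasong.github.io/PADA for that.
import gym
import torch
import torch.nn as nn


def mlp(in_dim: int, out_dim: int, hidden: int = 256) -> nn.Sequential:
    """2-layer MLP with 256 hidden units and ReLU activation, as reported."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


env = gym.make("HalfCheetah-v2")  # env ID/version suffix is an assumption
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]

policy_net = mlp(obs_dim, act_dim)  # policy network
value_net = mlp(obs_dim, 1)         # value network

# Quoted hyperparameters for HalfCheetah and Ant
# (Reacher: lr 5e-4, batch size 128, discount 0.95, 10 policy updates).
learning_rate = 3e-4
batch_size = 64
discount = 0.99
entropy_coef = 0.01
num_policy_updates = 20

optimizer = torch.optim.Adam(
    list(policy_net.parameters()) + list(value_net.parameters()),
    lr=learning_rate,
)
```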