Provably Efficient Model-based Policy Adaptation
Authors: Yuda Song, Aditi Mavalankar, Wen Sun, Sicun Gao
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the benefits of our approach for policy adaptation in a diverse set of continuous control tasks, achieving the performance of state-of-the-art methods with much lower sample complexity. Our project website, including code, can be found at https://yudasong.github.io/PADA. |
| Researcher Affiliation | Academia | 1 Department of Computer Science and Engineering, University of California, San Diego, La Jolla, USA; 2 Department of Computer Science, Cornell University, Ithaca, USA. |
| Pseudocode | Yes | Algorithm 1 Policy Adaptation with Data Aggregation; Algorithm 2 Policy Adaptation with Data Aggregation via Deviation Model |
| Open Source Code | Yes | Our project website, including code, can be found at https://yudasong.github.io/PADA. |
| Open Datasets | Yes | We focus on standard OpenAI Gym (Brockman et al., 2016) and MuJoCo (Todorov et al., 2012) control environments such as HalfCheetah, Ant, and Reacher. |
| Dataset Splits | No | The paper describes training in a source environment and evaluating in modified target environments, but does not report training/validation/test splits (e.g., percentages or counts) of a fixed dataset, as would be expected in a supervised learning setting. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions OpenAI Gym and MuJoCo as environments but does not provide specific version numbers for these or any other software libraries or dependencies used in the experiments. |
| Experiment Setup | Yes | More details of task designs are in Appendix B.1. ... We further include a long-term version of Fig. 2 and the hyperparameters in the Appendix. (Appendix C, Hyperparameters: All policies for HalfCheetah and Ant are trained with Adam optimizer with learning rate 3e-4, batch size 64, and discount factor 0.99. For Reacher, we use Adam optimizer with learning rate 5e-4, batch size 128, and discount factor 0.95. The number of policy updates is 20 for HalfCheetah and Ant, and 10 for Reacher. We use a 2-layer neural network with 256 hidden units and ReLU activation for both policy and value networks. The entropy coefficient is 0.01 for all tasks.) |
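
The Pseudocode row names Algorithm 1, Policy Adaptation with Data Aggregation. The paper's own algorithm is not reproduced here; the sketch below is only a generic DAgger-style data-aggregation skeleton, with hypothetical `rollout` and `fit` callables, meant to illustrate the aggregation pattern the name refers to.

```python
# Generic DAgger-style data-aggregation skeleton (illustration only, not the
# paper's Algorithm 1): roll out the current policy, add the new transitions
# to an aggregated dataset, and refit on everything collected so far.
from typing import Callable, List, Tuple


def data_aggregation_loop(
    initial_policy: Callable,
    rollout: Callable[[Callable], List[Tuple]],  # hypothetical: collects transitions with a policy
    fit: Callable[[List[Tuple]], Callable],      # hypothetical: refits a policy on aggregated data
    n_iters: int = 10,
) -> Callable:
    dataset: List[Tuple] = []
    policy = initial_policy
    for _ in range(n_iters):
        dataset += rollout(policy)  # collect data with the current policy
        policy = fit(dataset)       # update the policy on all data gathered so far
    return policy
```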
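
The Experiment Setup row quotes the Appendix C hyperparameters. A minimal sketch of how those settings could be instantiated is shown below; only the hyperparameter values come from the quoted appendix, while the PyTorch construction and the HalfCheetah observation/action dimensions are assumptions.

```python
# Hypothetical instantiation (not the authors' code) of the Appendix C settings:
# 2-layer MLPs with 256 hidden units and ReLU for policy and value networks,
# Adam optimizer, and the HalfCheetah/Ant hyperparameters quoted above.
import torch
import torch.nn as nn


def make_mlp(in_dim: int, out_dim: int, hidden: int = 256) -> nn.Sequential:
    """2-layer MLP with ReLU activations, as described for the policy and value nets."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


# Assumed dimensions for HalfCheetah (17-dim observation, 6-dim action).
obs_dim, act_dim = 17, 6
policy_net = make_mlp(obs_dim, act_dim)
value_net = make_mlp(obs_dim, 1)

# HalfCheetah / Ant settings from Appendix C: Adam, lr 3e-4, batch 64, gamma 0.99,
# 20 policy updates, entropy coefficient 0.01.
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=3e-4)
value_opt = torch.optim.Adam(value_net.parameters(), lr=3e-4)
batch_size, gamma, entropy_coef, policy_updates = 64, 0.99, 0.01, 20

# Reacher would instead use lr=5e-4, batch_size=128, gamma=0.95, and 10 policy updates.
```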