Model-based Policy Optimization with Unsupervised Model Adaptation
Authors: Jian Shen, Han Zhao, Weinan Zhang, Yong Yu
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our approach achieves state-of-the-art performance in terms of sample efficiency on a range of continuous control benchmark tasks. |
| Researcher Affiliation | Collaboration | Shanghai Jiao Tong University; D. E. Shaw & Co. {rockyshen, wnzhang, yyu}@apex.sjtu.edu.cn; han.zhao@cs.cmu.edu |
| Pseudocode | Yes | Algorithm 1 AMPO |
| Open Source Code | Yes | Our code is publicly available at: https://github.com/RockySJ/ampo |
| Open Datasets | Yes | We evaluate AMPO and other baselines on six MuJoCo continuous control tasks with a maximum horizon of 1000 from OpenAI Gym [Brockman et al., 2016], including InvertedPendulum, Swimmer, Hopper, Walker2d, Ant and HalfCheetah. |
| Dataset Splits | No | Every time we train the dynamics model, we randomly sample a subset of real data as a validation set and stop model training if the validation loss does not decrease for five gradient steps, which means we do not choose a specific value for the hyperparameter G1. (See the dynamics-model training sketch after the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments. |
| Software Dependencies | No | We implement all our experiments using TensorFlow. |
| Experiment Setup | Yes | In each adaptation iteration, we train the critic for five steps and then train the feature extractor for one step, and the coefficient α of the gradient penalty is set to 10. (See the adaptation-step sketch after the table.) |
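
The dataset-splits row describes a stopping rule rather than a fixed split: a randomly held-out subset of real data serves as a validation set, and dynamics-model training stops once the validation loss has not improved for five gradient steps. The following is a minimal sketch of that rule, not the authors' code; the names `model`, `model.gradient_step`, `model.validation_loss`, and `real_buffer` are illustrative assumptions.

```python
import random

def train_dynamics_model(model, real_buffer, batch_size=256, patience=5, valid_frac=0.1):
    """Train a dynamics model until its validation loss stalls for `patience` steps."""
    data = list(real_buffer)
    random.shuffle(data)
    n_valid = max(1, int(valid_frac * len(data)))
    valid, train = data[:n_valid], data[n_valid:]  # randomly sampled validation set

    best, stall = float("inf"), 0
    while stall < patience:
        batch = random.sample(train, min(batch_size, len(train)))
        model.gradient_step(batch)            # one gradient step on a training batch (hypothetical API)
        loss = model.validation_loss(valid)   # loss on the held-out real data (hypothetical API)
        if loss < best:
            best, stall = loss, 0
        else:
            stall += 1                        # stop after `patience` non-improving steps
```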
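
The experiment-setup row summarizes the unsupervised model adaptation step: per adaptation iteration, the critic is trained for five steps and the feature extractor for one step, with a gradient-penalty coefficient of 10. Below is a hedged PyTorch sketch of one such iteration under a WGAN-GP-style formulation; it is not the authors' TensorFlow implementation, and the names `extractor`, `critic`, `real_batch`, and `sim_batch` are illustrative assumptions.

```python
import torch

def gradient_penalty(critic, real_feat, sim_feat, alpha=10.0):
    """Gradient penalty on features interpolated between real and simulated batches."""
    eps = torch.rand(real_feat.size(0), 1, device=real_feat.device)
    inter = (eps * real_feat + (1 - eps) * sim_feat).requires_grad_(True)
    scores = critic(inter)
    grads = torch.autograd.grad(scores.sum(), inter, create_graph=True)[0]
    return alpha * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

def adaptation_iteration(extractor, critic, opt_f, opt_c, real_batch, sim_batch):
    # 1) Five critic steps: estimate the distance between real and simulated features.
    for _ in range(5):
        real_feat = extractor(real_batch).detach()
        sim_feat = extractor(sim_batch).detach()
        loss_c = (critic(sim_feat).mean() - critic(real_feat).mean()
                  + gradient_penalty(critic, real_feat, sim_feat))
        opt_c.zero_grad()
        loss_c.backward()
        opt_c.step()

    # 2) One feature-extractor step: reduce the critic's estimated distance,
    #    so real and simulated features become harder to tell apart.
    loss_f = critic(extractor(real_batch)).mean() - critic(extractor(sim_batch)).mean()
    opt_f.zero_grad()
    loss_f.backward()
    opt_f.step()
```

The 5:1 update ratio and α = 10 mirror the values quoted in the table; optimizer choices and network architectures are left open, as the entry does not specify them.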