Model-based Policy Optimization with Unsupervised Model Adaptation

Authors: Jian Shen, Han Zhao, Weinan Zhang, Yong Yu

NeurIPS 2020

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our approach achieves state-of-the-art performance in terms of sample efficiency on a range of continuous control benchmark tasks. |
| Researcher Affiliation | Collaboration | Shanghai Jiao Tong University and D. E. Shaw & Co ({rockyshen, wnzhang, yyu}@apex.sjtu.edu.cn; han.zhao@cs.cmu.edu) |
| Pseudocode | Yes | Algorithm 1 AMPO |
| Open Source Code | Yes | Our code is publicly available at: https://github.com/RockySJ/ampo |
| Open Datasets | Yes | We evaluate AMPO and other baselines on six MuJoCo continuous control tasks with a maximum horizon of 1000 from OpenAI Gym [Brockman et al., 2016], including InvertedPendulum, Swimmer, Hopper, Walker2d, Ant and HalfCheetah. |
| Dataset Splits | No | Every time we train the dynamics model, we randomly sample some real data as a validation set and stop the model training if the model loss does not decrease for five gradient steps, which means we do not choose a specific value for the hyperparameter G1. (See the early-stopping sketch below.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments. |
| Software Dependencies | No | We implement all our experiments using TensorFlow. |
| Experiment Setup | Yes | In each adaptation iteration, we train the critic for five steps and then train the feature extractor for one step, and the coefficient α of gradient penalty is set to 10. (See the adaptation-loop sketch below.) |
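The Dataset Splits row describes a patience-based early-stopping rule on a randomly held-out validation set rather than a fixed number of model-training steps. Below is a minimal Python sketch of that rule; `train_with_early_stopping`, `step_fn`, `val_loss_fn`, and the toy least-squares model are illustrative placeholders, not the authors' API.

```python
import numpy as np

def train_with_early_stopping(step_fn, val_loss_fn, patience=5, max_steps=10_000):
    """Keep taking gradient steps and stop once the validation loss has not
    decreased for `patience` consecutive steps, as quoted in the table above.
    `step_fn` performs one gradient step on the training data; `val_loss_fn`
    evaluates the loss on the held-out real data."""
    best, stale = float("inf"), 0
    for _ in range(max_steps):
        step_fn()
        loss = val_loss_fn()
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best

# Toy usage (illustrative only): gradient descent on a 1-D least-squares "model",
# standing in for the learned dynamics model.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)
idx = rng.permutation(200)              # random hold-out split of the real data
tr, va = idx[:150], idx[150:]
w = [0.0]                               # single model parameter

def step_fn():
    grad = 2.0 * np.mean((w[0] * x[tr] - y[tr]) * x[tr])
    w[0] -= 0.05 * grad

best_val = train_with_early_stopping(
    step_fn, lambda: float(np.mean((w[0] * x[va] - y[va]) ** 2)))
```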
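The Experiment Setup row (five critic steps, one feature-extractor step, gradient-penalty coefficient α = 10) reads like a WGAN-GP-style estimator of a Wasserstein-type distance between real and model-generated feature distributions; the sketch below assumes that reading. It is a hedged TensorFlow 2 illustration, not the authors' released implementation: `feature_extractor`, `critic`, `real_batch`, `sim_batch`, and the optimizers are assumed Keras models, tensors, and `tf.keras` optimizers.

```python
import tensorflow as tf

ALPHA_GP = 10.0      # gradient-penalty coefficient alpha quoted above
CRITIC_STEPS = 5     # critic updates per adaptation iteration quoted above

def gradient_penalty(critic, real_feat, sim_feat):
    """Penalize the critic's gradient norm on interpolates of real/simulated features."""
    eps = tf.random.uniform([tf.shape(real_feat)[0], 1], 0.0, 1.0)
    interp = eps * real_feat + (1.0 - eps) * sim_feat
    with tf.GradientTape() as tape:
        tape.watch(interp)
        scores = critic(interp)
    grads = tape.gradient(scores, interp)
    norms = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=-1) + 1e-12)
    return tf.reduce_mean(tf.square(norms - 1.0))

def adaptation_iteration(feature_extractor, critic, real_batch, sim_batch,
                         critic_opt, feat_opt):
    # Five critic steps: maximize the estimated distance between real and
    # simulated feature distributions (minimize its negation plus the penalty).
    for _ in range(CRITIC_STEPS):
        with tf.GradientTape() as tape:
            real_feat = feature_extractor(real_batch)
            sim_feat = feature_extractor(sim_batch)
            w_est = tf.reduce_mean(critic(real_feat)) - tf.reduce_mean(critic(sim_feat))
            critic_loss = -w_est + ALPHA_GP * gradient_penalty(critic, real_feat, sim_feat)
        grads = tape.gradient(critic_loss, critic.trainable_variables)
        critic_opt.apply_gradients(zip(grads, critic.trainable_variables))
    # One feature-extractor step: reduce the estimated distance.
    with tf.GradientTape() as tape:
        w_est = (tf.reduce_mean(critic(feature_extractor(real_batch)))
                 - tf.reduce_mean(critic(feature_extractor(sim_batch))))
    grads = tape.gradient(w_est, feature_extractor.trainable_variables)
    feat_opt.apply_gradients(zip(grads, feature_extractor.trainable_variables))
```

The 5:1 ratio of critic to feature-extractor updates and the penalty coefficient of 10 are taken directly from the quoted setup; everything else in the sketch is an assumption about how such an adversarial feature-adaptation step is typically wired up.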