Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Model-based Policy Optimization with Unsupervised Model Adaptation
Authors: Jian Shen, Han Zhao, Weinan Zhang, Yong Yu
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our approach achieves state-of-the-art performance in terms of sample efficiency on a range of continuous control benchmark tasks. |
| Researcher Affiliation | Collaboration | Shanghai Jiao Tong University; D. E. Shaw & Co. |
| Pseudocode | Yes | Algorithm 1 AMPO |
| Open Source Code | Yes | Our code is publicly available at: https://github.com/RockySJ/ampo |
| Open Datasets | Yes | We evaluate AMPO and other baselines on six MuJoCo continuous control tasks with a maximum horizon of 1000 from OpenAI Gym [Brockman et al., 2016], including InvertedPendulum, Swimmer, Hopper, Walker2d, Ant and HalfCheetah. |
| Dataset Splits | No | Every time we train the dynamics model, we randomly sample some real data as a validation set and stop model training if the validation loss does not decrease for five gradient steps; this means we do not choose a specific value for the hyperparameter G1. (This early-stopping rule is sketched below the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments. |
| Software Dependencies | No | We implement all our experiments using TensorFlow. |
| Experiment Setup | Yes | In each adaptation iteration, we train the critic for five steps and then train the feature extractor for one step, and the coefficient α of gradient penalty is set to 10. (This schedule is sketched below the table.) |
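The dataset-splits row describes an early-stopping rule for dynamics-model training rather than a fixed train/validation split. A minimal Python sketch of that rule, assuming hypothetical `model_step` and `val_loss` callables (neither name comes from the paper or its code):

```python
def train_dynamics_model(model_step, val_loss, patience=5):
    """Train until the validation loss stops decreasing for `patience` gradient steps.

    `model_step` performs one gradient step on the dynamics model; `val_loss`
    evaluates it on real data sampled as a validation set. Both callables are
    hypothetical stand-ins, not the AMPO implementation.
    """
    best = float("inf")
    stale = 0
    while stale < patience:
        model_step()
        loss = val_loss()
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
    return best
```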
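The experiment-setup row pins down only the adaptation schedule (five critic steps per feature-extractor step) and the gradient-penalty coefficient α = 10. Below is a minimal sketch of a WGAN-GP-style penalty and that 5:1 schedule; the specific loss forms, the `sample_feats` callable, the optimizers, and the PyTorch framing are all assumptions for illustration (the paper's implementation uses TensorFlow), not the authors' code:

```python
import torch

ALPHA = 10.0  # gradient-penalty coefficient reported in the paper

def gradient_penalty(critic, real_feats, fake_feats):
    """WGAN-GP-style penalty on interpolations between real and generated features."""
    eps = torch.rand(real_feats.size(0), 1, device=real_feats.device)
    interp = (eps * real_feats + (1 - eps) * fake_feats).requires_grad_(True)
    grads = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)[0]
    return ALPHA * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

def adaptation_iteration(critic, sample_feats, critic_opt, feat_opt):
    """One adaptation iteration: five critic steps, then one feature-extractor step.

    `sample_feats` is a hypothetical callable returning (real, generated) feature
    batches produced by the feature extractor; the two optimizers update the
    critic and the feature extractor respectively.
    """
    for _ in range(5):  # train the critic for five steps
        real, fake = sample_feats()
        real, fake = real.detach(), fake.detach()  # critic step leaves the extractor fixed
        loss = critic(fake).mean() - critic(real).mean()
        loss = loss + gradient_penalty(critic, real, fake)
        critic_opt.zero_grad(); loss.backward(); critic_opt.step()
    real, fake = sample_feats()  # then train the feature extractor for one step
    feat_loss = critic(real).mean() - critic(fake).mean()  # shrink the estimated distance
    feat_opt.zero_grad(); feat_loss.backward(); feat_opt.step()
```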