Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Model-based Policy Optimization with Unsupervised Model Adaptation
Authors: Jian Shen, Han Zhao, Weinan Zhang, Yong Yu
NeurIPS 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our approach achieves state-of-the-art performance in terms of sample efficiency on a range of continuous control benchmark tasks. |
| Researcher Affiliation | Collaboration | Shanghai Jiao Tong University, D. E. Shaw & Co |
| Pseudocode | Yes | Algorithm 1 AMPO |
| Open Source Code | Yes | Our code is publicly available at: https://github.com/RockySJ/ampo |
| Open Datasets | Yes | We evaluate AMPO and other baselines on six MuJoCo continuous control tasks with a maximum horizon of 1000 from OpenAI Gym [Brockman et al., 2016], including InvertedPendulum, Swimmer, Hopper, Walker2d, Ant and HalfCheetah. |
| Dataset Splits | No | Every time we train the dynamics model, we randomly sample several real data as a validation set and stop the model training if the model loss does not decrease for five gradient steps, which means we do not choose a specific value for the hyperparameter G1. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments. |
| Software Dependencies | No | We implement all our experiments using TensorFlow. |
| Experiment Setup | Yes | In each adaptation iteration, we train the critic for five steps and then train the feature extractor for one step, and the coefficient α of gradient penalty is set to 10. |
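The stopping rule quoted in the Dataset Splits row (stop training the dynamics model once the validation loss has not decreased for five gradient steps) can be illustrated with a minimal sketch. The function name `early_stopping_step`, the `patience` parameter, and the example loss values are hypothetical and chosen for illustration only; the authors' implementation may differ, for example in how it defines an improvement.

```python
from typing import Iterable, Optional


def early_stopping_step(val_losses: Iterable[float], patience: int = 5) -> Optional[int]:
    """Return the 1-based step at which training would stop, or None.

    Assumption: "does not decrease" means the validation loss is not
    strictly lower than the best value seen so far.
    """
    best = float("inf")
    steps_without_improvement = 0
    for step, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
            if steps_without_improvement >= patience:
                return step
    return None


# Hypothetical validation losses, one per gradient step.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.7, 0.9, 0.5]
stop_at = early_stopping_step(losses, patience=5)
```

Under this rule the number of model training steps is determined by the validation loss rather than fixed in advance, which matches the quoted remark that no specific value is chosen for G1.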