Variational Model-based Policy Optimization
Authors: Yinlam Chow, Brandon Cui, Moonkyung Ryu, Mohammad Ghavamzadeh
IJCAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on a number of continuous control tasks show that our model-based (E-step) algorithm, which we refer to as variational model-based policy optimization (VMBPO), is more sample-efficient and robust to hyper-parameter tuning than its model-free (E-step) counterpart. Using the same control tasks, we also compare VMBPO with several state-of-the-art model-based and model-free RL algorithms and show its sample efficiency and performance. [Section 6, Experiments] To illustrate the effectiveness of VMBPO, we (i) compare it with several state-of-the-art RL methods, and (ii) evaluate sample efficiency of MBRL via ablation analysis. |
| Researcher Affiliation | Industry | Yinlam Chow¹, Brandon Cui², Moonkyung Ryu¹ and Mohammad Ghavamzadeh¹; ¹Google AI, ²Facebook AI Research; yinlamchow@google.com, bcui@fb.com, {mkryu, ghavamza}@google.com |
| Pseudocode | Yes | We describe the E-step and M-step of VMBPO in details in Sections 5.1 and 5.2, and report its pseudo-code in Algorithm 1 in Appendix C. |
| Open Source Code | No | No explicit statement about releasing source code or a link to a code repository was found. |
| Open Datasets | Yes | We evaluate all the algorithms on a classical control task: Pendulum, and five MuJoCo tasks: Hopper, Walker2D, HalfCheetah, Reacher, and Reacher7DOF. |
| Dataset Splits | No | The paper mentions training steps and environments but does not provide specific details on training, validation, or test dataset splits or how data was partitioned for experiments. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were provided. |
| Software Dependencies | No | The paper mentions using MuJoCo tasks but does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | The detailed description of the network architectures and hyper-parameters is reported in Appendix E. We set the number of training steps to 400,000 for the difficult environments (Walker2D, HalfCheetah), to 150,000 for the medium one (Hopper), and to 50,000 for the simpler ones (Pendulum, Reacher, Reacher7DOF). |
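
The Pseudocode row above points to Algorithm 1, which alternates a model-based E-step with a policy-improvement M-step. Below is a minimal sketch of that loop structure only, assuming hypothetical helpers `update_model`, `update_critic`, and `update_policy` and a generic replay buffer; it is not the authors' released implementation.

```python
# Hypothetical skeleton of an alternating E-step / M-step training loop.
# All helper names (update_model, update_critic, update_policy, replay_buffer)
# are placeholders, not functions from the paper's code.
def train_vmbpo(env, policy, model, critic, replay_buffer, num_steps):
    state = env.reset()
    for step in range(num_steps):
        # Collect experience with the current policy.
        action = policy.sample(state)
        next_state, reward, done, _ = env.step(action)
        replay_buffer.add(state, action, reward, next_state, done)
        state = env.reset() if done else next_state

        batch = replay_buffer.sample()
        # E-step: fit the variational dynamics model and value estimates.
        update_model(model, batch)
        update_critic(critic, model, batch)
        # M-step: improve the policy against the learned model and critic.
        update_policy(policy, model, critic, batch)
    return policy
```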
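
The Open Datasets and Experiment Setup rows list the control tasks and per-task training budgets. A minimal sketch of how these environments could be instantiated with OpenAI Gym follows; the `-v0`/`-v2` environment IDs are an assumption (the paper does not specify versions), and Reacher7DOF is omitted because it is not a standard Gym task.

```python
import gym

# Assumed Gym IDs for the tasks named in the paper, paired with the
# training-step budgets reported in Appendix E / Section 6.
TRAINING_STEPS = {
    "Pendulum-v0": 50_000,
    "Reacher-v2": 50_000,
    "Hopper-v2": 150_000,
    "Walker2d-v2": 400_000,
    "HalfCheetah-v2": 400_000,
}

for env_id, num_steps in TRAINING_STEPS.items():
    env = gym.make(env_id)  # the MuJoCo tasks additionally require mujoco-py
    print(env_id, num_steps, env.observation_space.shape, env.action_space.shape)
    env.close()
```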