Variational Model-based Policy Optimization

Authors: Yinlam Chow, Brandon Cui, Moonkyung Ryu, Mohammad Ghavamzadeh

IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on a number of continuous control tasks show that our model-based (E-step) algorithm, which we refer to as variational model-based policy optimization (VMBPO), is more sample-efficient and robust to hyper-parameter tuning than its model-free (E-step) counterpart. Using the same control tasks, we also compare VMBPO with several state-of-the-art model-based and model-free RL algorithms and show its sample efficiency and performance. (Section 6, Experiments:) To illustrate the effectiveness of VMBPO, we (i) compare it with several state-of-the-art RL methods, and (ii) evaluate sample efficiency of MBRL via ablation analysis.
Researcher Affiliation | Industry | Yinlam Chow^1, Brandon Cui^2, Moonkyung Ryu^1 and Mohammad Ghavamzadeh^1; ^1 Google AI, ^2 Facebook AI Research; yinlamchow@google.com, bcui@fb.com, {mkryu, ghavamza}@google.com
Pseudocode | Yes | We describe the E-step and M-step of VMBPO in detail in Sections 5.1 and 5.2, and report its pseudo-code in Algorithm 1 in Appendix C.
Open Source Code | No | No explicit statement about releasing source code or a link to a code repository was found.
Open Datasets | Yes | We evaluate all the algorithms on a classical control task: Pendulum, and five MuJoCo tasks: Hopper, Walker2D, HalfCheetah, Reacher, and Reacher7DoF.
Dataset Splits | No | The paper mentions training steps and environments but does not provide specific details on training, validation, or test dataset splits, or on how data was partitioned for experiments.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were provided.
Software Dependencies | No | The paper mentions MuJoCo tasks but does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | The detailed description of the network architectures and hyper-parameters is reported in Appendix E. We set the number of training steps to 400,000 for the difficult environments (Walker2D, HalfCheetah), to 150,000 for the medium one (Hopper), and to 50,000 for the simpler ones (Pendulum, Reacher, Reacher7DoF).
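
For readers attempting a reproduction, the environment list (Open Datasets) and the per-environment training budgets (Experiment Setup) quoted above can be collected into a small configuration. The sketch below is a minimal illustration, not code from the paper: the Gym environment IDs and version suffixes (e.g., Pendulum-v0, Hopper-v2) are assumptions, and Reacher7DoF is not a standard Gym task, so its ID here is a placeholder that would require a custom environment registration.

```python
import gym  # assumes OpenAI Gym with MuJoCo bindings installed

# Per-environment training budgets as stated in the paper's experiment setup.
# Environment ID suffixes are assumed; the paper does not specify Gym versions.
TRAIN_STEPS = {
    "Pendulum-v0": 50_000,      # simpler task
    "Reacher-v2": 50_000,       # simpler task
    "Reacher7DoF": 50_000,      # simpler task; non-standard ID, placeholder only
    "Hopper-v2": 150_000,       # medium-difficulty task
    "Walker2d-v2": 400_000,     # difficult task
    "HalfCheetah-v2": 400_000,  # difficult task
}

def make_env(env_id: str):
    """Instantiate one of the evaluation environments (assumed Gym IDs)."""
    return gym.make(env_id)

if __name__ == "__main__":
    # Print the assumed training budget for each task.
    for env_id, steps in TRAIN_STEPS.items():
        print(f"{env_id}: train for {steps} environment steps")
```

Network architectures and other hyper-parameters are reported only in Appendix E of the paper, so they are not reflected in this sketch.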