Variational Model-based Policy Optimization

Authors: Yinlam Chow, Brandon Cui, Moonkyung Ryu, Mohammad Ghavamzadeh

IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on a number of continuous control tasks show that our model-based (E-step) algorithm, which we refer to as variational model-based policy optimization (VMBPO), is more sample-efficient and robust to hyper-parameter tuning than its model-free (E-step) counterpart. Using the same control tasks, we also compare VMBPO with several state-of-the-art model-based and model-free RL algorithms and show its sample efficiency and performance. (Section 6, Experiments:) To illustrate the effectiveness of VMBPO, we (i) compare it with several state-of-the-art RL methods, and (ii) evaluate sample efficiency of MBRL via ablation analysis.
Researcher Affiliation | Industry | Yinlam Chow^1, Brandon Cui^2, Moonkyung Ryu^1 and Mohammad Ghavamzadeh^1; ^1 Google AI, ^2 Facebook AI Research; yinlamchow@google.com, bcui@fb.com, {mkryu, ghavamza}@google.com
Pseudocode | Yes | We describe the E-step and M-step of VMBPO in detail in Sections 5.1 and 5.2, and report its pseudo-code in Algorithm 1 in Appendix C.
Open Source Code | No | No explicit statement about releasing source code or a link to a code repository was found.
Open Datasets | Yes | We evaluate all the algorithms on a classical control task: Pendulum, and five MuJoCo tasks: Hopper, Walker2D, HalfCheetah, Reacher, and Reacher7DoF.
Dataset Splits | No | The paper mentions training steps and environments but does not provide specific details on training, validation, or test dataset splits, or on how data was partitioned for experiments.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running experiments were provided.
Software Dependencies | No | The paper mentions MuJoCo tasks but does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | The detailed description of the network architectures and hyper-parameters is reported in Appendix E. We set the number of training steps to 400,000 for the difficult environments (Walker2D, HalfCheetah), to 150,000 for the medium one (Hopper), and to 50,000 for the simpler ones (Pendulum, Reacher, Reacher7DoF).
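
For readers attempting a reproduction, the environment list (Open Datasets) and the per-environment training budgets (Experiment Setup) quoted above can be collected into a small configuration. The sketch below is a minimal illustration, not code from the paper: the Gym environment IDs and version suffixes (e.g., Pendulum-v0, Hopper-v2) are assumptions, and Reacher7DoF is not a standard Gym task, so its ID here is a placeholder that would require a custom environment registration.

```python
import gym  # assumes OpenAI Gym with MuJoCo bindings installed

# Per-environment training budgets as stated in the paper's experiment setup.
# Environment ID suffixes are assumed; the paper does not specify Gym versions.
TRAIN_STEPS = {
    "Pendulum-v0": 50_000,      # simpler task
    "Reacher-v2": 50_000,       # simpler task
    "Reacher7DoF": 50_000,      # simpler task; non-standard ID, placeholder only
    "Hopper-v2": 150_000,       # medium-difficulty task
    "Walker2d-v2": 400_000,     # difficult task
    "HalfCheetah-v2": 400_000,  # difficult task
}

def make_env(env_id: str):
    """Instantiate one of the evaluation environments (assumed Gym IDs)."""
    return gym.make(env_id)

if __name__ == "__main__":
    # Print the assumed training budget for each task.
    for env_id, steps in TRAIN_STEPS.items():
        print(f"{env_id}: train for {steps} environment steps")
```

Network architectures and other hyper-parameters are reported only in Appendix E of the paper, so they are not reflected in this sketch.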