Sample-Efficient Reinforcement Learning via Conservative Model-Based Actor-Critic

Authors: Zhihai Wang, Jie Wang, Qi Zhou, Bin Li, Houqiang Li

AAAI 2022, pp. 8612-8620 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that CMBAC significantly outperforms state-of-the-art approaches in terms of sample efficiency on several challenging tasks, and the proposed method is more robust than previous methods in noisy environments. Experiments show that CMBAC significantly outperforms state-of-the-art methods in terms of sample efficiency on several challenging control tasks (Brockman et al. 2016; Todorov, Erez, and Tassa 2012).
Researcher Affiliation | Academia | 1) CAS Key Laboratory of Technology in GIPAS, University of Science and Technology of China; 2) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center. Emails: {zhwangx, zhouqida}@mail.ustc.edu.cn, {jiewangx, binli, lihq}@ustc.edu.cn
Pseudocode | Yes | Algorithm 1: Pseudo code for CMBAC.
Open Source Code | No | The paper does not include an explicit statement about making the source code available or provide a link to a code repository for the described methodology.
Open Datasets | Yes | We evaluate CMBAC and these baselines on MuJoCo (Todorov, Erez, and Tassa 2012) benchmark tasks as used in MBPO.
Dataset Splits | No | The paper describes the process of data collection and model usage (e.g., D_env, D_model) but does not provide specific train/validation/test dataset splits (percentages, sample counts, or citations to predefined splits) to reproduce data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies | No | The paper mentions using MuJoCo benchmark tasks and Soft Actor-Critic (SAC) for policy learning, but it does not provide specific version numbers for these or other software components/libraries (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | For all environments except Walker2d, we use the number of dropped estimates L = 1. On Walker2d, we use L = 0. For our method, we select the hyperparameter M for each environment independently via grid search. The best hyperparameter for Humanoid, Hopper, Walker2d, and the rest is M = 1, 3, 4, 2, respectively. The details of the experimental setup are in Appendix B.
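The "dropped estimates" step quoted in the Experiment Setup row can be illustrated with a minimal sketch. This is not the authors' code: only the hyperparameter L (the number of dropped estimates) comes from the quoted setup; the assumption that the Q-value estimates come from rollouts through different sampled model combinations, and the helper name conservative_q, are illustrative.

```python
# Minimal sketch (not the authors' implementation) of conservative Q-value
# aggregation: given several Q-value estimates, discard the L largest and
# average the remainder. Only L is taken from the quoted setup; how the
# estimates are produced is an assumption.
import numpy as np

def conservative_q(q_estimates: np.ndarray, num_dropped: int) -> np.ndarray:
    """Average Q estimates after dropping the `num_dropped` largest ones.

    q_estimates: array of shape (num_estimates, batch_size).
    num_dropped: the paper's L (L = 1 in most environments, L = 0 on Walker2d).
    """
    if num_dropped == 0:
        # With L = 0 (Walker2d) nothing is dropped: plain mean over estimates.
        return q_estimates.mean(axis=0)
    # Sort estimates per state-action pair and keep all but the top `num_dropped`.
    sorted_q = np.sort(q_estimates, axis=0)
    kept = sorted_q[:-num_dropped]
    return kept.mean(axis=0)

# Example: 5 estimates for a batch of 3 state-action pairs, with L = 1.
q = np.random.randn(5, 3)
target = conservative_q(q, num_dropped=1)
```

With L = 0 the function reduces to an ordinary ensemble mean; larger L makes the resulting target more pessimistic by excluding the most optimistic estimates.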