Sample-Efficient Reinforcement Learning via Conservative Model-Based Actor-Critic
Authors: Zhihai Wang, Jie Wang, Qi Zhou, Bin Li, Houqiang Li
AAAI 2022, pp. 8612-8620 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that CMBAC significantly outperforms state-of-the-art approaches in terms of sample efficiency on several challenging control tasks (Brockman et al. 2016; Todorov, Erez, and Tassa 2012), and that the proposed method is more robust than previous methods in noisy environments. |
| Researcher Affiliation | Academia | (1) CAS Key Laboratory of Technology in GIPAS, University of Science and Technology of China; (2) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center. Emails: {zhwangx, zhouqida}@mail.ustc.edu.cn, {jiewangx, binli, lihq}@ustc.edu.cn |
| Pseudocode | Yes | Algorithm 1: Pseudocode for CMBAC. |
| Open Source Code | No | The paper does not include an explicit statement about making the source code available or provide a link to a code repository for the described methodology. |
| Open Datasets | Yes | We evaluate CMBAC and these baselines on MuJoCo (Todorov, Erez, and Tassa 2012) benchmark tasks as used in MBPO. |
| Dataset Splits | No | The paper describes the data-collection and model-usage pipeline (e.g., the buffers Denv and Dmodel; see the buffer sketch after the table) but does not provide specific train/validation/test dataset splits (percentages, sample counts, or citations to predefined splits) to reproduce data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions using MuJoCo benchmark tasks and Soft Actor-Critic (SAC) for policy learning, but it does not provide specific version numbers for these or other software components/libraries (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | For all environments except Walker2d, the number of dropped estimates is L = 1; on Walker2d, L = 0. The hyperparameter M is selected per environment via grid search: M = 1 for Humanoid, M = 3 for Hopper, M = 4 for Walker2d, and M = 2 for the remaining tasks. Details of the experimental setup are in Appendix B (see the aggregation sketch after the table). |
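
For context on the Denv/Dmodel pipeline noted in the Dataset Splits row, here is a minimal sketch of an MBPO-style dual-buffer setup. The buffer sizes, the mixing ratio `real_ratio`, and all function names are illustrative assumptions, not details reported by the authors.

```python
# Minimal sketch of a dual-buffer data flow: D_env holds real transitions,
# D_model holds transitions generated by rolling out the learned model.
import random
from collections import deque

env_buffer = deque(maxlen=1_000_000)    # D_env: real transitions (assumed size)
model_buffer = deque(maxlen=1_000_000)  # D_model: imagined transitions (assumed size)

def store_env(transition):
    """Store a real (s, a, r, s') tuple collected from the environment."""
    env_buffer.append(transition)

def store_model(transition):
    """Store a transition generated by the learned dynamics model."""
    model_buffer.append(transition)

def sample_batch(batch_size, real_ratio=0.05):
    """Mix real and model data for the actor-critic update.

    real_ratio=0.05 follows common MBPO-style practice and is an
    assumption here, not a value taken from the paper.
    """
    n_real = min(int(batch_size * real_ratio), len(env_buffer))
    batch = random.sample(env_buffer, n_real)
    batch += random.sample(model_buffer, batch_size - n_real)
    return batch
```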
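The Experiment Setup row also references the hyperparameter L, the number of dropped estimates. Below is a hedged sketch of the conservative aggregation this suggests: average the Q-value estimates after discarding the largest L of them. The function name and array layout are our assumptions; only the drop-the-top-L-then-average rule comes from the setup described above.

```python
import numpy as np

def conservative_q(q_estimates: np.ndarray, num_dropped: int) -> np.ndarray:
    """Average Q-estimates after discarding the `num_dropped` largest ones.

    q_estimates: array of shape (num_estimates, batch_size), one row per
    Q-value estimate (e.g., one per sampled model combination).
    num_dropped: the paper's L; L = 1 everywhere except Walker2d (L = 0).
    """
    q_sorted = np.sort(q_estimates, axis=0)  # ascending per batch element
    if num_dropped > 0:
        q_sorted = q_sorted[:-num_dropped]   # drop the top-L estimates
    return q_sorted.mean(axis=0)             # conservative averaged estimate

# Example: 5 estimates for a batch of 3 states, dropping the largest (L = 1).
q = np.random.randn(5, 3)
target = conservative_q(q, num_dropped=1)
```

Dropping only the largest estimates (rather than taking a hard minimum over the ensemble) keeps the target conservative while discarding less information, which is consistent with the small L values reported above.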