Trust the Model When It Is Confident: Masked Model-based Actor-Critic

Authors: Feiyang Pan, Jia He, Dandan Tu, Qing He

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on continuous control benchmarks demonstrate that M2AC has strong performance even when using long model rollouts in very noisy environments, and it significantly outperforms previous state-of-the-art methods.
Researcher Affiliation | Collaboration | Feiyang Pan (1,3), Jia He (2), Dandan Tu (2), Qing He (1,3). 1: IIP, Institute of Computing Technology, Chinese Academy of Sciences. 2: Huawei EI Innovation Lab. 3: University of Chinese Academy of Sciences.
Pseudocode | Yes | Algorithm 1 ("Actual algorithm of M2AC") and Algorithm 2 ("Masked Model Rollouts for data generation"); a hedged sketch of the masked-rollout loop appears after the table.
Open Source Code | No | The paper mentions basing its implementation on the "opensourced implementation of MBPO" and refers to "opensourced benchmarks [6, 1]", but it does not provide a link to, or an explicit statement about releasing, source code for M2AC itself.
Open Datasets | Yes | Figure 3 demonstrates the results in four MuJoCo-v2 [23] environments. Noisy variants are built on HalfCheetah and Walker2d, each with three noise levels: σ = 0.05 for -Noisy0, σ = 0.1 for -Noisy1, and σ = 0.2 for -Noisy2 (see the noise-wrapper sketch after the table).
Dataset Splits | No | The paper mentions "early-stopping on a hold-out validation set", but does not specify split percentages, sample counts, or a reference to predefined standard splits for this validation set.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments.
Software Dependencies | No | The paper mentions using Soft Actor-Critic (SAC) as the base algorithm and comparing against PPO and DDPG, but it does not give version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | In the implementation of MBRL methods, the most important hyper-parameter is the maximum model rollout length H_max. In MBPO [11] the authors test and tune this parameter for every task, but we argue that in real applications we cannot know the best choice in advance, nor can we tune it. So for a fair and comprehensive comparison, we test H_max = 1, 4, 7, 10 for all the algorithms and tasks. The main experiments involve the following implementation details for M2AC: the masking uses the non-stop mode (Algorithm 2, line 12). The masking rate is set as w = 0.5 when H_max = 1, and is the decaying linear function w_h = (H_max - h) / (2(H_max + 1)) when H_max > 1 (a small helper reproducing this schedule appears after the table). The model-error penalty is set to 10^-3.
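
The Pseudocode row references Algorithm 2 ("Masked Model Rollouts"); the paper itself is the authority on the exact procedure. As a rough aid, the following is a minimal Python sketch of how a masked rollout in the quoted "non-stop" mode could look. Everything beyond the quoted setup is an assumption: the `model.predict` API returning a per-sample uncertainty score (e.g., ensemble disagreement), the masking keeping the w_h fraction of lowest-uncertainty transitions at each depth, and the penalty entering as r - 10^-3 * u. All names are hypothetical, not the authors' code.

```python
import numpy as np

# Penalty coefficient from the paper's setup (10^-3); that it enters as
# r_tilde = r - ALPHA * uncertainty is our assumption, not a quote.
ALPHA = 1e-3

def masked_rollout(model, policy, states, h_max, masking_rate, buffer):
    """Hypothetical sketch of a masked model rollout in "non-stop" mode:
    masked (high-uncertainty) transitions are not stored in the buffer,
    but the rollout continues from them rather than terminating early."""
    for h in range(h_max):
        actions = policy(states)
        # Assumed model API: one-step prediction plus a per-sample
        # uncertainty score (e.g., ensemble disagreement).
        next_states, rewards, uncertainty = model.predict(states, actions)
        rewards = rewards - ALPHA * uncertainty  # model-error penalty
        # Keep only the most confident w_h-fraction of transitions.
        w = masking_rate(h, h_max)
        keep = np.argsort(uncertainty)[: max(1, int(w * len(states)))]
        for i in keep:
            buffer.add(states[i], actions[i], rewards[i], next_states[i])
        # Non-stop mode: continue from every state, masked or not.
        states = next_states
```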
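The Open Datasets row describes noisy variants with Gaussian noise at σ = 0.05, 0.1, and 0.2. A minimal wrapper along these lines is sketched below, assuming the noise perturbs the observations returned by the environment (the paper should be consulted for exactly where the noise enters) and using the classic Gym API that matches the MuJoCo-v2 era; the wrapper name is hypothetical.

```python
import gym
import numpy as np

class GaussianNoiseWrapper(gym.Wrapper):
    """Hypothetical wrapper producing a '-Noisy' variant of a MuJoCo
    task by perturbing observations with zero-mean Gaussian noise."""

    def __init__(self, env, sigma):
        super().__init__(env)
        self.sigma = sigma

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        return obs + np.random.normal(0.0, self.sigma, size=obs.shape)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        obs = obs + np.random.normal(0.0, self.sigma, size=obs.shape)
        return obs, reward, done, info

# sigma = 0.05 (-Noisy0), 0.1 (-Noisy1), 0.2 (-Noisy2)
env = GaussianNoiseWrapper(gym.make("Walker2d-v2"), sigma=0.1)
```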
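The masking-rate schedule quoted in the Experiment Setup row is concrete enough to pin down in code. This helper reproduces w = 0.5 for H_max = 1 and the decaying linear schedule w_h = (H_max - h) / (2(H_max + 1)) otherwise; the function name is ours.

```python
def masking_rate(h, h_max):
    """Masking rate w_h per the paper's experiment setup: a constant
    0.5 when h_max == 1, else a linear function of rollout depth h
    that decays toward zero as h approaches h_max."""
    if h_max == 1:
        return 0.5
    return (h_max - h) / (2.0 * (h_max + 1))

# Example: with h_max = 10, w_0 ~= 0.45 and w_9 ~= 0.045, so deeper
# (less reliable) rollout steps contribute fewer model-generated samples.
```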