Trust the Model When It Is Confident: Masked Model-based Actor-Critic
Authors: Feiyang Pan, Jia He, Dandan Tu, Qing He
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on continuous control benchmarks demonstrate that M2AC has strong performance even when using long model rollouts in very noisy environments, and it significantly outperforms previous state-of-the-art methods. |
| Researcher Affiliation | Collaboration | Feiyang Pan (1,3), Jia He (2), Dandan Tu (2), Qing He (1,3). Affiliations: 1) IIP, Institute of Computing Technology, Chinese Academy of Sciences; 2) Huawei EI Innovation Lab; 3) University of Chinese Academy of Sciences. |
| Pseudocode | Yes | Algorithm 1 ("Actual algorithm of M2AC") and Algorithm 2 ("Masked Model Rollouts" for data generation); an illustrative sketch of the masking step appears after the table. |
| Open Source Code | No | The paper mentions basing its implementation on the "opensourced implementation of MBPO" and refers to "opensourced benchmarks [6, 1]", but does not provide a link or explicit statement about releasing the source code for their own M2AC methodology. |
| Open Datasets | Yes | Figure 3 demonstrates the results in four MuJoCo-v2 [23] environments. We implement noisy derivatives based on HalfCheetah and Walker2d, each with three levels: σ = 0.05 for -Noisy0, σ = 0.1 for -Noisy1, and σ = 0.2 for -Noisy2. (A hedged wrapper sketch for these noisy variants follows the table.) |
| Dataset Splits | No | The paper mentions "early-stopping on a hold-out validation set", but does not specify exact split percentages, sample counts, or a reference to predefined standard splits for this validation set. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using Soft-Actor-Critic (SAC) as the base algorithm and comparing against PPO and DDPG, but it does not provide specific version numbers for any software dependencies like programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | In the implementation of MBRL methods, the most important hyper-parameter is the maximum model rollout length H_max. In MBPO [11] the authors test and tune this parameter for every task, but we argue that in real applications we cannot know the best choice in advance, nor can we tune it. So for a fair and comprehensive comparison, we test H_max = 1, 4, 7, 10 for all the algorithms and tasks. The main experiments involve the following implementation details for M2AC: the masking uses the non-stop mode (Algorithm 2, line 12). The masking rate is set as w = 0.5 when H_max = 1, and is the decaying linear function w_h = (H_max − h) / (2(H_max + 1)) when H_max > 1. The model error penalty is set to 10⁻³. (A short sketch of this schedule follows the table.) |
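
The masking step referenced in the Pseudocode row (Algorithm 2) can be illustrated with a minimal Python sketch. Everything named here is a stand-in: `ensemble_predict`, `policy`, the disagreement-based uncertainty, and the buffer layout are hypothetical simplifications rather than the paper's implementation; only the keep-the-most-confident-fraction rule, the reward penalty, and the "non-stop" behavior (masked states still roll forward) follow the quoted text.

```python
import numpy as np

rng = np.random.default_rng(0)


def ensemble_predict(state, action, n_models=7):
    """Hypothetical stand-in for a learned ensemble: each member predicts
    (next_state, reward); the spread across members serves as uncertainty."""
    noise = rng.normal(scale=0.05, size=(n_models, state.size))
    next_states = state + 0.1 * action + noise
    rewards = -np.linalg.norm(next_states, axis=1)
    return next_states, rewards


def policy(state):
    """Hypothetical stand-in for the SAC policy used as the base algorithm."""
    return rng.normal(size=state.shape)


def masked_rollout(init_states, h_max=4, penalty=1e-3):
    """'Non-stop' masked rollout: at each step, store only the most confident
    fraction w_h of transitions, but keep rolling *all* states forward."""
    buffer, states = [], init_states
    for h in range(1, h_max + 1):
        # Schedule quoted in the table: w = 0.5 if H_max = 1,
        # else w_h = (H_max - h) / (2 * (H_max + 1)).
        w_h = 0.5 if h_max == 1 else (h_max - h) / (2 * (h_max + 1))
        step = []
        for s in states:
            a = policy(s)
            preds, rews = ensemble_predict(s, a)
            u = preds.std(axis=0).mean()           # ensemble disagreement
            r = rews.mean() - penalty * u          # model-error reward penalty
            step.append((s, a, r, preds.mean(axis=0), u))
        step.sort(key=lambda t: t[-1])             # most confident first
        buffer.extend(step[: max(1, int(w_h * len(step)))])  # mask the rest
        states = np.stack([t[3] for t in step])    # non-stop: all continue
    return buffer


print(len(masked_rollout(rng.normal(size=(8, 3)))))
```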
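
For the -Noisy variants in the Open Datasets row, the excerpt gives the σ levels but not where the noise enters. A minimal sketch, assuming additive white Gaussian observation noise and the older gym/mujoco-py stack that provides the -v2 tasks; the wrapper name and the observation-noise assumption are ours:

```python
import gym
import numpy as np


class GaussianObsNoise(gym.ObservationWrapper):
    """Hypothetical -NoisyX variant: adds N(0, σ²) noise to each observation."""

    def __init__(self, env, sigma):
        super().__init__(env)
        self.sigma = sigma

    def observation(self, obs):
        return obs + np.random.normal(0.0, self.sigma, size=obs.shape)


# σ = 0.05, 0.1, 0.2 give the -Noisy0, -Noisy1, -Noisy2 variants respectively.
env = GaussianObsNoise(gym.make("HalfCheetah-v2"), sigma=0.1)
```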
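
The masking schedule quoted in the Experiment Setup row is easy to tabulate; a small sketch with our own variable names. Note that w_h reaches 0 at h = H_max under a 1-based step index; the excerpt does not state the indexing convention, so that endpoint is an assumption.

```python
def masking_rate(h: int, h_max: int) -> float:
    """w = 0.5 when H_max = 1, else w_h = (H_max - h) / (2 * (H_max + 1))."""
    return 0.5 if h_max == 1 else (h_max - h) / (2 * (h_max + 1))


# The paper sweeps H_max = 1, 4, 7, 10 for all algorithms and tasks.
for h_max in (1, 4, 7, 10):
    rates = [round(masking_rate(h, h_max), 3) for h in range(1, h_max + 1)]
    print(f"H_max={h_max:2d}: w_h per step = {rates}")
```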