Deep Recurrent Belief Propagation Network for POMDPs
Authors: Yuhui Wang, Xiaoyang Tan | Pages: 10236-10244
AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of the proposed method is verified on a collection of benchmark tasks. Extensive experiments on high dimensional benchmark tasks show that our approach outperforms several state-of-the-art methods under various challenging POMDP scenarios. |
| Researcher Affiliation | Academia | Yuhui Wang, Xiaoyang Tan; College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; MIIT Key Laboratory of Pattern Analysis and Machine Intelligence; {y.wang, x.tan}@nuaa.edu.cn |
| Pseudocode | Yes | Our DRBPN algorithm is presented in Algorithm 1. |
| Open Source Code | No | The paper does not contain an explicit statement or link to the open-source code for the methodology it describes. |
| Open Datasets | Yes | We evaluated the methods on 8 benchmark simulated locomotion tasks, which are implemented in OpenAI Gym (Brockman et al. 2016) using the MuJoCo physics engine (Todorov, Erez, and Tassa 2012). (An illustrative environment-creation sketch is given after the table.) |
| Dataset Splits | No | The paper mentions running experiments with '3 random seeds on each task' and averaging results over '30 episodes (10 episodes for each of the 3 random seeds)', but it does not specify explicit train/validation/test splits or cross-validation details for partitioning a dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU or GPU models, or memory) used for its experiments; it only mentions using OpenAI Gym and the MuJoCo physics engine for simulation. |
| Software Dependencies | No | The paper mentions using 'PPO' and implementing algorithms based on 'OpenAI baselines (Dhariwal et al. 2017)', but it does not provide version numbers for these or any other key software components, which would be needed for reproducibility. |
| Experiment Setup | Yes | DRBPN adopts the same hyperparameters as the policy-search component of PPO given in (Dhariwal et al. 2017), except that an additional transition network is set up in DRBPN. The covariance of the transition is state-independent and is a parameter matrix, denoted Σ (thus Σ_t = Σ for all t). We use ReLU as the activation function. We empirically set the penalty coefficients in Eq. (13) to λ_v = 1.0, λ_m = 1.0. Each algorithm was run with 3 random seeds on each task. The algorithms are run for 1 × 10^6 timesteps. (A hedged configuration sketch follows the table.) |
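
To make the benchmark setting concrete, here is a minimal sketch of creating one OpenAI Gym locomotion environment backed by MuJoCo. The task name `HalfCheetah-v2` and the classic (pre-0.26) Gym step API are assumptions for illustration; the paper's specific task names are not listed in this report.

```python
# Minimal sketch: one MuJoCo locomotion environment in OpenAI Gym.
# "HalfCheetah-v2" is an assumed example task; the classic Gym API
# (reset() -> obs, step() -> (obs, reward, done, info)) is assumed.
import gym  # requires gym with the MuJoCo bindings installed

env = gym.make("HalfCheetah-v2")
obs = env.reset()
for _ in range(10):
    action = env.action_space.sample()  # random actions, just to exercise the loop
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```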
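
Below is a minimal sketch of the experiment configuration described in the table, written as a plain Python dataclass. The field names (e.g. `lambda_v`, `transition_hidden_sizes`) and the transition-network sizes are illustrative assumptions, not identifiers or values from the authors' code; only the penalty coefficients, activation, seed count, evaluation episodes, and timestep budget come from the stated setup.

```python
# A hedged sketch of the reported experiment configuration; field names and the
# transition-network sizes are assumptions, not the authors' identifiers.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DRBPNConfig:
    total_timesteps: int = 1_000_000           # "1 × 10^6 timesteps"
    activation: str = "relu"                   # ReLU activation
    lambda_v: float = 1.0                      # penalty coefficient λ_v in Eq. (13)
    lambda_m: float = 1.0                      # penalty coefficient λ_m in Eq. (13)
    state_independent_covariance: bool = True  # Σ_t = Σ for all t (learned parameter matrix)
    transition_hidden_sizes: List[int] = field(default_factory=lambda: [64, 64])  # assumed sizes
    seeds: List[int] = field(default_factory=lambda: [0, 1, 2])                   # 3 random seeds
    eval_episodes_per_seed: int = 10           # 30 evaluation episodes in total


if __name__ == "__main__":
    config = DRBPNConfig()
    for seed in config.seeds:
        # A real run would launch training here (e.g. a hypothetical
        # train_drbpn(config, seed)), reusing PPO's policy-search
        # hyperparameters from OpenAI Baselines.
        print(f"seed={seed}, timesteps={config.total_timesteps}")
```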