Deep Recurrent Belief Propagation Network for POMDPs

Authors: Yuhui Wang, Xiaoyang Tan (pp. 10236-10244)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The effectiveness of the proposed method is verified on a collection of benchmark tasks, showing that our approach outperforms several state-of-the-art methods under various challenging scenarios. Extensive experiments on high dimensional benchmark tasks show that our approach outperforms several state-of-the-art methods under various challenging POMDP scenarios.
Researcher Affiliation | Academia | Yuhui Wang, Xiaoyang Tan; College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; MIIT Key Laboratory of Pattern Analysis and Machine Intelligence; {y.wang, x.tan}@nuaa.edu.cn
Pseudocode | Yes | Our DRBPN algorithm is presented in Algorithm 1.
Open Source Code | No | The paper does not contain an explicit statement or link to open-source code for the methodology it describes.
Open Datasets | Yes | We evaluated the methods on 8 benchmark simulated locomotion tasks, which are implemented in OpenAI Gym (Brockman et al. 2016) using the MuJoCo physics engine (Todorov, Erez, and Tassa 2012). (See the environment sketch below the table.)
Dataset Splits | No | The paper mentions running experiments with '3 random seeds on each task' and averaging results over '30 episodes (10 episodes for every 3 random seeds)', but it does not specify explicit train/validation/test splits or cross-validation details for partitioning a dataset. (See the evaluation sketch below the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU or GPU models, memory) used for its experiments; it only mentions using OpenAI Gym and the MuJoCo physics engine for simulation.
Software Dependencies | No | The paper mentions using PPO and implementing algorithms based on OpenAI Baselines (Dhariwal et al. 2017), but it does not provide version numbers for these or any other key software components, which would be necessary for reproducibility.
Experiment Setup | Yes | DRBPN adopts the same hyperparameters for the policy search components as PPO (Dhariwal et al. 2017), except that DRBPN additionally sets up a transition network. The covariance of the transition is state-independent and is treated as a parameter matrix, denoted Σ (thus Σ_t = Σ for all t). ReLU is used as the activation function. The penalty coefficients in Eq. (13) are empirically set to λ_v = 1.0 and λ_m = 1.0. Each algorithm was run with 3 random seeds on each task, for 1 × 10^6 timesteps. (See the configuration sketch below the table.)
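
The "Open Datasets" row above refers to 8 locomotion benchmarks implemented in OpenAI Gym with the MuJoCo physics engine. Below is a minimal environment sketch that instantiates one such task and rolls out a single episode; the environment id ("HalfCheetah-v2") and the random-action placeholder policy are assumptions for illustration, since this row does not list the exact task names.

```python
# Minimal sketch: instantiate a MuJoCo locomotion task in OpenAI Gym and run
# one episode. "HalfCheetah-v2" is an assumed example id; requires mujoco-py
# and a MuJoCo installation. Uses the classic Gym API (4-tuple step()).
import gym

env = gym.make("HalfCheetah-v2")
obs = env.reset()
done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()            # random policy as a placeholder
    obs, reward, done, info = env.step(action)
    episode_return += reward
print(f"episode return: {episode_return:.1f}")
```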
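
The evaluation protocol in the "Dataset Splits" row (results averaged over 30 episodes, 10 episodes for each of 3 random seeds) could look like the evaluation sketch below. The helper names (`evaluate_policy`, `average_over_seeds`) and the policy interface are hypothetical; the paper does not describe its evaluation code.

```python
# Hedged sketch of the evaluation protocol: 3 seeds x 10 episodes, averaged.
import numpy as np

def evaluate_policy(env, policy, seed, n_episodes=10):
    """Collect per-episode returns for one random seed (hypothetical helper)."""
    env.seed(seed)
    returns = []
    for _ in range(n_episodes):
        obs, done, ep_ret = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            ep_ret += reward
        returns.append(ep_ret)
    return returns

def average_over_seeds(env, policy, seeds=(0, 1, 2)):
    """Average over 30 episodes: 10 episodes for each of the 3 seeds."""
    all_returns = [r for s in seeds for r in evaluate_policy(env, policy, s)]
    return np.mean(all_returns), np.std(all_returns)
```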
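
The "Experiment Setup" row can be summarised as the configuration sketch below, assuming PyTorch. The quoted values (λ_v = λ_m = 1.0, ReLU activation, 1 × 10^6 timesteps, 3 seeds) come from that row; the dictionary keys, the diagonal parameterisation of the state-independent covariance Σ, and the example state dimension are assumptions, not the authors' released code.

```python
# Hedged configuration sketch; values are from the "Experiment Setup" row,
# names and structure are assumptions.
import torch
import torch.nn as nn

config = {
    "policy_hyperparams": "same as PPO in OpenAI Baselines (Dhariwal et al. 2017)",
    "activation": nn.ReLU,        # activation function for the networks
    "lambda_v": 1.0,              # penalty coefficient λ_v in Eq. (13)
    "lambda_m": 1.0,              # penalty coefficient λ_m in Eq. (13)
    "total_timesteps": int(1e6),  # 1 × 10^6 timesteps per run
    "n_seeds": 3,                 # 3 random seeds per task
}

# State-independent transition covariance: one learnable matrix Σ shared across
# timesteps (Σ_t = Σ). A diagonal parameterisation via log standard deviations
# is an assumed, numerically convenient choice, not the authors' stated one.
state_dim = 17                                    # e.g. HalfCheetah observation size (assumed)
log_std = nn.Parameter(torch.zeros(state_dim))    # learnable, state-independent
Sigma = torch.diag(log_std.exp() ** 2)            # Σ_t = Σ for all t
```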