Flow-based Recurrent Belief State Learning for POMDPs

Authors: Xiaoyu Chen, Yao Mark Mu, Ping Luo, Shengbo Li, Jianyu Chen

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, we show that our methods successfully capture the complex belief states that enable multi-modal predictions as well as high quality reconstructions, and results on challenging visual-motor control tasks show that our method achieves superior performance and sample efficiency.
Researcher Affiliation | Academia | (1) Institute for Interdisciplinary Information Sciences, Tsinghua University; (2) Department of Computer Science, The University of Hong Kong; (3) School of Vehicle and Mobility, Tsinghua University; (4) Shanghai Qizhi Institute.
Pseudocode | Yes | Algorithm 1: FORBES Algorithm
Open Source Code | No | The paper does not explicitly state that the source code for their methodology is publicly available, nor does it provide a link to a code repository.
Open Datasets | Yes | We adopt the MNIST Sequence Dataset (D. De Jong, 2016), which consists of sequences of handwritten MNIST digit strokes, as well as visual-motor control tasks from the DeepMind Control Suite (Tassa et al., 2018).
Dataset Splits | No | The paper mentions training and testing on datasets but does not provide specific train/validation/test dataset splits (e.g., percentages or counts).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments, only general mentions of neural networks.
Software Dependencies | No | The paper mentions specific algorithms and optimizers used (e.g., GRU, Adam) but does not list specific software libraries or packages with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | Hyperparameters: For DMControl tasks, we pre-process images by reducing the bit depth to 5 bits and draw batches of 50 sequences of length 50 to train the FORBES model, value model, and action model using Adam (Kingma & Ba, 2014) with learning rates α0 = 5 × 10⁻⁴, α1 = 8 × 10⁻⁵, and α2 = 8 × 10⁻⁵, respectively, and scale down gradient norms that exceed 100. We clip the KL regularizers in J^Model below 3.0 free nats as in Dreamer and PlaNet. The imagination horizon is H = 15 and the same trajectories are used to update both action and value models. We compute the TD-λ targets with γ = 0.99 and λ = 0.95. As for multiple imagined trajectories, we choose N = 4 across all environments.
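
The hyperparameters quoted in the Experiment Setup row above are complete enough to restate as a configuration object. The sketch below is an illustrative reconstruction, not the authors' code (none is released per the Open Source Code row): the names `ForbesConfig` and `preprocess_obs`, and the exact pixel-preprocessing formula, are assumptions in the style of PlaNet/Dreamer pipelines.

```python
# Illustrative training configuration matching the hyperparameters quoted above.
# NOTE: names and the preprocessing formula are assumptions; the FORBES code is not public.
from dataclasses import dataclass

import numpy as np


@dataclass
class ForbesConfig:
    # Model training data
    batch_size: int = 50            # batches of 50 sequences
    sequence_length: int = 50       # each sequence has length 50
    bit_depth: int = 5              # images reduced to 5-bit depth
    # Adam learning rates for the FORBES model, value model, and action model
    model_lr: float = 5e-4          # alpha_0
    value_lr: float = 8e-5          # alpha_1
    action_lr: float = 8e-5         # alpha_2
    grad_clip_norm: float = 100.0   # scale down gradient norms that exceed 100
    free_nats: float = 3.0          # clip KL regularizers below 3.0 free nats
    # Behavior learning by latent imagination
    imagination_horizon: int = 15   # H
    discount: float = 0.99          # gamma for TD-lambda targets
    td_lambda: float = 0.95         # lambda for TD-lambda targets
    num_imagined_trajectories: int = 4  # N, same across all environments


def preprocess_obs(image: np.ndarray, bit_depth: int = 5) -> np.ndarray:
    """Reduce the bit depth of uint8 images and center them to roughly [-0.5, 0.5].

    This exact formula is an assumption borrowed from common PlaNet/Dreamer
    preprocessing; the paper only states that bit depth is reduced to 5 bits.
    """
    quantized = np.floor(image.astype(np.float32) / 2 ** (8 - bit_depth)) / 2 ** bit_depth
    return quantized - 0.5
```

A typical reproduction would pair such a configuration with tasks loaded through the DeepMind Control Suite, e.g. `dm_control.suite.load(domain_name=..., task_name=...)`, feeding rendered pixel observations through the preprocessing step before model training.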