Online Planning in POMDPs with Self-Improving Simulators
Authors: Jinke He, Miguel Suau, Hendrik Baier, Michael Kaisers, Frans A. Oliehoek
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results in two large domains show that, when integrated with POMCP, our approach allows planning with improving efficiency over time. We perform the evaluation on two large POMDPs introduced by [He et al., 2020], the Grab A Chair (GAC) domain and the Grid Traffic Control (GTC) domain, descriptions of which can be found in Appendix C.1. |
| Researcher Affiliation | Academia | 1Delft University of Technology, The Netherlands 2Centrum Wiskunde & Informatica, The Netherlands |
| Pseudocode | Yes | Algorithm 1 outlines our approach. |
| Open Source Code | No | The paper does not provide a link to source code for the described methodology. The provided arXiv link is for an extended version of the paper itself. |
| Open Datasets | Yes | We perform the evaluation on two large POMDPs introduced by [He et al., 2020], the Grab A Chair (GAC) domain and the Grid Traffic Control (GTC) domain, descriptions of which can be found in Appendix C.1. |
| Dataset Splits | No | The paper mentions training data and a test dataset, but does not specify validation sets or detailed splits (e.g., percentages or counts for training, validation, and test sets). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as CPU/GPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper mentions components like GRU and stochastic gradient descent, but does not provide specific version numbers for software dependencies or libraries (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | In all planning experiments with self-improving simulators, we start with an IALS that makes use of a completely untrained Î_θ, implemented by a GRU; after every real episode it is trained for 64 gradient steps with the accumulated data from the global simulations. The results are averaged over 2500 and 1000 individual runs for the GAC and GTC domains, respectively. ... allowing 1/64 and 1/16 seconds for each decision, respectively... We fix the number of POMCP simulations to 100 per planning step. |
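The experiment-setup quote describes a loop: after each real episode, the approximate influence predictor Î_θ is trained for 64 gradient steps on data accumulated from global simulations. The sketch below illustrates that loop's structure only; it is not the authors' implementation. The paper's Î_θ is a GRU, which is replaced here by a hypothetical single-parameter logistic model (`InfluencePredictor`), and `run_episode` is a made-up stand-in for the global simulator.

```python
import math
import random

class InfluencePredictor:
    """Hypothetical stand-in for the GRU-based influence predictor I_theta."""
    def __init__(self):
        self.w = 0.0  # single logistic weight instead of GRU parameters

    def prob(self, x):
        # predicted probability that the influence source variable is 1
        return 1.0 / (1.0 + math.exp(-self.w * x))

    def sgd_step(self, x, y, lr=0.1):
        # gradient of binary cross-entropy w.r.t. w (stochastic gradient descent,
        # as mentioned in the paper's Software Dependencies row)
        self.w -= lr * (self.prob(x) - y) * x

def run_episode(rng, length=32):
    """Hypothetical global simulation: yields (local feature, influence source) pairs."""
    return [(x, 1 if x > 0 else 0) for x in (rng.uniform(-1, 1) for _ in range(length))]

rng = random.Random(0)
predictor = InfluencePredictor()
dataset = []
for episode in range(10):                  # "real" episodes
    dataset.extend(run_episode(rng))       # accumulate data from global simulations
    for _ in range(64):                    # 64 gradient steps per episode (per the paper)
        x, y = rng.choice(dataset)
        predictor.sgd_step(x, y)

# After training, the influence-augmented local simulator (IALS) using this
# predictor would be queried by POMCP in place of the expensive global simulator.
print(predictor.w > 0)  # the predictor has learned the x>0 decision boundary
```

The key design point preserved from the quote is that training happens between episodes on the growing replay of global-simulation data, so later planning steps use a progressively better (cheaper) local simulator.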