The Geometry of Memoryless Stochastic Policy Optimization in Infinite-Horizon POMDPs

Authors: Johannes Müller, Guido Montúfar

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we use a navigation problem in a grid world to demonstrate that the polynomial programming formulation can offer a computationally feasible approach to the reward maximization problem. ... Using the modeling language JuMP and the interior point solver Ipopt we directly obtained the globally optimal solution to problem (29) ... The computations took 0.01s (on a 2 GHz Quad-Core Intel Core i5 processor).
Researcher Affiliation | Academia | Johannes Müller, Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany, jmueller@mis.mpg.de; Guido Montúfar, Department of Mathematics and Department of Statistics, UCLA, CA, USA, and Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany, montufar@math.ucla.edu
Pseudocode | Yes | Algorithm 1 Polynomial programming for POMDPs
Open Source Code | Yes | The Julia code is available in the supplements and under https://github.com/muellerjohannes/geometry-POMDPs-ICLR-2022.
Open Datasets | No | The paper uses custom-defined problem instances (e.g., a grid world) rather than established, publicly accessible datasets with formal access information. The problem parameters are described in the text.
Dataset Splits | No | The paper describes problem setups and solves them using an optimization approach, rather than training a machine learning model with distinct training/validation/test splits.
Hardware Specification | Yes | The computations took 0.01s (on a 2 GHz Quad-Core Intel Core i5 processor). ... The solver took around 0.03s consistently (on a 2 GHz Quad-Core Intel Core i5 processor).
Software Dependencies | No | The paper mentions using "JuMP" and "Ipopt" (Julia packages) and "Python Sum Of Squares", but it does not specify their version numbers.
Experiment Setup | Yes | For the toy example: "We consider state, observation, and action spaces with two elements each, as well as following deterministic transition mechanism α, observation mechanism β, and instantaneous reward r...". For the grid world: "We consider the grid world depicted in Figure 6 with 13 states and 7 observations... The four actions are {R, L, U, D}... The transitions are deterministic... Let us now consider the uniform distribution µ_s = 1/13 for s ∈ S as an initial distribution and γ = 0.999 as a discount factor."
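
The experiment setup above fixes a transition mechanism α, an observation mechanism β, a reward r, an initial distribution µ and a discount factor γ, and then optimizes over memoryless policies. As a rough illustration of the objective being optimized, the following Python sketch evaluates the discounted reward of one fixed memoryless stochastic policy. It is not the authors' Julia code. The concrete values of alpha, beta, r, mu and gamma below are hypothetical placeholders (the quoted toy example elides the actual ones), the function names are invented for this sketch, and the (1 - gamma) normalization is an assumed convention that can be dropped if an unnormalized sum is wanted.

```python
def state_policy(beta, pi):
    """Compose the observation kernel beta[s][o] with a memoryless
    policy pi[o][a] into an effective state policy tau[s][a]."""
    n_o, n_a = len(pi), len(pi[0])
    return [[sum(b_s[o] * pi[o][a] for o in range(n_o)) for a in range(n_a)]
            for b_s in beta]


def discounted_reward(alpha, beta, r, pi, mu, gamma, tol=1e-12, max_iter=1_000_000):
    """Evaluate (1 - gamma) * E[sum_t gamma^t r(s_t, a_t)] for a fixed
    memoryless policy by iterating V <- r_pi + gamma * P_pi V."""
    tau = state_policy(beta, pi)
    n_s, n_a = len(alpha), len(pi[0])
    # State-to-state Markov chain induced by the policy.
    P = [[sum(tau[s][a] * alpha[s][a][t] for a in range(n_a)) for t in range(n_s)]
         for s in range(n_s)]
    # Expected one-step reward in each state under the policy.
    r_pi = [sum(tau[s][a] * r[s][a] for a in range(n_a)) for s in range(n_s)]
    V = [0.0] * n_s
    for _ in range(max_iter):
        V_new = [r_pi[s] + gamma * sum(P[s][t] * V[t] for t in range(n_s))
                 for s in range(n_s)]
        converged = max(abs(x - y) for x, y in zip(V_new, V)) < tol
        V = V_new
        if converged:
            break
    return (1.0 - gamma) * sum(mu[s] * V[s] for s in range(n_s))


# Hypothetical 2-state, 2-observation, 2-action instance (NOT the paper's values).
# alpha[s][a] is the next-state distribution: action 0 stays, action 1 switches.
alpha = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.0, 1.0], [1.0, 0.0]]]
beta = [[1.0, 0.0], [0.0, 1.0]]   # assumed: observations reveal the state
r = [[1.0, 0.0], [0.0, 0.0]]      # assumed: reward only for staying in state 0
mu = [0.5, 0.5]                   # uniform initial distribution
gamma = 0.9                       # assumed discount factor for this sketch

pi_greedy = [[1.0, 0.0], [0.0, 1.0]]   # stay when observing 0, switch when observing 1
pi_uniform = [[0.5, 0.5], [0.5, 0.5]]  # ignore the observation entirely

reward_greedy = discounted_reward(alpha, beta, r, pi_greedy, mu, gamma)
reward_uniform = discounted_reward(alpha, beta, r, pi_uniform, mu, gamma)
print(reward_greedy, reward_uniform)   # close to 0.95 and 0.25
```

Fixed-point iteration is used here only to keep the sketch dependency-free. For a discount factor as close to 1 as the grid world's γ = 0.999 it converges slowly, and solving the linear system (I - γ P_π) V = r_π directly would be the more practical choice.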