Approximate Linear Programming for Constrained Partially Observable Markov Decision Processes
Authors: Pascal Poupart, Aarti Malhotra, Pei Pei, Kee-Eung Kim, Bongseok Goh, Michael Bowling
AAAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we describe a technique based on approximate linear programming to optimize policies in CPOMDPs. The approach outperforms a constrained version of point-based value iteration on a suite of benchmark problems. |
| Researcher Affiliation | Academia | David R. Cheriton School of Computer Science, University of Waterloo, Canada Department of Computer Science, Korean Advanced Institute of Science and Technology, Korea Department of Computing Science, University of Alberta, Canada |
| Pseudocode | Yes | Algorithm 1 Approximate LP; Algorithm 2 Iterative Approximate Linear Programming; Algorithm 3 Belief Generation |
| Open Source Code | No | The paper states that CALP and CPBVI are implemented in Matlab, but does not provide any links or explicit statements about the code being open-sourced. |
| Open Datasets | Yes | The toy, qcd and ncity problems are CPOMDP benchmarks from (Kim et al. 2011). The remaining problems are single objective POMDPs from http://www.pomdp.org that we augmented with costs. |
| Dataset Splits | No | The paper describes evaluation through simulations and convergence criteria, but it does not specify explicit train/validation/test dataset splits in the conventional sense of partitioning a static dataset. |
| Hardware Specification | Yes | Both CPBVI and CALP are implemented in Matlab without any explicit parallelization (other than the automated parallelization performed by Matlab s virtual machine) and were run on a 64bit Cent OS Linux machine with 24 cores (1.2 GHz), 132 Gb of RAM, Matlab R2013b and CPLEX 12.6. |
| Software Dependencies | Yes | Matlab R2013b and CPLEX 12.6. |
| Experiment Setup | Yes | The algorithms ran until convergence of the optimal value to 3 significant digits or a time limit of 1000 seconds (which ever occured first). The expected reward and cost were computed by running 1000 simulations of 1000 steps for CPBVI. |