Constrained Bayesian Reinforcement Learning via Approximate Linear Programming
Authors: Jongmin Lee, Youngsoo Jang, Pascal Poupart, Kee-Eung Kim
IJCAI 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide theoretical guarantees and demonstrate empirically that our approach outperforms the state of the art. 5 Experiments We conducted experiments on 3 discrete state domains and 1 continuous state domain depicted in Figure 2. |
| Researcher Affiliation | Academia | School of Computing, KAIST, Republic of Korea David R. Cheriton School of Computer Science, University of Waterloo, Canada {jmlee, ysjang}@ai.kaist.ac.kr, ppoupart@uwaterloo.ca, kekim@cs.kaist.ac.kr |
| Pseudocode | Yes | Algorithm 1 Constrained BRL via Approximate LP |
| Open Source Code | No | The paper does not provide a statement or link for open-source code for the described methodology. |
| Open Datasets | Yes | The chain domain [Dearden et al., 1998] has 5 states and 2 actions. The maze domain [Strens, 2000] has 33 rooms and three flags. The cliff domain [Sutton and Barto, 1998] has 24 states and 4 actions. The cart pole [Sutton and Barto, 1998] domain is the classical control problem to keep the pole upright. |
| Dataset Splits | No | The paper describes how beliefs were collected and averaged over trials, but does not specify explicit train/validation/test splits of a static dataset in the common sense. |
| Hardware Specification | No | As for the computation time, it took 105 minutes to construct b T(s , b |s, b, a) and 0.3 minute to solve the LP on a single CPU machine. (No specific CPU model or other hardware details are provided.) |
| Software Dependencies | No | The paper mentions solving the LP (Linear Program) and refers to CPBVI and CALP as algorithms, but it does not specify versions for any ancillary software dependencies or solvers like CPLEX that might be used for LP solving. |
| Experiment Setup | Yes | 5.1 Experimental Setup Finite State Domains For all the finite state domains, we used two kinds of structural priors, tied and semi . In both tied and semi, it is assumed that the transition dynamics are known except for the slip probabilities. We used uninformative Dirichlet priors. Approximate transitions in (10) were constructed by W(b |bsas ) exp d(b , bsas ) where σ is set to 0.5. b B were collected by uniformly random policy executed in the environment for 50 time steps. In the chain and cliff domains, we report the results of 200 trials of the first 2000 times steps with discount factor γ = 0.99. In the maze domain, we report the results of 100 tirals of the first 1000 time steps with discount factor γ = 0.95. Continuous State Domain In cart pole, the environment dynamics are modeled as a linear dynamical system, as in [Tziortziotis et al., 2013]. We use matrix-normal prior for A and inverse-Wishart prior for V. To construct finite set of beliefs b S b B, we collected 10000 states by taking random actions and quantized them into 256 representative states via K-means clustering, which constitutes b S. Then, b B was constructed by gathering beliefs during the first 100 steps at every 10 step-interval. W(b |bsas ) was defined by (13) with σ = 1. The results are averaged over 100 trials of the first 10000 steps with discount factor γ = 0.99. |