reproducibilityindex.ai

Constrained Bayesian Reinforcement Learning via Approximate Linear Programming

Authors: Jongmin Lee, Youngsoo Jang, Pascal Poupart, Kee-Eung Kim

IJCAI 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We provide theoretical guarantees and demonstrate empirically that our approach outperforms the state of the art. 5 Experiments We conducted experiments on 3 discrete state domains and 1 continuous state domain depicted in Figure 2.
Researcher Affiliation	Academia	School of Computing, KAIST, Republic of Korea David R. Cheriton School of Computer Science, University of Waterloo, Canada {jmlee, ysjang}@ai.kaist.ac.kr, ppoupart@uwaterloo.ca, kekim@cs.kaist.ac.kr
Pseudocode	Yes	Algorithm 1 Constrained BRL via Approximate LP
Open Source Code	No	The paper does not provide a statement or link for open-source code for the described methodology.
Open Datasets	Yes	The chain domain [Dearden et al., 1998] has 5 states and 2 actions. The maze domain [Strens, 2000] has 33 rooms and three ﬂags. The cliff domain [Sutton and Barto, 1998] has 24 states and 4 actions. The cart pole [Sutton and Barto, 1998] domain is the classical control problem to keep the pole upright.
Dataset Splits	No	The paper describes how beliefs were collected and averaged over trials, but does not specify explicit train/validation/test splits of a static dataset in the common sense.
Hardware Specification	No	As for the computation time, it took 105 minutes to construct b T(s , b \|s, b, a) and 0.3 minute to solve the LP on a single CPU machine. (No specific CPU model or other hardware details are provided.)
Software Dependencies	No	The paper mentions solving the LP (Linear Program) and refers to CPBVI and CALP as algorithms, but it does not specify versions for any ancillary software dependencies or solvers like CPLEX that might be used for LP solving.
Experiment Setup	Yes	5.1 Experimental Setup Finite State Domains For all the ﬁnite state domains, we used two kinds of structural priors, tied and semi . In both tied and semi, it is assumed that the transition dynamics are known except for the slip probabilities. We used uninformative Dirichlet priors. Approximate transitions in (10) were constructed by W(b \|bsas ) exp d(b , bsas ) where σ is set to 0.5. b B were collected by uniformly random policy executed in the environment for 50 time steps. In the chain and cliff domains, we report the results of 200 trials of the ﬁrst 2000 times steps with discount factor γ = 0.99. In the maze domain, we report the results of 100 tirals of the ﬁrst 1000 time steps with discount factor γ = 0.95. Continuous State Domain In cart pole, the environment dynamics are modeled as a linear dynamical system, as in [Tziortziotis et al., 2013]. We use matrix-normal prior for A and inverse-Wishart prior for V. To construct ﬁnite set of beliefs b S b B, we collected 10000 states by taking random actions and quantized them into 256 representative states via K-means clustering, which constitutes b S. Then, b B was constructed by gathering beliefs during the ﬁrst 100 steps at every 10 step-interval. W(b \|bsas ) was deﬁned by (13) with σ = 1. The results are averaged over 100 trials of the ﬁrst 10000 steps with discount factor γ = 0.99.