Reward Shaping for Model-Based Bayesian Reinforcement Learning

Authors: Hyeoneun Kim, Woosang Lim, Kanghoon Lee, Yung-Kyun Noh, Kee-Eung Kim

AAAI 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted experiments on the following five benchmark BRL domains.
Researcher Affiliation | Academia | Hyeoneun Kim, Woosang Lim, Kanghoon Lee, Yung-Kyun Noh and Kee-Eung Kim, Department of Computer Science, Korea Advanced Institute of Science and Technology, Daejeon 305-701, Korea; hekim@ai.kaist.ac.kr, quasar17@kaist.ac.kr, khlee@ai.kaist.ac.kr, nohyung@kaist.ac.kr, kekim@cs.kaist.ac.kr
Pseudocode | Yes | Algorithm 1 (the real-time heuristic search BRL algorithm), Algorithm 2 (Expand(s, b)), Algorithm 3 (Update Ancestor(s, b))
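The three algorithms named in this row are not reproduced in the paper excerpt, but they suggest the generic real-time heuristic search loop: repeatedly expand the tree from the current (state, belief) root, back up new value estimates to ancestors, then act greedily. A minimal sketch of that loop follows; the function names echo the row, while all signatures and the loop structure are assumptions:

```python
# Hypothetical skeleton of a real-time heuristic search decision step.
# expand / update_ancestor correspond to Algorithms 2 and 3 in the row;
# their signatures and select_action are assumed for illustration.

def real_time_search(root, expand, update_ancestor, select_action, num_expansions):
    """One decision step: grow the tree num_expansions times, then act greedily."""
    for _ in range(num_expansions):
        leaf = expand(root)       # expand the search tree from the root (state, belief)
        update_ancestor(leaf)     # back up new value estimates to the leaf's ancestors
    return select_action(root)    # act greedily at the root
```

Per the Experiment Setup row below, `num_expansions` would be tuned so that each timestep consumes roughly 0.1s of CPU time.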
Open Source Code | No | The paper links to the code for BAMCP (https://github.com/acguez/bamcp), a baseline used for comparison, but does not provide the source code for the method described in the paper.
Open Datasets | Yes | CHAIN (Strens 2000) consists of a linear chain of 5 states and 2 actions {a, b}, as shown in Figure 1 (a). DOUBLE-LOOP (Dearden, Friedman, and Russell 1998) consists of 9 states and 2 actions, as shown in Figure 1 (b). GRID5 (Guez, Silver, and Dayan 2012) consists of 5×5 states... GRID10 (Guez, Silver, and Dayan 2012) is a larger version of GRID5 with 10×10 states. MAZE (Dearden, Friedman, and Russell 1998) consists of 264 states and 4 actions...
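As an illustration, the CHAIN domain named in this row can be written down as a small tabular MDP. The row gives only the state and action counts; the slip probability and reward values below follow the common description of Strens (2000) and should be treated as assumptions here:

```python
# Sketch of the CHAIN benchmark as a tabular MDP. Only "5 states and
# 2 actions {a, b}" comes from the row; the slip probability 0.2, reward 2
# for action b, and reward 10 at the end of the chain for action a are
# assumed from the usual description in Strens (2000).

SLIP = 0.2  # probability that the agent's action has the other action's effect

def chain_dynamics(n_states=5):
    """Return T[s][a][s'] and R[s][a] for CHAIN (a=0: advance, b=1: reset)."""
    T = [[[0.0] * n_states for _ in range(2)] for _ in range(n_states)]
    R = [[0.0, 2.0] for _ in range(n_states)]     # action b always pays 2
    for s in range(n_states):
        fwd = min(s + 1, n_states - 1)            # action a advances, stays at the end
        if s == n_states - 1:
            R[s][0] = 10.0                        # action a pays 10 in the last state
        T[s][0][fwd] += 1 - SLIP                  # intended effect of a
        T[s][0][0]   += SLIP                      # slip: a behaves like b (reset)
        T[s][1][0]   += 1 - SLIP                  # intended effect of b
        T[s][1][fwd] += SLIP                      # slip: b behaves like a (advance)
    return T, R
```

The slip term is what makes the domain an exploration challenge: myopically, the guaranteed reward of 2 for b dominates until the agent learns the chain's far end is reachable.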
Dataset Splits | No | The paper does not provide training/test/validation dataset splits, percentages, or sample counts. It describes experiments running for a fixed number of timesteps.
Hardware Specification | No | The paper does not specify any hardware components such as GPU models, CPU models, or cloud computing instances used to run the experiments.
Software Dependencies | No | The paper does not list ancillary software with version numbers (e.g., Python, PyTorch, or specific solver versions).
Experiment Setup | Yes | Each algorithm was given the CPU time of 0.1s per timestep by adjusting the number of node expansions. [...] For BAMCP, we followed the exact settings in (Guez, Silver, and Dayan 2012): c = 3 and ϵ = 0.5 for the exploration constants in the tree search and the rollout simulation, and a maximum search-tree depth of 15 in all domains except GRID10 and MAZE, where the depth was increased to 50. For ΦKMDP, we set the number of MDP samples K = 10; for ΦBEB, β was chosen from {0.5, 1, 10, 20, 30, 50} for the best performance; the potential function was recomputed 10 times during a run. [...] We set γ = 0.95 for all domains. Top performance results are highlighted in bold face. [...] In addition, we experimented with two different priors: flat Dirichlet multinomial (FDM) with α0 = 1/|S| (Guez, Silver, and Dayan 2012) and sparse factored Dirichlet multinomial (SFDM) (Friedman and Singer 1999).
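Two ingredients of this setup can be sketched concretely: the posterior-mean transition estimate under the FDM prior with α0 = 1/|S|, and the standard potential-based shaping term F = γΦ(s') − Φ(s) (Ng, Harada, and Russell 1999) into which potentials such as ΦKMDP and ΦBEB plug. The paper's specific potentials are not reproduced here; the helper names and data layout are assumptions:

```python
# Sketch of two pieces this row names. The FDM prior places pseudo-count
# alpha0 = 1/|S| on every transition; the shaping term is the generic
# potential-based form of Ng, Harada & Russell (1999). Function names and
# the counts dictionary layout are illustrative assumptions.

GAMMA = 0.95  # discount factor used in all domains (from the row)

def fdm_posterior_mean(counts, s, a, n_states):
    """Posterior-mean transition probabilities for (s, a) under the FDM prior.

    counts maps (s, a, s') -> number of observed transitions.
    """
    alpha0 = 1.0 / n_states
    row = [counts.get((s, a, s2), 0) + alpha0 for s2 in range(n_states)]
    total = sum(row)
    return [x / total for x in row]

def shaped_reward(r, phi_s, phi_s2, gamma=GAMMA):
    """Potential-based shaping F = gamma*Phi(s') - Phi(s); preserves the optimal policy."""
    return r + gamma * phi_s2 - phi_s
```

With no observations, the posterior mean is uniform (each entry 1/|S|); as counts accumulate, it concentrates on the empirical transition frequencies, which is what makes periodic recomputation of the potential function worthwhile during a run.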