Bayesian Reinforcement Learning with Behavioral Feedback

Authors: Teakgyu Hong, Jongmin Lee, Kee-Eung Kim, Pedro A. Ortega, Daniel Lee

IJCAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental (5 experiments) | Experiments were conducted on four RL tasks (listed under Open Datasets below), and Fig. 2 shows the results in the first 5000 time steps (12000 time steps for the octopus arm). We can clearly observe that providing the agent with accurate feedback enables learning the optimal policy very quickly.
Researcher Affiliation | Academia | KAIST, Republic of Korea; University of Pennsylvania, Pennsylvania, USA
Pseudocode | Yes | Algorithm 1 (KTD posterior update) and Algorithm 2 (KTD update for reward and feedback); an illustrative sketch follows this table.
Open Source Code | No | The paper provides a link (https://code.google.com/archive/p/rl-competition/downloads), but it points to the simulation code for the 2009 RL Competition used in the experiments, not to source code for the paper's own method.
Open Datasets | Yes | Inverted pendulum [Lagoudakis and Parr, 2003], mountain car [Sutton and Barto, 1998], acrobot [Sutton and Barto, 1998], and octopus arm [Engel et al., 2005]
Dataset Splits | No | The paper describes continual learning through interaction with the environment, evaluated over time steps and episodes, but does not define explicit train/validation splits of the kind used in supervised learning.
Hardware Specification | No | The paper does not report hardware details such as CPU or GPU models or memory amounts used to run the experiments.
Software Dependencies | No | The paper mentions LBFGS and refers to "the simulation code provided in the 2009 RL competition", but does not specify version numbers for any software dependencies.
Experiment Setup | Yes | Table 1 (Experimental settings for each problem) lists, per problem, the features φ(s, a), dim(w), the prior on w, and σr. Regarding the action-selection strategy at each time step, the authors adopted a simple exploration policy motivated by LinUCB [Li et al., 2010], where the action a is selected ... with c = 2 held fixed throughout time steps (see the action-selection sketch after this table).
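
The Pseudocode row above names KTD-style posterior updates (Algorithm 1: KTD posterior update; Algorithm 2: KTD update for reward and feedback). As a minimal illustration only, and not the authors' exact algorithm, the sketch below performs a linear-Gaussian Kalman posterior update of a weight vector given one scalar observation (e.g. a reward or a behavioral-feedback signal). The function name, its signature, and the linear observation model y ≈ φ(s, a)ᵀw are assumptions made for the sketch.

```python
import numpy as np

def kalman_posterior_update(mu, Sigma, phi, y, obs_noise_var):
    """Illustrative linear-Gaussian posterior update (not the paper's exact KTD algorithm).

    mu, Sigma     -- prior mean (d,) and covariance (d, d) over the weights w
    phi           -- feature vector phi(s, a), shape (d,); observation model y ~ N(phi @ w, obs_noise_var)
    y             -- observed scalar, e.g. a reward or a feedback signal
    obs_noise_var -- observation noise variance (sigma_r squared in the paper's Table 1 notation)
    """
    y_hat = phi @ mu                               # predicted observation
    innovation = y - y_hat                         # prediction error
    s = phi @ Sigma @ phi + obs_noise_var          # innovation variance
    K = Sigma @ phi / s                            # Kalman gain
    mu_new = mu + K * innovation                   # posterior mean
    Sigma_new = Sigma - np.outer(K, phi @ Sigma)   # posterior covariance
    return mu_new, Sigma_new
```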
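
The Experiment Setup row quotes a LinUCB-motivated exploration rule with c = 2. Since the quoted sentence is truncated, the following is only a plausible reading: among the currently available actions, pick the one maximizing the optimistic score φ(s, a)ᵀμ + c·sqrt(φ(s, a)ᵀΣ φ(s, a)). The names select_action and featurize are hypothetical helpers introduced for the sketch.

```python
import numpy as np

def select_action(mu, Sigma, candidate_actions, featurize, c=2.0):
    """Pick the action with the highest optimistic (LinUCB-style) score.

    mu, Sigma         -- current posterior mean / covariance over the weights
    candidate_actions -- iterable of actions available in the current state
    featurize         -- hypothetical helper mapping an action a to phi(s, a)
    c                 -- exploration coefficient; the paper fixes c = 2
    """
    best_action, best_score = None, -np.inf
    for a in candidate_actions:
        phi = featurize(a)
        score = phi @ mu + c * np.sqrt(phi @ Sigma @ phi)
        if score > best_score:
            best_action, best_score = a, score
    return best_action
```

In this reading, the bonus term dominates early on, while the posterior covariance Σ is still wide, and shrinks as the posterior concentrates, so exploration fades automatically over time.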