Bayesian Reinforcement Learning with Behavioral Feedback

Authors: Teakgyu Hong, Jongmin Lee, Kee-Eung Kim, Pedro A. Ortega, Daniel Lee

IJCAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental (5 experiments) | Experiments were conducted on four RL tasks (listed under Open Datasets below), and Fig. 2 shows the results in the first 5000 time steps (12000 time steps for the octopus arm). We can clearly observe that providing the agent with accurate feedback enables learning the optimal policy very quickly.
Researcher Affiliation | Academia | KAIST, Republic of Korea; University of Pennsylvania, Pennsylvania, USA
Pseudocode | Yes | Algorithm 1 (KTD posterior update) and Algorithm 2 (KTD update for reward and feedback); an illustrative sketch follows this table.
Open Source Code | No | The paper provides a link (https://code.google.com/archive/p/rl-competition/downloads), but it points to the simulation code for the 2009 RL Competition used in the experiments, not to source code for the paper's own method.
Open Datasets | Yes | Inverted pendulum [Lagoudakis and Parr, 2003], mountain car [Sutton and Barto, 1998], acrobot [Sutton and Barto, 1998], and octopus arm [Engel et al., 2005]
Dataset Splits | No | The paper describes continual learning through interaction with the environment, evaluated over time steps and episodes, but does not define explicit train/validation splits of the kind used in supervised learning.
Hardware Specification | No | The paper does not report hardware details such as CPU or GPU models or memory amounts used to run the experiments.
Software Dependencies | No | The paper mentions LBFGS and refers to "the simulation code provided in the 2009 RL competition", but does not specify version numbers for any software dependencies.
Experiment Setup | Yes | Table 1 (Experimental settings for each problem) lists, per problem, the features φ(s, a), dim(w), the prior on w, and σr. Regarding the action-selection strategy at each time step, the authors adopted a simple exploration policy motivated by LinUCB [Li et al., 2010], where the action a is selected ... with c = 2 held fixed throughout time steps (see the action-selection sketch after this table).
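
The Pseudocode row above names KTD-style posterior updates (Algorithm 1: KTD posterior update; Algorithm 2: KTD update for reward and feedback). As a minimal illustration only, and not the authors' exact algorithm, the sketch below performs a linear-Gaussian Kalman posterior update of a weight vector given one scalar observation (e.g. a reward or a behavioral-feedback signal). The function name, its signature, and the linear observation model y ≈ φ(s, a)ᵀw are assumptions made for the sketch.

```python
import numpy as np

def kalman_posterior_update(mu, Sigma, phi, y, obs_noise_var):
    """Illustrative linear-Gaussian posterior update (not the paper's exact KTD algorithm).

    mu, Sigma     -- prior mean (d,) and covariance (d, d) over the weights w
    phi           -- feature vector phi(s, a), shape (d,); observation model y ~ N(phi @ w, obs_noise_var)
    y             -- observed scalar, e.g. a reward or a feedback signal
    obs_noise_var -- observation noise variance (sigma_r squared in the paper's Table 1 notation)
    """
    y_hat = phi @ mu                               # predicted observation
    innovation = y - y_hat                         # prediction error
    s = phi @ Sigma @ phi + obs_noise_var          # innovation variance
    K = Sigma @ phi / s                            # Kalman gain
    mu_new = mu + K * innovation                   # posterior mean
    Sigma_new = Sigma - np.outer(K, phi @ Sigma)   # posterior covariance
    return mu_new, Sigma_new
```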
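
The Experiment Setup row quotes a LinUCB-motivated exploration rule with c = 2. Since the quoted sentence is truncated, the following is only a plausible reading: among the currently available actions, pick the one maximizing the optimistic score φ(s, a)ᵀμ + c·sqrt(φ(s, a)ᵀΣ φ(s, a)). The names select_action and featurize are hypothetical helpers introduced for the sketch.

```python
import numpy as np

def select_action(mu, Sigma, candidate_actions, featurize, c=2.0):
    """Pick the action with the highest optimistic (LinUCB-style) score.

    mu, Sigma         -- current posterior mean / covariance over the weights
    candidate_actions -- iterable of actions available in the current state
    featurize         -- hypothetical helper mapping an action a to phi(s, a)
    c                 -- exploration coefficient; the paper fixes c = 2
    """
    best_action, best_score = None, -np.inf
    for a in candidate_actions:
        phi = featurize(a)
        score = phi @ mu + c * np.sqrt(phi @ Sigma @ phi)
        if score > best_score:
            best_action, best_score = a, score
    return best_action
```

In this reading, the bonus term dominates early on, while the posterior covariance Σ is still wide, and shrinks as the posterior concentrates, so exploration fades automatically over time.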