Provably Feedback-Efficient Reinforcement Learning via Active Reward Learning

Authors: Dingwen Kong, Lin Yang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this paper, we focus on addressing this issue from a theoretical perspective, aiming to provide provably feedback-efficient algorithmic frameworks that take human-in-the-loop to specify rewards of given tasks."
Researcher Affiliation | Academia | "Dingwen Kong, School of Mathematical Sciences, Peking University, dingwenk@pku.edu.cn; Lin F. Yang, Department of Electrical and Computer Engineering, University of California, Los Angeles, linyang@ee.ucla.edu"
Pseudocode | Yes | "Algorithm 1 Active Reward Learning(Z, ε, δ)" (a generic query-rule sketch is given after this table)
Open Source Code | Yes | "The source code is included in the supplementary material. One may run Figure1.m and Figure2.m to reproduce the results in Figure 1 and Figure 2, respectively."
Open Datasets | No | "We consider a tabular MDP with linear reward. The details of the experiments are deferred to Appendix A. Here we highlight three main points derived from the experiment."
Dataset Splits | No | "We train for K = 2000 episodes for the first phase, and run 100 trials."
Hardware Specification | No | "The amount of compute is negligible since the environment is very small. Our results can be easily reproduced in a personal laptop."
Software Dependencies | No | "The source code is written in MATLAB."
Experiment Setup | Yes | "The dimension of the linear MDP is d = 2. The horizon H = 5. The action space has |A| = 2 actions. The feature map ϕ(s, a) is defined as ϕ(s, a) = [s, a] where s is the state and a is the action. For the tabular case, we set the number of states S = 100." (see the setup sketch after this table)
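
The environment quoted in the Experiment Setup row can be written down almost directly. The MATLAB sketch below instantiates the stated sizes (d = 2, H = 5, |A| = 2, S = 100) and the feature map ϕ(s, a) = [s, a]; the reward parameter theta_star, the state/action encodings, and the uniformly random trajectory at the end are illustrative assumptions, since the excerpt does not specify them.

% Sketch of the experiment setup quoted above: a tabular MDP with a linear
% reward, feature map phi(s, a) = [s, a], d = 2, H = 5, |A| = 2, S = 100.
% theta_star (the unknown reward parameter) and the uniformly random
% trajectory used at the end are illustrative assumptions, not taken from the paper.
S = 100;    % number of states (tabular case)
A = 2;      % number of actions
H = 5;      % horizon
d = 2;      % dimension of the linear reward model

phi        = @(s, a) [s; a];                  % feature map phi(s, a) = [s, a]
theta_star = randn(d, 1);                     % assumed reward parameter
reward     = @(s, a) phi(s, a)' * theta_star; % linear reward r(s, a) = phi(s, a)' * theta*

% Feature matrix over all (state, action) pairs, one row per pair.
Phi = zeros(S * A, d);
for s = 1:S
    for a = 1:A
        Phi((s - 1) * A + a, :) = phi(s, a)';
    end
end

% Return of one length-H trajectory with uniformly random states and actions
% (stand-in dynamics; the true transition model is not given in the excerpt).
traj_return = 0;
for h = 1:H
    s = randi(S);
    a = randi(A);
    traj_return = traj_return + reward(s, a);
end
fprintf('return of one random length-%d trajectory: %.3f\n', H, traj_return);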
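
The Pseudocode row only names Algorithm 1, Active Reward Learning(Z, ε, δ), without reproducing it. The sketch below is therefore a generic uncertainty-triggered query loop for a linear reward model, not the paper's algorithm: the ridge regularizer lambda, the query threshold eps_q, the simulated human labeler, and the random features are assumptions made purely for illustration.

% Generic sketch of an uncertainty-triggered reward-query loop for a linear
% reward model, in the spirit of "Active Reward Learning(Z, eps, delta)".
% This is NOT the paper's Algorithm 1: lambda, eps_q, the simulated labeler,
% and the random features are assumptions made for illustration only.
d      = 2;      % feature dimension
lambda = 1.0;    % ridge regularizer (assumed)
eps_q  = 0.5;    % query threshold on the confidence width (assumed)

theta_star  = randn(d, 1);                                   % unknown reward parameter
human_label = @(phi_sa) phi_sa' * theta_star + 0.01 * randn; % noisy human feedback (simulated)

Lambda = lambda * eye(d);   % regularized design matrix
b      = zeros(d, 1);       % accumulated feature-weighted labels
num_queries = 0;

for t = 1:2000
    phi_sa = randn(d, 1);                        % feature of the visited (s, a) (simulated)
    width  = sqrt(phi_sa' * (Lambda \ phi_sa));  % confidence width at phi_sa
    if width > eps_q                             % ask the human only when uncertain
        y = human_label(phi_sa);
        Lambda = Lambda + phi_sa * phi_sa';
        b      = b + phi_sa * y;
        num_queries = num_queries + 1;
    end
end

theta_hat = Lambda \ b;     % least-squares reward estimate from queried labels
fprintf('queries used: %d, estimation error: %.3f\n', num_queries, norm(theta_hat - theta_star));

The point the sketch illustrates is only the general mechanism: labels are requested where the current estimate is uncertain, so the number of human queries can grow much more slowly than the number of episodes.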