Provably Feedback-Efficient Reinforcement Learning via Active Reward Learning
Authors: Dingwen Kong, Lin F. Yang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we focus on addressing this issue from a theoretical perspective, aiming to provide provably feedback-efficient algorithmic frameworks that take human-in-the-loop to specify rewards of given tasks. |
| Researcher Affiliation | Academia | Dingwen Kong School of Mathematical Sciences Peking University dingwenk@pku.edu.cn Lin F. Yang Department of Electrical and Computer Engineering University of California, Los Angeles linyang@ee.ucla.edu |
| Pseudocode | Yes | Algorithm 1 Active Reward Learning(Z, ε, δ) (see the Python sketch after the table) |
| Open Source Code | Yes | The source code is included in the supplementary material. One may run Figure1.m and Figure2.m to reproduce the results in Figure 1 and Figure 2, respectively. |
| Open Datasets | No | We consider a tabular MDP with linear reward. The details of the experiments are deferred to Appendix A. Here we highlight three main points derived from the experiment. |
| Dataset Splits | No | We train for K = 2000 episodes for the first phase, and run 100 trials. |
| Hardware Specification | No | The amount of compute is negligible since the environment is very small. Our results can be easily reproduced in a personal laptop. |
| Software Dependencies | No | The source code is written in MATLAB. |
| Experiment Setup | Yes | The dimension of the linear MDP is d = 2. The horizon H = 5. The action space has |A| = 2 actions. The feature map ϕ(s, a) is defined as ϕ(s, a) = [s, a] where s is the state and a is the action. For the tabular case, we set the number of states S = 100. |
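
The Pseudocode row refers to Algorithm 1, Active Reward Learning(Z, ε, δ). The paper's exact procedure is not reproduced here; the snippet below is a minimal Python sketch of a generic uncertainty-thresholded query rule in that spirit, assuming a linear reward model, a ridge-regularized design matrix, and an illustrative threshold built from ε and δ. The names `query_oracle`, `lam`, and `threshold`, and the specific threshold formula, are assumptions rather than the authors' notation.

```python
import numpy as np

def active_reward_learning(Z, eps, delta, query_oracle, d):
    """Hedged sketch of an uncertainty-based query rule.

    Z            : iterable of feature vectors phi(s, a) gathered during exploration
    eps, delta   : target accuracy and failure probability (assumed roles)
    query_oracle : callable phi -> noisy human reward label (hypothetical helper)
    d            : feature dimension
    """
    lam = 1.0                      # ridge parameter (assumed)
    Sigma = lam * np.eye(d)        # regularized design matrix
    X, y = [], []

    # Stand-in threshold; the paper's constant depends on eps, delta, and problem scale.
    threshold = eps / np.sqrt(np.log(1.0 / delta) + 1.0)

    for phi in Z:
        phi = np.asarray(phi, dtype=float)
        # Elliptical-potential "uncertainty" of the current reward estimate at phi.
        bonus = np.sqrt(phi @ np.linalg.solve(Sigma, phi))
        if bonus > threshold:      # ask for human feedback only when uncertain
            X.append(phi)
            y.append(query_oracle(phi))
            Sigma += np.outer(phi, phi)

    # Ridge estimate of the linear reward parameter from the queried labels only.
    if X:
        X, y = np.array(X), np.array(y)
        theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    else:
        theta_hat = np.zeros(d)
    return theta_hat
```

The design point this sketch illustrates is that a human label is requested only when the uncertainty bonus for a candidate feature exceeds the threshold, which is what keeps the number of feedback queries small relative to the number of environment interactions.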
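For the Experiment Setup row, the stated quantities (d = 2, H = 5, |A| = 2, S = 100, φ(s, a) = [s, a], K = 2000 episodes, 100 trials) can be collected into a small configuration. The snippet below is a Python sketch of that setup, not the authors' MATLAB scripts (Figure1.m, Figure2.m); the reward parameter, noise-free reward, and random seed are placeholder assumptions.

```python
import numpy as np

# Reported configuration from the paper's experiment setup.
d = 2            # dimension of the linear reward
H = 5            # horizon
S = 100          # number of states (tabular case)
A = 2            # number of actions
K = 2000         # episodes in the first phase
n_trials = 100   # independent trials

def phi(s, a):
    """Feature map phi(s, a) = [s, a] as described in the setup."""
    return np.array([s, a], dtype=float)

rng = np.random.default_rng(0)        # seed is an assumption
theta_star = rng.normal(size=d)       # unknown linear reward parameter (placeholder)

def reward(s, a):
    # Linear reward r(s, a) = <phi(s, a), theta*>; any noise model is assumed.
    return phi(s, a) @ theta_star
```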