The Sample Complexity of Teaching by Reinforcement on Q-Learning
Authors: Xuezhou Zhang, Shubham Bharti, Yuzhe Ma, Adish Singla, Xiaojin Zhu
AAAI 2021, pages 10939-10947
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We study the sample complexity of teaching, termed as teaching dimension (TDim) in the literature, for the teaching-by-reinforcement paradigm, where the teacher guides the student through rewards. This is distinct from the teaching-by-demonstration paradigm motivated by robotics applications, where the teacher teaches by providing demonstrations of state/action trajectories. The teaching-by-reinforcement paradigm applies to a wider range of real-world settings where a demonstration is inconvenient, but has not been studied systematically. In this paper, we focus on a specific family of reinforcement learning algorithms, Q-learning, and characterize the TDim under different teachers with varying control power over the environment, and present matching optimal teaching algorithms. Our TDim results provide the minimum number of samples needed for reinforcement learning, and we discuss their connections to standard PAC-style RL sample complexity and teaching-by-demonstration sample complexity results. |
| Researcher Affiliation | Academia | Xuezhou Zhang¹, Shubham Bharti¹, Yuzhe Ma¹, Adish Singla² and Xiaojin Zhu¹; ¹UW Madison, ²MPI-SWS |
| Pseudocode | Yes | Algorithm 1: Machine Teaching Protocol on Q-learning |
| Open Source Code | No | The paper does not provide any statement or link regarding the availability of open-source code for the described methodology. |
| Open Datasets | No | The paper is theoretical and does not mention any specific dataset used for training or a link/citation to a public dataset. |
| Dataset Splits | No | The paper is theoretical and does not describe experimental validation sets or splits. |
| Hardware Specification | No | The paper is theoretical and does not describe specific hardware used for experiments. |
| Software Dependencies | No | The paper does not provide specific software names with version numbers for replication. |
| Experiment Setup | No | The paper is theoretical and describes conceptual parameters of MDPs and Q-learning, but does not provide specific experimental setup details like hyperparameter values for an empirical study. |
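
The teaching-by-reinforcement setting described in the Research Type row can be illustrated with a small sketch. This is not the paper's Algorithm 1 and not one of its optimal teachers; it is a toy example under stated assumptions: a greedy tabular Q-learner, a hypothetical teacher rule that pays +1 for the target action and -1 otherwise, a made-up environment that visits states in a fixed cycle, and arbitrarily chosen `alpha`, `gamma`, and `steps`. The function names `teach_by_reward` and `greedy_policy` are invented for illustration.

```python
def teach_by_reward(target_policy, n_actions, steps=30, alpha=0.5, gamma=0.4):
    """Toy teacher: reward a greedy tabular Q-learner toward target_policy."""
    n_states = len(target_policy)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(steps):
        # Student acts greedily; ties go to the lowest action index.
        a = max(range(n_actions), key=lambda k: Q[s][k])
        # Assumed teacher rule: +1 for the target action, -1 for any other.
        r = 1.0 if a == target_policy[s] else -1.0
        # Assumed toy environment: states are visited in a fixed cycle.
        s_next = (s + 1) % n_states
        # Standard tabular Q-learning update using the teacher's reward.
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next
    return Q


def greedy_policy(Q):
    """Return the greedy action for each state (lowest index on ties)."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in Q]


target = [2, 0, 1]
Q = teach_by_reward(target, n_actions=3)
learned = greedy_policy(Q)
print(learned)
```

In this sketch the discount is kept small so that a penalized action's value stays negative even after bootstrapping from the next state, which lets the greedy student move on to an untried action at its next visit. The number of reward samples the teacher spends before the greedy policy matches the target is the kind of quantity that a teaching dimension bounds, although this naive rule makes no claim to be sample-optimal.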